Compare View

Commits (7)
  • `indexer/product_enrich.py`, rather than adding another layer of checks on top.
    
    There are two root causes: the cache key is reused by content, but the cached value
    still carries the old product's `id/title_input`; and the internal analysis results
    have historically mixed `tags` and `enriched_tags`. So once a stale cache entry is
    hit, `build_index_content_fields()` throws the result away because the `id` does not
    match, and the external output ends up entirely empty.
    
    The handling is now:
    - Internal analysis results consistently use `tags` as the LLM/cache-layer field.
    - Externally, the field is mapped to `enriched_tags` only when
      `build_index_content_fields()` wraps the result, and `enriched_attributes`
      likewise always emits `name="enriched_tags"`.
    - Cache reads are normalized first: `enriched_tags` in legacy cache entries is
      accepted as internal `tags`, and a cache hit is rebound to the current request
      product's `id/title_input`.
    - Cache writes also store the normalized internal structure, and empty content is
      no longer written to the cache.
    tangwang
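The read-path normalization described above can be sketched roughly as follows. The helper name `normalize_cached_analysis` and the exact field handling are illustrative assumptions, not the real `indexer/product_enrich.py` API:

```python
from typing import Any, Dict, Optional

def normalize_cached_analysis(cached: Dict[str, Any],
                              product: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Hypothetical sketch: normalize a cache hit before reuse."""
    result = dict(cached)
    # Legacy cache entries stored the LLM field as "enriched_tags";
    # map it back to the internal "tags" name.
    if "tags" not in result and "enriched_tags" in result:
        result["tags"] = result.pop("enriched_tags")
    # The cache key is content-based, so the hit may carry another product's
    # identity; rebind id/title_input to the product being enriched right now.
    result["id"] = str(product.get("id") or product.get("spu_id") or "").strip()
    result["title_input"] = str(product.get("title") or "").strip()
    # Empty payloads are useless downstream; treat them as a cache miss.
    meaningful = ("tags", "target_audience", "usage_scene", "season",
                  "key_attributes", "material", "features", "anchor_text")
    if not any(str(result.get(f) or "").strip() for f in meaningful):
        return None
    return result
```

The same normalization applied before writing keeps old and new cache entries structurally identical.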
     
  • `scripts/evaluation/eval_framework/constants.py`: 500 → 200
    In rebuild mode, the `rerank_score: 1.0` assigned to rank <= recall_n still keys off that K.
    2. LLM batch bounds
    Min batches: DEFAULT_REBUILD_MIN_LLM_BATCHES 20 → 10
    Max batches: still 40 (unchanged)
    3. Early-stop condition (_annotate_rebuild_batches)
    After min_batches have completed, for each batch:
    
    A batch with no Exact labels (exact_n == 0) that also meets either condition counts as a bad batch:
    irrelevant_ratio >= 0.94
    or (irrelevant + Low Relevant) / n >= 0.96 (weak relevance uses RELEVANCE_LOW)
    Early stop after 2 consecutive bad batches (previously 3 consecutive with irrelevant >
    0.92).
    
    Batch logs now include low_ratio and irrelevant_plus_low_ratio; rebuild
    metadata now includes rebuild_irrel_low_combined_stop_ratio.
    
    4. CLI
    --search-recall-top-k help text now states default 200
    --rebuild-min-batches help text now states default 10
    --rebuild-irrelevant-stop-ratio / --rebuild-irrelevant-stop-streak
    help text matches the new logic
    New: --rebuild-irrel-low-combined-stop-ratio (default 0.96)
    tangwang
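The early-stop rule above can be sketched as follows. Names (`BatchStats`, `is_bad_batch`, `should_early_stop`) are illustrative, not the real `_annotate_rebuild_batches` internals; the thresholds mirror the defaults in this commit message, and whether the min_batches-th batch itself can count toward the streak is an assumption:

```python
from dataclasses import dataclass

IRRELEVANT_STOP_RATIO = 0.94          # --rebuild-irrelevant-stop-ratio
IRREL_LOW_COMBINED_STOP_RATIO = 0.96  # --rebuild-irrel-low-combined-stop-ratio
STOP_STREAK = 2                       # --rebuild-irrelevant-stop-streak
MIN_BATCHES = 10                      # DEFAULT_REBUILD_MIN_LLM_BATCHES

@dataclass
class BatchStats:
    n: int            # items judged in this batch
    exact_n: int      # Exact Match labels
    irrelevant_n: int # Irrelevant labels
    low_n: int        # Low Relevant labels (RELEVANCE_LOW)

def is_bad_batch(b: BatchStats) -> bool:
    """No Exact in the batch, and one of the two ratio thresholds is met."""
    if b.n == 0 or b.exact_n > 0:
        return False
    irrelevant_ratio = b.irrelevant_n / b.n
    combined_ratio = (b.irrelevant_n + b.low_n) / b.n
    return (irrelevant_ratio >= IRRELEVANT_STOP_RATIO
            or combined_ratio >= IRREL_LOW_COMBINED_STOP_RATIO)

def should_early_stop(batches: list) -> bool:
    """Stop once STOP_STREAK consecutive bad batches occur after MIN_BATCHES."""
    streak = 0
    for i, b in enumerate(batches):
        if i < MIN_BATCHES:
            continue  # never stop before min_batches have run
        streak = streak + 1 if is_bad_batch(b) else 0
        if streak >= STOP_STREAK:
            return True
    return False
```

A single non-bad batch resets the streak, so two isolated dead batches separated by a productive one do not stop the run.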
     
docs/Usage-Guide.md
@@ -202,7 +202,7 @@ python -m pytest -q tests/test_rerank_client.py tests/test_es_query_builder.py t
 ./scripts/service_ctl.sh restart backend
 sleep 3
 ./scripts/service_ctl.sh status backend
-./scripts/evaluation/quick_start_eval.sh batch
+./scripts/evaluation/start_eval.sh batch
 ```
 
 Offline batch evaluation writes annotations and reports to `artifacts/search_evaluation/` (SQLite, JSON/Markdown under `batch_reports/`, etc.). See [scripts/evaluation/README.md](../scripts/evaluation/README.md) for details and commands.
docs/issue-2026-03-31-评估框架-done-0331.md
@@ -138,7 +138,7 @@ queries defaults to queries/queries.txt; they fill the left list box, and clicking any
 
 
 @scripts/evaluation/README.md @scripts/evaluation/eval_framework/framework.py
-@quick_start_eval.sh (29-35)
+@start_eval.sh (29-35)
 Use the following flow as the basis for the rework:
 For a rebuild, for each query:
 each search should scan the full corpus,
docs/相关性检索优化说明.md
@@ -240,7 +240,7 @@ python -m pytest -q tests/test_rerank_client.py tests/test_es_query_builder.py t
 ./scripts/service_ctl.sh restart backend
 sleep 3
 ./scripts/service_ctl.sh status backend
-./scripts/evaluation/quick_start_eval.sh batch
+./scripts/evaluation/start_eval.sh batch
 ```
 
 Evaluation artifacts live in `artifacts/search_evaluation/` (e.g. `search_eval.sqlite3`, JSON/Markdown under `batch_reports/`). See [scripts/evaluation/README.md](../scripts/evaluation/README.md) for flow and parameter notes.
indexer/product_enrich.py
@@ -147,15 +147,40 @@ if _missing_prompt_langs:
 # Multi-value field separators: ASCII comma, full-width comma, 、, plus the legacy ; | / and whitespace
 _MULTI_VALUE_FIELD_SPLIT_RE = re.compile(r"[,、,;|/\n\t]+")
 _CORE_INDEX_LANGUAGES = ("zh", "en")
-_ENRICHED_ATTRIBUTE_DIMENSIONS = (
-    "enriched_tags",
+_ANALYSIS_ATTRIBUTE_FIELD_MAP = (
+    ("tags", "enriched_tags"),
+    ("target_audience", "target_audience"),
+    ("usage_scene", "usage_scene"),
+    ("season", "season"),
+    ("key_attributes", "key_attributes"),
+    ("material", "material"),
+    ("features", "features"),
+)
+_ANALYSIS_RESULT_FIELDS = (
+    "title",
+    "category_path",
+    "tags",
+    "target_audience",
+    "usage_scene",
+    "season",
+    "key_attributes",
+    "material",
+    "features",
+    "anchor_text",
+)
+_ANALYSIS_MEANINGFUL_FIELDS = (
+    "tags",
     "target_audience",
     "usage_scene",
     "season",
     "key_attributes",
     "material",
     "features",
+    "anchor_text",
 )
+_ANALYSIS_FIELD_ALIASES = {
+    "tags": ("tags", "enriched_tags"),
+}
 
 
 def split_multi_value_field(text: Optional[str]) -> List[str]:
@@ -195,25 +220,104 @@ def _append_enriched_attribute(
         target.append({"name": name, "value": {lang: value}})
 
 
+def _get_product_id(product: Dict[str, Any]) -> str:
+    return str(product.get("id") or product.get("spu_id") or "").strip()
+
+
+def _get_analysis_field_aliases(field_name: str) -> Tuple[str, ...]:
+    return _ANALYSIS_FIELD_ALIASES.get(field_name, (field_name,))
+
+
+def _get_analysis_field_value(row: Dict[str, Any], field_name: str) -> Any:
+    for alias in _get_analysis_field_aliases(field_name):
+        if alias in row:
+            return row.get(alias)
+    return None
+
+
+def _has_meaningful_value(value: Any) -> bool:
+    if value is None:
+        return False
+    if isinstance(value, str):
+        return bool(value.strip())
+    if isinstance(value, dict):
+        return any(_has_meaningful_value(v) for v in value.values())
+    if isinstance(value, list):
+        return any(_has_meaningful_value(v) for v in value)
+    return bool(value)
+
+
+def _make_empty_analysis_result(
+    product: Dict[str, Any],
+    target_lang: str,
+    error: Optional[str] = None,
+) -> Dict[str, Any]:
+    result = {
+        "id": _get_product_id(product),
+        "lang": target_lang,
+        "title_input": str(product.get("title") or "").strip(),
+    }
+    for field in _ANALYSIS_RESULT_FIELDS:
+        result[field] = ""
+    if error:
+        result["error"] = error
+    return result
+
+
+def _normalize_analysis_result(
+    result: Dict[str, Any],
+    product: Dict[str, Any],
+    target_lang: str,
+) -> Dict[str, Any]:
+    normalized = _make_empty_analysis_result(product, target_lang)
+    if not isinstance(result, dict):
+        return normalized
+
+    normalized["lang"] = str(result.get("lang") or target_lang).strip() or target_lang
+    normalized["title"] = str(result.get("title") or "").strip()
+    normalized["category_path"] = str(result.get("category_path") or "").strip()
+    normalized["title_input"] = str(
+        product.get("title") or result.get("title_input") or ""
+    ).strip()
+
+    for field in _ANALYSIS_RESULT_FIELDS:
+        if field in {"title", "category_path"}:
+            continue
+        normalized[field] = str(_get_analysis_field_value(result, field) or "").strip()
+
+    if result.get("error"):
+        normalized["error"] = str(result.get("error"))
+    return normalized
+
+
+def _has_meaningful_analysis_content(result: Dict[str, Any]) -> bool:
+    return any(_has_meaningful_value(result.get(field)) for field in _ANALYSIS_MEANINGFUL_FIELDS)
+
+
 def _apply_index_content_row(result: Dict[str, Any], row: Dict[str, Any], lang: str) -> None:
     if not row or row.get("error"):
         return
 
-    anchor_text = str(row.get("anchor_text") or "").strip()
+    anchor_text = str(_get_analysis_field_value(row, "anchor_text") or "").strip()
     if anchor_text:
         _append_lang_phrase_map(result["qanchors"], lang=lang, raw_value=anchor_text)
 
-    for name in _ENRICHED_ATTRIBUTE_DIMENSIONS:
-        raw = row.get(name)
+    for source_name, output_name in _ANALYSIS_ATTRIBUTE_FIELD_MAP:
+        raw = _get_analysis_field_value(row, source_name)
         if not raw:
             continue
-        _append_enriched_attribute(result["enriched_attributes"], name=name, lang=lang, raw_value=raw)
-        if name == "enriched_tags":
+        _append_enriched_attribute(
+            result["enriched_attributes"],
+            name=output_name,
+            lang=lang,
+            raw_value=raw,
+        )
+        if output_name == "enriched_tags":
             _append_lang_phrase_map(result["enriched_tags"], lang=lang, raw_value=raw)
 
 
 def _normalize_index_content_item(item: Dict[str, Any]) -> Dict[str, str]:
-    item_id = str(item.get("id") or item.get("spu_id") or "").strip()
+    item_id = _get_product_id(item)
     return {
         "id": item_id,
         "title": str(item.get("title") or "").strip(),
@@ -369,7 +473,10 @@ def _get_cached_anchor_result(
         raw = _anchor_redis.get(key)
         if not raw:
             return None
-        return json.loads(raw)
+        result = _normalize_analysis_result(json.loads(raw), product=product, target_lang=target_lang)
+        if not _has_meaningful_analysis_content(result):
+            return None
+        return result
     except Exception as e:
         logger.warning(f"Failed to get anchor cache: {e}")
         return None
@@ -383,9 +490,12 @@ def _set_cached_anchor_result(
     if not _anchor_redis:
         return
     try:
+        normalized = _normalize_analysis_result(result, product=product, target_lang=target_lang)
+        if not _has_meaningful_analysis_content(normalized):
+            return
         key = _make_anchor_cache_key(product, target_lang)
         ttl = ANCHOR_CACHE_EXPIRE_DAYS * 24 * 3600
-        _anchor_redis.setex(key, ttl, json.dumps(result, ensure_ascii=False))
+        _anchor_redis.setex(key, ttl, json.dumps(normalized, ensure_ascii=False))
     except Exception as e:
         logger.warning(f"Failed to set anchor cache: {e}")
 
@@ -654,7 +764,7 @@ def parse_markdown_table(markdown_content: str) -> List[Dict[str, str]]:
             "seq_no": parts[0],
             "title": parts[1],  # product title (in the target language)
             "category_path": parts[2] if len(parts) > 2 else "",  # category path
-            "enriched_tags": parts[3] if len(parts) > 3 else "",  # fine-grained tags
+            "tags": parts[3] if len(parts) > 3 else "",  # fine-grained tags
             "target_audience": parts[4] if len(parts) > 4 else "",  # target audience
             "usage_scene": parts[5] if len(parts) > 5 else "",  # usage scene
             "season": parts[6] if len(parts) > 6 else "",  # season
@@ -705,7 +815,7 @@ def process_batch(
     batch_data: List[Dict[str, str]],
     batch_num: int,
     target_lang: str = "zh",
-) -> List[Dict[str, str]]:
+) -> List[Dict[str, Any]]:
     """Process one batch of data"""
     logger.info(f"\n{'#' * 80}")
     logger.info(f"Processing Batch {batch_num} ({len(batch_data)} items)")
@@ -725,22 +835,11 @@ def process_batch(
             target_lang,
         )
         return [
-            {
-                "id": item["id"],
-                "lang": target_lang,
-                "title_input": item.get("title", ""),
-                "title": "",
-                "category_path": "",
-                "enriched_tags": "",
-                "target_audience": "",
-                "usage_scene": "",
-                "season": "",
-                "key_attributes": "",
-                "material": "",
-                "features": "",
-                "anchor_text": "",
-                "error": f"prompt_creation_failed: unsupported target_lang={target_lang}",
-            }
+            _make_empty_analysis_result(
+                item,
+                target_lang,
+                error=f"prompt_creation_failed: unsupported target_lang={target_lang}",
+            )
             for item in batch_data
         ]
 
@@ -764,24 +863,18 @@ def process_batch(
         results_with_ids = []
         for i, parsed_item in enumerate(parsed_results):
             if i < len(batch_data):
-                original_id = batch_data[i]["id"]
-                result = {
-                    "id": original_id,
-                    "lang": target_lang,
-                    "title_input": batch_data[i]["title"],  # original input title
-                    "title": parsed_item.get("title", ""),  # model-generated title
-                    "category_path": parsed_item.get("category_path", ""),  # category path
-                    "enriched_tags": parsed_item.get("enriched_tags", ""),  # fine-grained tags
-                    "target_audience": parsed_item.get("target_audience", ""),  # target audience
-                    "usage_scene": parsed_item.get("usage_scene", ""),  # usage scene
-                    "season": parsed_item.get("season", ""),  # season
-                    "key_attributes": parsed_item.get("key_attributes", ""),  # key attributes
-                    "material": parsed_item.get("material", ""),  # material description
-                    "features": parsed_item.get("features", ""),  # features
-                    "anchor_text": parsed_item.get("anchor_text", ""),  # anchor text
-                }
+                source_product = batch_data[i]
+                result = _normalize_analysis_result(
+                    parsed_item,
+                    product=source_product,
+                    target_lang=target_lang,
+                )
                 results_with_ids.append(result)
-                logger.info(f"Mapped: seq={parsed_item['seq_no']} -> original_id={original_id}")
+                logger.info(
+                    "Mapped: seq=%s -> original_id=%s",
+                    parsed_item.get("seq_no"),
+                    source_product.get("id"),
+                )
 
         # Save the batch JSON log to its own file
         batch_log = {
@@ -808,22 +901,7 @@ def process_batch(
         logger.error(f"Error processing batch {batch_num}: {str(e)}", exc_info=True)
         # Return empty results, preserving the ID mapping
         return [
-            {
-                "id": item["id"],
-                "lang": target_lang,
-                "title_input": item["title"],
-                "title": "",
-                "category_path": "",
-                "enriched_tags": "",
-                "target_audience": "",
-                "usage_scene": "",
-                "season": "",
-                "key_attributes": "",
-                "material": "",
-                "features": "",
-                "anchor_text": "",
-                "error": str(e),
-            }
+            _make_empty_analysis_result(item, target_lang, error=str(e))
             for item in batch_data
         ]
 
scripts/evaluation/README.md
@@ -23,7 +23,7 @@ This directory holds the offline annotation builder, the evaluation web UI/API,
 | `fusion_experiments_round1.json` | Broader first-round experiments |
 | `queries/queries.txt` | Canonical evaluation queries |
 | `README_Requirement.md` | Product/requirements reference |
-| `quick_start_eval.sh` | Wrapper: `batch`, `batch-rebuild` (deep `build` + `--force-refresh-labels`), or `serve` |
+| `start_eval.sh` | Wrapper: `batch`, `batch-rebuild` (deep `build` + `--force-refresh-labels`), or `serve` |
 | `../start_eval_web.sh` | Same as `serve` with `activate.sh`; use `./scripts/service_ctl.sh start eval-web` (default port **6010**, override with `EVAL_WEB_PORT`). `./run.sh all` includes eval-web. |
 
 ## Quick start (repo root)
@@ -32,13 +32,13 @@ Set tenant if needed (`export TENANT_ID=163`). You need a live search API, DashS
 
 ```bash
 # Batch: live search for every query; only uncached (query, spu_id) pairs hit the LLM
-./scripts/evaluation/quick_start_eval.sh batch
+./scripts/evaluation/start_eval.sh batch
 
-# Deep rebuild: search recall top-500 (score 1) + full-corpus rerank outside pool + batched LLM (early stop; expensive)
-./scripts/evaluation/quick_start_eval.sh batch-rebuild
+# Deep rebuild: per-query full corpus rerank (outside search recall pool) + LLM in batches along global sort order (early stop; expensive)
+./scripts/evaluation/start_eval.sh batch-rebuild
 
 # UI: http://127.0.0.1:6010/
-./scripts/evaluation/quick_start_eval.sh serve
+./scripts/evaluation/start_eval.sh serve
 # or: ./scripts/service_ctl.sh start eval-web
 ```
@@ -69,9 +69,36 @@ Explicit equivalents:
     --port 6010
 ```
 
-Each `batch` run walks the full queries file. With `batch --force-refresh-labels`, every live top-`k` hit is re-judged by the LLM.
+Each `batch` run walks the full queries file and writes a **batch report** under `batch_reports/`. With `batch --force-refresh-labels`, every live top-`k` hit is re-judged by the LLM (still only those hits—not the deep rebuild pipeline).
 
-**Rebuild (`build --force-refresh-labels`):** For each query: take search top **500** as the recall pool (treated as rerank score **1**; those SKUs are not sent to the reranker). Rerank the rest of the tenant corpus; if more than **1000** non-pool docs have rerank score **> 0.5**, the query is **skipped** (logged as too easy / tail too relevant). Otherwise merge pool (search order) + non-pool (rerank score descending), then LLM-judge in batches of **50**, logging **exact_ratio** and **irrelevant_ratio** per batch. Stop after **3** consecutive batches with irrelevant_ratio **> 92%**, but only after at least **15** batches and at most **40** batches.
+### `start_eval.sh batch-rebuild` (deep annotation rebuild)
+
+This runs `build_annotation_set.py build` with **`--force-refresh-labels`** and **`--force-refresh-rerank`** (see the explicit command block below). It does **not** run the `batch` subcommand: there is **no** aggregate batch report for this step; outputs are per-query JSON under `query_builds/` plus updates in `search_eval.sqlite3`.
+
+For **each** query in `queries.txt`, in order:
+
+1. **Search recall** — Call the live search API with `size = max(--search-depth, --search-recall-top-k)` (the wrapper uses `--search-depth 500`). The first **`--search-recall-top-k`** hits (default **200**, see `eval_framework.constants.DEFAULT_SEARCH_RECALL_TOP_K`) form the **recall pool**; they are treated as rerank score **1** and are **not** sent to the reranker.
+2. **Full corpus** — Load the tenant's product corpus from Elasticsearch (same tenant as `TENANT_ID` / `--tenant-id`, default **163**), via `corpus_docs()` (cached in SQLite after the first load).
+3. **Rerank outside pool** — Every corpus document whose `spu_id` is **not** in the pool is scored by the reranker API, **80 documents per request**. With `--force-refresh-rerank`, all those scores are recomputed and written to the **`rerank_scores`** table in `search_eval.sqlite3`. Without that flag, existing `(tenant_id, query, spu_id)` scores are reused and only missing rows hit the API.
+4. **Skip "too easy" queries** — If more than **1000** non-pool documents have rerank score **> 0.5**, that query is **skipped** (one log line: tail too relevant / easy to satisfy). No LLM calls for that query.
+5. **Global sort** — Order to label: pool in **search rank order**, then all remaining corpus docs in **descending rerank score** (dedupe by `spu_id`, pool wins).
+6. **LLM labeling** — Walk that list **from the head** in batches of **50** by default (`--rebuild-llm-batch-size`). Each batch log includes **exact_ratio**, **irrelevant_ratio**, **low_ratio**, and **irrelevant_plus_low_ratio**.
+
+   **Early stop** (defaults in `eval_framework.constants`; overridable via CLI):
+
+   - Run **at least** `--rebuild-min-batches` batches (**10** by default) before any early stop is allowed.
+   - After that, define a **bad batch** as one where the batch has **no** **Exact Match** label **and** either:
+     - **Irrelevant** proportion **≥ 0.94** (`--rebuild-irrelevant-stop-ratio`), or
+     - **(Irrelevant + Low Relevant)** proportion **≥ 0.96** (`--rebuild-irrel-low-combined-stop-ratio`).
+     ("Low Relevant" is the weak tier; **High Relevant** does not count toward this combined ratio.)
+   - Count **consecutive** bad batches. **Reset** the count to 0 on any batch that is not bad.
+   - **Stop** when the consecutive bad count reaches **`--rebuild-irrelevant-stop-streak`** (**2** by default), or when **`--rebuild-max-batches`** (**40**) is reached—whichever comes first (up to **2000** docs per query at default batch size).
+
+   So labeling follows best-first order but **stops early** when the model sees two consecutive "dead" batches; the tail may never be judged.
+
+**Incremental pool (no full rebuild):** `build_annotation_set.py build` **without** `--force-refresh-labels` uses the older windowed pool (`--annotate-search-top-k`, `--annotate-rerank-top-k`) and fills missing labels in one pass—no rerank-skip rule and no LLM early-stop loop.
+
+**Tuning the rebuild path:** `--search-recall-top-k`, `--rerank-high-threshold`, `--rerank-high-skip-count`, `--rebuild-llm-batch-size`, `--rebuild-min-batches`, `--rebuild-max-batches`, `--rebuild-irrelevant-stop-ratio`, `--rebuild-irrel-low-combined-stop-ratio`, `--rebuild-irrelevant-stop-streak` on `build` (see `eval_framework/cli.py`). Rerank API chunk size is **80** docs per request in code (`full_corpus_rerank_outside_exclude`).
 
 ## Artifacts
 
@@ -95,7 +122,7 @@ Default root: `artifacts/search_evaluation/`
 
 **Standard:** Run `batch` without `--force-refresh-labels` to extend coverage, then use the UI or batch in cached mode. Single-query evaluation defaults to **no** auto-annotation: recall still hits the live API; scoring uses SQLite only, and unlabeled hits count as `Irrelevant`.
 
-**Incremental pool (no full rebuild):** `build_annotation_set.py build` without `--force-refresh-labels` merges search and full-corpus rerank windows before labeling (CLI `--search-depth`, `--rerank-depth`, `--annotate-*-top-k`). **Full rebuild** uses the recall-pool + rerank-skip + batched early-stop flow above; tune thresholds via `--search-recall-top-k`, `--rerank-high-threshold`, `--rerank-high-skip-count`, `--rebuild-*` flags on `build`.
+**Rebuild vs incremental `build`:** Deep rebuild is documented in the **`batch-rebuild`** subsection above. Incremental `build` (without `--force-refresh-labels`) uses `--annotate-search-top-k` / `--annotate-rerank-top-k` windows instead.
 
 **Fusion tuning:** `tune_fusion.py` writes experiment configs, restarts the backend, runs batch evaluation, and optionally applies the best variant (see `--experiments-file`, `--score-metric`, `--apply-best`).
 
scripts/evaluation/eval_framework/__init__.py
@@ -12,15 +12,15 @@ ensure_project_on_path()
 
 from .constants import (  # noqa: E402
     DEFAULT_ARTIFACT_ROOT,
-    DEFAULT_LABELER_MODE,
     DEFAULT_QUERY_FILE,
-    JUDGE_PROMPT_VERSION_COMPLEX,
-    JUDGE_PROMPT_VERSION_SIMPLE,
     PROJECT_ROOT,
     RELEVANCE_EXACT,
+    RELEVANCE_HIGH,
     RELEVANCE_IRRELEVANT,
-    RELEVANCE_PARTIAL,
+    RELEVANCE_LOW,
+    RELEVANCE_NON_IRRELEVANT,
     VALID_LABELS,
+    normalize_stored_label,
 )
 from .framework import SearchEvaluationFramework  # noqa: E402
 from .store import EvalStore, QueryBuildResult  # noqa: E402
@@ -36,22 +36,22 @@ from .utils import (  # noqa: E402
 
 __all__ = [
     "DEFAULT_ARTIFACT_ROOT",
-    "DEFAULT_LABELER_MODE",
     "DEFAULT_QUERY_FILE",
     "EvalStore",
-    "JUDGE_PROMPT_VERSION_COMPLEX",
-    "JUDGE_PROMPT_VERSION_SIMPLE",
     "PROJECT_ROOT",
     "QueryBuildResult",
     "RELEVANCE_EXACT",
+    "RELEVANCE_HIGH",
     "RELEVANCE_IRRELEVANT",
-    "RELEVANCE_PARTIAL",
+    "RELEVANCE_LOW",
+    "RELEVANCE_NON_IRRELEVANT",
     "SearchEvaluationFramework",
     "VALID_LABELS",
     "build_cli_parser",
     "create_web_app",
     "ensure_dir",
     "main",
+    "normalize_stored_label",
     "render_batch_report_markdown",
     "sha1_text",
     "utc_now_iso",
scripts/evaluation/eval_framework/cli.py
@@ -5,10 +5,11 @@ from __future__ import annotations
 import argparse
 import json
 from pathlib import Path
+from typing import Any, Dict
 
 from .constants import (
-    DEFAULT_LABELER_MODE,
     DEFAULT_QUERY_FILE,
+    DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO,
     DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO,
     DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK,
     DEFAULT_REBUILD_LLM_BATCH_SIZE,
@@ -23,6 +24,38 @@ from .utils import ensure_dir, utc_now_iso, utc_timestamp
 from .web_app import create_web_app
 
 
+def add_judge_llm_args(p: argparse.ArgumentParser) -> None:
+    p.add_argument(
+        "--judge-model",
+        default=None,
+        metavar="MODEL",
+        help="Judge LLM model (default: eval_framework.constants.DEFAULT_JUDGE_MODEL).",
+    )
+    p.add_argument(
+        "--enable-thinking",
+        action=argparse.BooleanOptionalAction,
+        default=None,
+        help="enable_thinking for DashScope (default: DEFAULT_JUDGE_ENABLE_THINKING).",
+    )
+    p.add_argument(
+        "--dashscope-batch",
+        action=argparse.BooleanOptionalAction,
+        default=None,
+        help="DashScope Batch File API vs sync chat (default: DEFAULT_JUDGE_DASHSCOPE_BATCH).",
+    )
+
+
+def framework_kwargs_from_args(args: argparse.Namespace) -> Dict[str, Any]:
+    kw: Dict[str, Any] = {}
+    if args.judge_model is not None:
+        kw["judge_model"] = args.judge_model
+    if args.enable_thinking is not None:
+        kw["enable_thinking"] = args.enable_thinking
+    if args.dashscope_batch is not None:
+        kw["use_dashscope_batch"] = args.dashscope_batch
+    return kw
+
+
 def build_cli_parser() -> argparse.ArgumentParser:
     parser = argparse.ArgumentParser(description="Search evaluation annotation builder and web UI")
     sub = parser.add_subparsers(dest="command", required=True)
@@ -38,7 +71,7 @@ def build_cli_parser() -> argparse.ArgumentParser:
         "--search-recall-top-k",
         type=int,
         default=None,
-        help="Rebuild mode only: top-K search hits enter recall pool with score 1 (default when --force-refresh-labels: 500).",
+        help="Rebuild mode only: top-K search hits enter recall pool with score 1 (default when --force-refresh-labels: 200).",
     )
     build.add_argument(
         "--rerank-high-threshold",
@@ -53,24 +86,30 @@ def build_cli_parser() -&gt; argparse.ArgumentParser: @@ -53,24 +86,30 @@ def build_cli_parser() -&gt; argparse.ArgumentParser:
53 help="Rebuild only: skip query if more than this many non-pool docs have rerank score > threshold (default 1000).", 86 help="Rebuild only: skip query if more than this many non-pool docs have rerank score > threshold (default 1000).",
54 ) 87 )
55 build.add_argument("--rebuild-llm-batch-size", type=int, default=None, help="Rebuild only: LLM batch size (default 50).") 88 build.add_argument("--rebuild-llm-batch-size", type=int, default=None, help="Rebuild only: LLM batch size (default 50).")
56 - build.add_argument("--rebuild-min-batches", type=int, default=None, help="Rebuild only: min LLM batches before early stop (default 15).") 89 + build.add_argument("--rebuild-min-batches", type=int, default=None, help="Rebuild only: min LLM batches before early stop (default 10).")
57 build.add_argument("--rebuild-max-batches", type=int, default=None, help="Rebuild only: max LLM batches (default 40).") 90 build.add_argument("--rebuild-max-batches", type=int, default=None, help="Rebuild only: max LLM batches (default 40).")
58 build.add_argument( 91 build.add_argument(
59 "--rebuild-irrelevant-stop-ratio", 92 "--rebuild-irrelevant-stop-ratio",
60 type=float, 93 type=float,
61 default=None, 94 default=None,
62 - help="Rebuild only: irrelevant ratio above this counts toward early-stop streak (default 0.92).", 95 + help="Rebuild only: a batch with no Exact Match counts toward the early-stop streak when its irrelevant ratio is >= this (default 0.94).",
  96 + )
  97 + build.add_argument(
  98 + "--rebuild-irrel-low-combined-stop-ratio",
  99 + type=float,
  100 + default=None,
  101 + help="Rebuild only: a batch with no Exact Match counts toward the early-stop streak when (irrelevant+low)/n is >= this (default 0.96).",
63 ) 102 )
64 build.add_argument( 103 build.add_argument(
65 "--rebuild-irrelevant-stop-streak", 104 "--rebuild-irrelevant-stop-streak",
66 type=int, 105 type=int,
67 default=None, 106 default=None,
68 - help="Rebuild only: stop after this many consecutive batches above irrelevant ratio (default 3).", 107 + help="Rebuild only: consecutive bad batches before early stop (default 2).",
69 ) 108 )
70 build.add_argument("--language", default="en") 109 build.add_argument("--language", default="en")
71 build.add_argument("--force-refresh-rerank", action="store_true") 110 build.add_argument("--force-refresh-rerank", action="store_true")
72 build.add_argument("--force-refresh-labels", action="store_true") 111 build.add_argument("--force-refresh-labels", action="store_true")
73 - build.add_argument("--labeler-mode", default=DEFAULT_LABELER_MODE, choices=["simple", "complex"]) 112 + add_judge_llm_args(build)
74 113
75 batch = sub.add_parser("batch", help="Run batch evaluation against live search") 114 batch = sub.add_parser("batch", help="Run batch evaluation against live search")
76 batch.add_argument("--tenant-id", default="163") 115 batch.add_argument("--tenant-id", default="163")
@@ -78,7 +117,7 @@ def build_cli_parser() -> argparse.ArgumentParser: @@ -78,7 +117,7 @@ def build_cli_parser() -> argparse.ArgumentParser:
78 batch.add_argument("--top-k", type=int, default=100) 117 batch.add_argument("--top-k", type=int, default=100)
79 batch.add_argument("--language", default="en") 118 batch.add_argument("--language", default="en")
80 batch.add_argument("--force-refresh-labels", action="store_true") 119 batch.add_argument("--force-refresh-labels", action="store_true")
81 - batch.add_argument("--labeler-mode", default=DEFAULT_LABELER_MODE, choices=["simple", "complex"]) 120 + add_judge_llm_args(batch)
82 121
83 audit = sub.add_parser("audit", help="Audit annotation quality for queries") 122 audit = sub.add_parser("audit", help="Audit annotation quality for queries")
84 audit.add_argument("--tenant-id", default="163") 123 audit.add_argument("--tenant-id", default="163")
@@ -87,20 +126,20 @@ def build_cli_parser() -> argparse.ArgumentParser: @@ -87,20 +126,20 @@ def build_cli_parser() -> argparse.ArgumentParser:
87 audit.add_argument("--language", default="en") 126 audit.add_argument("--language", default="en")
88 audit.add_argument("--limit-suspicious", type=int, default=5) 127 audit.add_argument("--limit-suspicious", type=int, default=5)
89 audit.add_argument("--force-refresh-labels", action="store_true") 128 audit.add_argument("--force-refresh-labels", action="store_true")
90 - audit.add_argument("--labeler-mode", default=DEFAULT_LABELER_MODE, choices=["simple", "complex"]) 129 + add_judge_llm_args(audit)
91 130
92 serve = sub.add_parser("serve", help="Serve evaluation web UI on port 6010") 131 serve = sub.add_parser("serve", help="Serve evaluation web UI on port 6010")
93 serve.add_argument("--tenant-id", default="163") 132 serve.add_argument("--tenant-id", default="163")
94 serve.add_argument("--queries-file", default=str(DEFAULT_QUERY_FILE)) 133 serve.add_argument("--queries-file", default=str(DEFAULT_QUERY_FILE))
95 serve.add_argument("--host", default="0.0.0.0") 134 serve.add_argument("--host", default="0.0.0.0")
96 serve.add_argument("--port", type=int, default=6010) 135 serve.add_argument("--port", type=int, default=6010)
97 - serve.add_argument("--labeler-mode", default=DEFAULT_LABELER_MODE, choices=["simple", "complex"]) 136 + add_judge_llm_args(serve)
98 137
99 return parser 138 return parser
100 139
101 140
102 def run_build(args: argparse.Namespace) -> None: 141 def run_build(args: argparse.Namespace) -> None:
103 - framework = SearchEvaluationFramework(tenant_id=args.tenant_id, labeler_mode=args.labeler_mode) 142 + framework = SearchEvaluationFramework(tenant_id=args.tenant_id, **framework_kwargs_from_args(args))
104 queries = framework.queries_from_file(Path(args.queries_file)) 143 queries = framework.queries_from_file(Path(args.queries_file))
105 summary = [] 144 summary = []
106 rebuild_kwargs = {} 145 rebuild_kwargs = {}
@@ -115,6 +154,9 @@ def run_build(args: argparse.Namespace) -> None: @@ -115,6 +154,9 @@ def run_build(args: argparse.Namespace) -> None:
115 "rebuild_irrelevant_stop_ratio": args.rebuild_irrelevant_stop_ratio 154 "rebuild_irrelevant_stop_ratio": args.rebuild_irrelevant_stop_ratio
116 if args.rebuild_irrelevant_stop_ratio is not None 155 if args.rebuild_irrelevant_stop_ratio is not None
117 else DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO, 156 else DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO,
  157 + "rebuild_irrel_low_combined_stop_ratio": args.rebuild_irrel_low_combined_stop_ratio
  158 + if args.rebuild_irrel_low_combined_stop_ratio is not None
  159 + else DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO,
118 "rebuild_irrelevant_stop_streak": args.rebuild_irrelevant_stop_streak 160 "rebuild_irrelevant_stop_streak": args.rebuild_irrelevant_stop_streak
119 if args.rebuild_irrelevant_stop_streak is not None 161 if args.rebuild_irrelevant_stop_streak is not None
120 else DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK, 162 else DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK,
@@ -152,7 +194,7 @@ def run_build(args: argparse.Namespace) -> None: @@ -152,7 +194,7 @@ def run_build(args: argparse.Namespace) -> None:
152 194
153 195
154 def run_batch(args: argparse.Namespace) -> None: 196 def run_batch(args: argparse.Namespace) -> None:
155 - framework = SearchEvaluationFramework(tenant_id=args.tenant_id, labeler_mode=args.labeler_mode) 197 + framework = SearchEvaluationFramework(tenant_id=args.tenant_id, **framework_kwargs_from_args(args))
156 queries = framework.queries_from_file(Path(args.queries_file)) 198 queries = framework.queries_from_file(Path(args.queries_file))
157 payload = framework.batch_evaluate( 199 payload = framework.batch_evaluate(
158 queries=queries, 200 queries=queries,
@@ -165,7 +207,7 @@ def run_batch(args: argparse.Namespace) -> None: @@ -165,7 +207,7 @@ def run_batch(args: argparse.Namespace) -> None:
165 207
166 208
167 def run_audit(args: argparse.Namespace) -> None: 209 def run_audit(args: argparse.Namespace) -> None:
168 - framework = SearchEvaluationFramework(tenant_id=args.tenant_id, labeler_mode=args.labeler_mode) 210 + framework = SearchEvaluationFramework(tenant_id=args.tenant_id, **framework_kwargs_from_args(args))
169 queries = framework.queries_from_file(Path(args.queries_file)) 211 queries = framework.queries_from_file(Path(args.queries_file))
170 audit_items = [] 212 audit_items = []
171 for query in queries: 213 for query in queries:
@@ -215,7 +257,7 @@ def run_audit(args: argparse.Namespace) -> None: @@ -215,7 +257,7 @@ def run_audit(args: argparse.Namespace) -> None:
215 257
216 258
217 def run_serve(args: argparse.Namespace) -> None: 259 def run_serve(args: argparse.Namespace) -> None:
218 - framework = SearchEvaluationFramework(tenant_id=args.tenant_id, labeler_mode=args.labeler_mode) 260 + framework = SearchEvaluationFramework(tenant_id=args.tenant_id, **framework_kwargs_from_args(args))
219 app = create_web_app(framework, Path(args.queries_file)) 261 app = create_web_app(framework, Path(args.queries_file))
220 import uvicorn 262 import uvicorn
221 263
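The CLI diff above replaces the per-subcommand `--labeler-mode` flag with shared judge-LLM flags added by `add_judge_llm_args` and collected by `framework_kwargs_from_args`. A minimal sketch of that wiring, under assumptions: the real body of `add_judge_llm_args` is not shown in this hunk, so the flag definitions below (tri-state `None` defaults so unset flags are omitted from the kwargs) are hypothetical:

```python
import argparse

# Hypothetical sketch: flag names follow the diff, but add_judge_llm_args'
# real definitions are not shown in this hunk.
def add_judge_llm_args(p: argparse.ArgumentParser) -> None:
    # Defaults of None mean "not set on the CLI"; framework defaults then apply.
    p.add_argument("--judge-model", default=None)
    p.add_argument("--enable-thinking", dest="enable_thinking",
                   action="store_const", const=True, default=None)
    p.add_argument("--dashscope-batch", dest="dashscope_batch",
                   action="store_const", const=True, default=None)

def framework_kwargs_from_args(args: argparse.Namespace) -> dict:
    # Only explicitly-set flags are forwarded to SearchEvaluationFramework.
    kw = {}
    if args.judge_model is not None:
        kw["judge_model"] = args.judge_model
    if args.enable_thinking is not None:
        kw["enable_thinking"] = args.enable_thinking
    if args.dashscope_batch is not None:
        kw["use_dashscope_batch"] = args.dashscope_batch
    return kw

parser = argparse.ArgumentParser()
add_judge_llm_args(parser)
args = parser.parse_args(["--judge-model", "qwen-x", "--dashscope-batch"])
```

Because unset flags stay `None`, every subcommand can share the same helper without overriding the framework's own defaults.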
scripts/evaluation/eval_framework/clients.py
@@ -2,30 +2,49 @@ @@ -2,30 +2,49 @@
2 2
3 from __future__ import annotations 3 from __future__ import annotations
4 4
  5 +import io
  6 +import json
  7 +import time
  8 +import uuid
5 from typing import Any, Dict, List, Optional, Sequence, Tuple 9 from typing import Any, Dict, List, Optional, Sequence, Tuple
6 10
7 import requests 11 import requests
8 12
9 from .constants import VALID_LABELS 13 from .constants import VALID_LABELS
10 -from .prompts import (  
11 - classify_batch_complex_prompt,  
12 - classify_batch_simple_prompt,  
13 - extract_query_profile_prompt,  
14 -) 14 +from .prompts import classify_prompt
15 from .utils import build_label_doc_line, extract_json_blob, safe_json_dumps 15 from .utils import build_label_doc_line, extract_json_blob, safe_json_dumps
16 16
17 17
  18 +def _canonicalize_judge_label(raw: str) -> str | None:
  19 + s = str(raw or "").strip().strip('"').strip("'")
  20 + if s in VALID_LABELS:
  21 + return s
  22 + low = s.lower()
  23 + for v in VALID_LABELS:
  24 + if v.lower() == low:
  25 + return v
  26 + return None
  27 +
  28 +
18 class SearchServiceClient: 29 class SearchServiceClient:
19 def __init__(self, base_url: str, tenant_id: str): 30 def __init__(self, base_url: str, tenant_id: str):
20 self.base_url = base_url.rstrip("/") 31 self.base_url = base_url.rstrip("/")
21 self.tenant_id = str(tenant_id) 32 self.tenant_id = str(tenant_id)
22 self.session = requests.Session() 33 self.session = requests.Session()
23 34
24 - def search(self, query: str, size: int, from_: int = 0, language: str = "en") -> Dict[str, Any]: 35 + def search(self, query: str, size: int, from_: int = 0, language: str = "en", *, debug: bool = False) -> Dict[str, Any]:
  36 + payload: Dict[str, Any] = {
  37 + "query": query,
  38 + "size": size,
  39 + "from": from_,
  40 + "language": language,
  41 + }
  42 + if debug:
  43 + payload["debug"] = True
25 response = self.session.post( 44 response = self.session.post(
26 f"{self.base_url}/search/", 45 f"{self.base_url}/search/",
27 headers={"Content-Type": "application/json", "X-Tenant-ID": self.tenant_id}, 46 headers={"Content-Type": "application/json", "X-Tenant-ID": self.tenant_id},
28 - json={"query": query, "size": size, "from": from_, "language": language}, 47 + json=payload,
29 timeout=120, 48 timeout=120,
30 ) 49 )
31 response.raise_for_status() 50 response.raise_for_status()
@@ -52,26 +71,55 @@ class RerankServiceClient: @@ -52,26 +71,55 @@ class RerankServiceClient:
52 71
53 72
54 class DashScopeLabelClient: 73 class DashScopeLabelClient:
55 - def __init__(self, model: str, base_url: str, api_key: str, batch_size: int = 40): 74 + """DashScope OpenAI-compatible chat: synchronous or Batch File API (JSONL job).
  75 +
  76 + Batch flow: https://help.aliyun.com/zh/model-studio/batch-interfaces-compatible-with-openai/
  77 +
  78 + Some regional endpoints (e.g. ``dashscope-us`` compatible-mode) do not implement ``/batches``;
  79 + on HTTP 404 from batch calls we fall back to synchronous ``/chat/completions`` and stop using batch
  80 + for subsequent requests on this client.
  81 + """
  82 +
  83 + def __init__(
  84 + self,
  85 + model: str,
  86 + base_url: str,
  87 + api_key: str,
  88 + batch_size: int = 40,
  89 + *,
  90 + batch_completion_window: str = "24h",
  91 + batch_poll_interval_sec: float = 10.0,
  92 + enable_thinking: bool = True,
  93 + use_batch: bool = False,
  94 + ):
56 self.model = model 95 self.model = model
57 self.base_url = base_url.rstrip("/") 96 self.base_url = base_url.rstrip("/")
58 self.api_key = api_key 97 self.api_key = api_key
59 self.batch_size = int(batch_size) 98 self.batch_size = int(batch_size)
  99 + self.batch_completion_window = str(batch_completion_window)
  100 + self.batch_poll_interval_sec = float(batch_poll_interval_sec)
  101 + self.enable_thinking = bool(enable_thinking)
  102 + self.use_batch = bool(use_batch)
60 self.session = requests.Session() 103 self.session = requests.Session()
61 104
62 - def _chat(self, prompt: str) -> Tuple[str, str]: 105 + def _auth_headers(self) -> Dict[str, str]:
  106 + return {"Authorization": f"Bearer {self.api_key}"}
  107 +
  108 + def _completion_body(self, prompt: str) -> Dict[str, Any]:
  109 + body: Dict[str, Any] = {
  110 + "model": self.model,
  111 + "messages": [{"role": "user", "content": prompt}],
  112 + "temperature": 0,
  113 + "top_p": 0.1,
  114 + "enable_thinking": self.enable_thinking,
  115 + }
  116 + return body
  117 +
  118 + def _chat_sync(self, prompt: str) -> Tuple[str, str]:
63 response = self.session.post( 119 response = self.session.post(
64 f"{self.base_url}/chat/completions", 120 f"{self.base_url}/chat/completions",
65 - headers={  
66 - "Authorization": f"Bearer {self.api_key}",  
67 - "Content-Type": "application/json",  
68 - },  
69 - json={  
70 - "model": self.model,  
71 - "messages": [{"role": "user", "content": prompt}],  
72 - "temperature": 0,  
73 - "top_p": 0.1,  
74 - }, 121 + headers={**self._auth_headers(), "Content-Type": "application/json"},
  122 + json=self._completion_body(prompt),
75 timeout=180, 123 timeout=180,
76 ) 124 )
77 response.raise_for_status() 125 response.raise_for_status()
@@ -79,71 +127,146 @@ class DashScopeLabelClient: @@ -79,71 +127,146 @@ class DashScopeLabelClient:
79 content = str(((data.get("choices") or [{}])[0].get("message") or {}).get("content") or "").strip() 127 content = str(((data.get("choices") or [{}])[0].get("message") or {}).get("content") or "").strip()
80 return content, safe_json_dumps(data) 128 return content, safe_json_dumps(data)
81 129
82 - def classify_batch_simple( 130 + def _chat_batch(self, prompt: str) -> Tuple[str, str]:
  131 + """One chat completion via Batch File API (single-line JSONL job)."""
  132 + custom_id = uuid.uuid4().hex
  133 + body = self._completion_body(prompt)
  134 + line_obj = {
  135 + "custom_id": custom_id,
  136 + "method": "POST",
  137 + "url": "/v1/chat/completions",
  138 + "body": body,
  139 + }
  140 + jsonl = json.dumps(line_obj, ensure_ascii=False, separators=(",", ":")) + "\n"
  141 + auth = self._auth_headers()
  142 +
  143 + up = self.session.post(
  144 + f"{self.base_url}/files",
  145 + headers=auth,
  146 + files={
  147 + "file": (
  148 + "eval_batch_input.jsonl",
  149 + io.BytesIO(jsonl.encode("utf-8")),
  150 + "application/octet-stream",
  151 + )
  152 + },
  153 + data={"purpose": "batch"},
  154 + timeout=300,
  155 + )
  156 + up.raise_for_status()
  157 + file_id = (up.json() or {}).get("id")
  158 + if not file_id:
  159 + raise RuntimeError(f"DashScope file upload returned no id: {up.text!r}")
  160 +
  161 + cr = self.session.post(
  162 + f"{self.base_url}/batches",
  163 + headers={**auth, "Content-Type": "application/json"},
  164 + json={
  165 + "input_file_id": file_id,
  166 + "endpoint": "/v1/chat/completions",
  167 + "completion_window": self.batch_completion_window,
  168 + },
  169 + timeout=120,
  170 + )
  171 + cr.raise_for_status()
  172 + batch_payload = cr.json() or {}
  173 + batch_id = batch_payload.get("id")
  174 + if not batch_id:
  175 + raise RuntimeError(f"DashScope batches.create returned no id: {cr.text!r}")
  176 +
  177 + terminal = frozenset({"completed", "failed", "expired", "cancelled"})
  178 + batch: Dict[str, Any] = dict(batch_payload)
  179 + status = str(batch.get("status") or "")
  180 + while status not in terminal:
  181 + time.sleep(self.batch_poll_interval_sec)
  182 + br = self.session.get(f"{self.base_url}/batches/{batch_id}", headers=auth, timeout=120)
  183 + br.raise_for_status()
  184 + batch = br.json() or {}
  185 + status = str(batch.get("status") or "")
  186 +
  187 + if status != "completed":
  188 + raise RuntimeError(
  189 + f"DashScope batch {batch_id} ended with status={status!r} errors={batch.get('errors')!r}"
  190 + )
  191 +
  192 + out_id = batch.get("output_file_id")
  193 + err_id = batch.get("error_file_id")
  194 +
  195 + row = self._find_batch_line_for_custom_id(out_id, custom_id, auth)
  196 + if row is None:
  197 + err_row = self._find_batch_line_for_custom_id(err_id, custom_id, auth)
  198 + if err_row is not None:
  199 + raise RuntimeError(f"DashScope batch request failed: {err_row!r}")
  200 + raise RuntimeError(f"DashScope batch output missing custom_id={custom_id!r}")
  201 +
  202 + resp = row.get("response") or {}
  203 + sc = resp.get("status_code")
  204 + if sc is not None and int(sc) != 200:
  205 + raise RuntimeError(f"DashScope batch line error: {row!r}")
  206 +
  207 + data = resp.get("body") or {}
  208 + content = str(((data.get("choices") or [{}])[0].get("message") or {}).get("content") or "").strip()
  209 + return content, safe_json_dumps(row)
  210 +
  211 + def _chat(self, prompt: str) -> Tuple[str, str]:
  212 + if not self.use_batch:
  213 + return self._chat_sync(prompt)
  214 + try:
  215 + return self._chat_batch(prompt)
  216 + except requests.exceptions.HTTPError as e:
  217 + resp = getattr(e, "response", None)
  218 + if resp is not None and resp.status_code == 404:
  219 + self.use_batch = False
  220 + return self._chat_sync(prompt)
  221 + raise
  222 +
  223 + def _find_batch_line_for_custom_id(
  224 + self,
  225 + file_id: Optional[str],
  226 + custom_id: str,
  227 + auth: Dict[str, str],
  228 + ) -> Optional[Dict[str, Any]]:
  229 + if not file_id or str(file_id) in ("null", ""):
  230 + return None
  231 + r = self.session.get(f"{self.base_url}/files/{file_id}/content", headers=auth, timeout=300)
  232 + r.raise_for_status()
  233 + for raw in r.text.splitlines():
  234 + raw = raw.strip()
  235 + if not raw:
  236 + continue
  237 + try:
  238 + obj = json.loads(raw)
  239 + except json.JSONDecodeError:
  240 + continue
  241 + if str(obj.get("custom_id")) == custom_id:
  242 + return obj
  243 + return None
  244 +
  245 + def classify_batch(
83 self, 246 self,
84 query: str, 247 query: str,
85 docs: Sequence[Dict[str, Any]], 248 docs: Sequence[Dict[str, Any]],
86 ) -> Tuple[List[str], str]: 249 ) -> Tuple[List[str], str]:
87 numbered_docs = [build_label_doc_line(idx + 1, doc) for idx, doc in enumerate(docs)] 250 numbered_docs = [build_label_doc_line(idx + 1, doc) for idx, doc in enumerate(docs)]
88 - prompt = classify_batch_simple_prompt(query, numbered_docs) 251 + prompt = classify_prompt(query, numbered_docs)
89 content, raw_response = self._chat(prompt) 252 content, raw_response = self._chat(prompt)
90 - labels = [] 253 + labels: List[str] = []
91 for line in str(content or "").splitlines(): 254 for line in str(content or "").splitlines():
92 - label = line.strip()  
93 - if label in VALID_LABELS:  
94 - labels.append(label) 255 + canon = _canonicalize_judge_label(line)
  256 + if canon is not None:
  257 + labels.append(canon)
95 if len(labels) != len(docs): 258 if len(labels) != len(docs):
96 payload = extract_json_blob(content) 259 payload = extract_json_blob(content)
97 if isinstance(payload, dict) and isinstance(payload.get("labels"), list): 260 if isinstance(payload, dict) and isinstance(payload.get("labels"), list):
98 labels = [] 261 labels = []
99 for item in payload["labels"][: len(docs)]: 262 for item in payload["labels"][: len(docs)]:
100 if isinstance(item, dict): 263 if isinstance(item, dict):
101 - label = str(item.get("label") or "").strip() 264 + raw_l = str(item.get("label") or "").strip()
102 else: 265 else:
103 - label = str(item).strip()  
104 - if label in VALID_LABELS:  
105 - labels.append(label)  
106 - if len(labels) != len(docs) or any(label not in VALID_LABELS for label in labels):  
107 - raise ValueError(f"unexpected simple label output: {content!r}")  
108 - return labels, raw_response  
109 -  
110 - def extract_query_profile(  
111 - self,  
112 - query: str,  
113 - parser_hints: Dict[str, Any],  
114 - ) -> Tuple[Dict[str, Any], str]:  
115 - prompt = extract_query_profile_prompt(query, parser_hints)  
116 - content, raw_response = self._chat(prompt)  
117 - payload = extract_json_blob(content)  
118 - if not isinstance(payload, dict):  
119 - raise ValueError(f"unexpected query profile payload: {content!r}")  
120 - payload.setdefault("normalized_query_en", query)  
121 - payload.setdefault("primary_category", "")  
122 - payload.setdefault("allowed_categories", [])  
123 - payload.setdefault("required_attributes", [])  
124 - payload.setdefault("notes", [])  
125 - return payload, raw_response  
126 -  
127 - def classify_batch_complex(  
128 - self,  
129 - query: str,  
130 - query_profile: Dict[str, Any],  
131 - docs: Sequence[Dict[str, Any]],  
132 - ) -> Tuple[List[str], str]:  
133 - numbered_docs = [build_label_doc_line(idx + 1, doc) for idx, doc in enumerate(docs)]  
134 - prompt = classify_batch_complex_prompt(query, query_profile, numbered_docs)  
135 - content, raw_response = self._chat(prompt)  
136 - payload = extract_json_blob(content)  
137 - if not isinstance(payload, dict) or not isinstance(payload.get("labels"), list):  
138 - raise ValueError(f"unexpected label payload: {content!r}")  
139 - labels_payload = payload["labels"]  
140 - labels: List[str] = []  
141 - for item in labels_payload[: len(docs)]:  
142 - if not isinstance(item, dict):  
143 - continue  
144 - label = str(item.get("label") or "").strip()  
145 - if label in VALID_LABELS:  
146 - labels.append(label) 266 + raw_l = str(item).strip()
  267 + canon = _canonicalize_judge_label(raw_l)
  268 + if canon is not None:
  269 + labels.append(canon)
147 if len(labels) != len(docs) or any(label not in VALID_LABELS for label in labels): 270 if len(labels) != len(docs) or any(label not in VALID_LABELS for label in labels):
148 - raise ValueError(f"unexpected label output: {content!r}") 271 + raise ValueError(f"unexpected classify output: {content!r}")
149 return labels, raw_response 272 return labels, raw_response
scripts/evaluation/eval_framework/constants.py
@@ -6,24 +6,60 @@ _PKG_DIR = Path(__file__).resolve().parent @@ -6,24 +6,60 @@ _PKG_DIR = Path(__file__).resolve().parent
6 _SCRIPTS_EVAL_DIR = _PKG_DIR.parent 6 _SCRIPTS_EVAL_DIR = _PKG_DIR.parent
7 PROJECT_ROOT = _SCRIPTS_EVAL_DIR.parents[1] 7 PROJECT_ROOT = _SCRIPTS_EVAL_DIR.parents[1]
8 8
9 -RELEVANCE_EXACT = "Exact"  
10 -RELEVANCE_PARTIAL = "Partial" 9 +# Canonical English labels (must match LLM prompt output in prompts._CLASSIFY_TEMPLATE_EN)
  10 +RELEVANCE_EXACT = "Exact Match"
  11 +RELEVANCE_HIGH = "High Relevant"
  12 +RELEVANCE_LOW = "Low Relevant"
11 RELEVANCE_IRRELEVANT = "Irrelevant" 13 RELEVANCE_IRRELEVANT = "Irrelevant"
12 -VALID_LABELS = {RELEVANCE_EXACT, RELEVANCE_PARTIAL, RELEVANCE_IRRELEVANT} 14 +
  15 +VALID_LABELS = frozenset({RELEVANCE_EXACT, RELEVANCE_HIGH, RELEVANCE_LOW, RELEVANCE_IRRELEVANT})
  16 +
  17 +# Precision / MAP "positive" set (all non-irrelevant tiers)
  18 +RELEVANCE_NON_IRRELEVANT = frozenset({RELEVANCE_EXACT, RELEVANCE_HIGH, RELEVANCE_LOW})
  19 +
  20 +_LEGACY_LABEL_MAP = {
  21 + "Exact": RELEVANCE_EXACT,
  22 + "Partial": RELEVANCE_HIGH,
  23 +}
  24 +
  25 +
  26 +def normalize_stored_label(label: str) -> str:
  27 + """Map legacy 3-way SQLite labels to current 4-way strings; pass through canonical labels."""
  28 + s = str(label).strip()
  29 + if s in VALID_LABELS:
  30 + return s
  31 + return _LEGACY_LABEL_MAP.get(s, s)
  32 +
13 33
14 DEFAULT_ARTIFACT_ROOT = PROJECT_ROOT / "artifacts" / "search_evaluation" 34 DEFAULT_ARTIFACT_ROOT = PROJECT_ROOT / "artifacts" / "search_evaluation"
15 DEFAULT_QUERY_FILE = _SCRIPTS_EVAL_DIR / "queries" / "queries.txt" 35 DEFAULT_QUERY_FILE = _SCRIPTS_EVAL_DIR / "queries" / "queries.txt"
16 36
17 -JUDGE_PROMPT_VERSION_SIMPLE = "v3_simple_20260331"  
18 -JUDGE_PROMPT_VERSION_COMPLEX = "v2_structured_20260331"  
19 -DEFAULT_LABELER_MODE = "simple" 37 +# Judge LLM (eval_framework only; override via CLI --judge-model / constructor kwargs)
  38 +DEFAULT_JUDGE_MODEL = "qwen3.5-flash"
  39 +DEFAULT_JUDGE_ENABLE_THINKING = True
  40 +DEFAULT_JUDGE_DASHSCOPE_BATCH = False
  41 +DEFAULT_JUDGE_BATCH_COMPLETION_WINDOW = "24h"
  42 +DEFAULT_JUDGE_BATCH_POLL_INTERVAL_SEC = 10.0
20 43
21 -# Rebuild annotation pool (build --force-refresh-labels): search recall + full-corpus rerank + LLM batches  
22 -DEFAULT_SEARCH_RECALL_TOP_K = 500 44 +# --- Rebuild annotation pool (``build --force-refresh-labels``) ---
  45 +# Flow: search recall pool (rerank_score=1, no rerank API) + rerank rest of corpus +
  46 +# LLM labels in fixed-size batches along global order (see ``framework._annotate_rebuild_batches``).
  47 +DEFAULT_SEARCH_RECALL_TOP_K = 200
23 DEFAULT_RERANK_HIGH_THRESHOLD = 0.5 48 DEFAULT_RERANK_HIGH_THRESHOLD = 0.5
24 DEFAULT_RERANK_HIGH_SKIP_COUNT = 1000 49 DEFAULT_RERANK_HIGH_SKIP_COUNT = 1000
25 DEFAULT_REBUILD_LLM_BATCH_SIZE = 50 50 DEFAULT_REBUILD_LLM_BATCH_SIZE = 50
26 -DEFAULT_REBUILD_MIN_LLM_BATCHES = 15 51 +# At least this many LLM batches run before early-stop is considered.
  52 +DEFAULT_REBUILD_MIN_LLM_BATCHES = 10
  53 +# Hard cap on LLM batches per query (each batch labels up to ``DEFAULT_REBUILD_LLM_BATCH_SIZE`` docs).
27 DEFAULT_REBUILD_MAX_LLM_BATCHES = 40 54 DEFAULT_REBUILD_MAX_LLM_BATCHES = 40
28 -DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO = 0.92  
29 -DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK = 3 55 +
  56 +# LLM early-stop (only after ``DEFAULT_REBUILD_MIN_LLM_BATCHES`` completed):
  57 +# A batch is "bad" when it has **no** ``Exact Match`` label AND either:
  58 +# - irrelevant_ratio >= DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO, or
  59 +# - (Irrelevant + Low Relevant) / n >= DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO.
  60 +# ``irrelevant_ratio`` = Irrelevant count / n; weak relevance is ``RELEVANCE_LOW`` ("Low Relevant").
  61 +# If a batch is bad, increment a streak; otherwise reset streak to 0. Stop when streak reaches
  62 +# ``DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK`` (consecutive bad batches).
  63 +DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO = 0.94
  64 +DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO = 0.96
  65 +DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK = 2
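The early-stop rule documented in the constants above can be sketched as follows. This is a hypothetical re-implementation for illustration only; the real logic lives in `framework._annotate_rebuild_batches`, which is not shown in full here, and the exact interaction between the minimum-batch gate and streak counting is assumed:

```python
# Sketch of the rebuild early-stop rule; function names are illustrative.
def is_bad_batch(labels, irrel_ratio=0.94, irrel_low_ratio=0.96):
    n = len(labels)
    # A batch with any Exact Match is never "bad".
    if n == 0 or "Exact Match" in labels:
        return False
    irrel = labels.count("Irrelevant")
    low = labels.count("Low Relevant")
    return irrel / n >= irrel_ratio or (irrel + low) / n >= irrel_low_ratio

def should_stop(batches, min_batches=10, streak_needed=2):
    # Stop once `streak_needed` consecutive bad batches occur, but only
    # after at least `min_batches` batches have completed.
    streak = 0
    for i, labels in enumerate(batches, start=1):
        streak = streak + 1 if is_bad_batch(labels) else 0
        if i >= min_batches and streak >= streak_needed:
            return True
    return False
```

A good batch resets the streak to 0, so two isolated bad batches separated by a productive one do not trigger the stop.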
scripts/evaluation/eval_framework/framework.py
@@ -10,13 +10,18 @@ from typing import Any, Dict, List, Sequence, Tuple @@ -10,13 +10,18 @@ from typing import Any, Dict, List, Sequence, Tuple
10 import requests 10 import requests
11 from elasticsearch.helpers import scan 11 from elasticsearch.helpers import scan
12 12
13 -from api.app import get_app_config, get_es_client, get_query_parser, init_service 13 +from api.app import get_app_config, get_es_client, init_service
14 from indexer.mapping_generator import get_tenant_index_name 14 from indexer.mapping_generator import get_tenant_index_name
15 15
16 from .clients import DashScopeLabelClient, RerankServiceClient, SearchServiceClient 16 from .clients import DashScopeLabelClient, RerankServiceClient, SearchServiceClient
17 from .constants import ( 17 from .constants import (
18 DEFAULT_ARTIFACT_ROOT, 18 DEFAULT_ARTIFACT_ROOT,
19 - DEFAULT_LABELER_MODE, 19 + DEFAULT_JUDGE_BATCH_COMPLETION_WINDOW,
  20 + DEFAULT_JUDGE_BATCH_POLL_INTERVAL_SEC,
  21 + DEFAULT_JUDGE_DASHSCOPE_BATCH,
  22 + DEFAULT_JUDGE_ENABLE_THINKING,
  23 + DEFAULT_JUDGE_MODEL,
  24 + DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO,
20 DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO, 25 DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO,
21 DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK, 26 DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK,
22 DEFAULT_REBUILD_LLM_BATCH_SIZE, 27 DEFAULT_REBUILD_LLM_BATCH_SIZE,
@@ -25,10 +30,11 @@ from .constants import ( @@ -25,10 +30,11 @@ from .constants import (
25 DEFAULT_RERANK_HIGH_SKIP_COUNT, 30 DEFAULT_RERANK_HIGH_SKIP_COUNT,
26 DEFAULT_RERANK_HIGH_THRESHOLD, 31 DEFAULT_RERANK_HIGH_THRESHOLD,
27 DEFAULT_SEARCH_RECALL_TOP_K, 32 DEFAULT_SEARCH_RECALL_TOP_K,
28 - JUDGE_PROMPT_VERSION_COMPLEX,  
29 RELEVANCE_EXACT, 33 RELEVANCE_EXACT,
  34 + RELEVANCE_HIGH,
30 RELEVANCE_IRRELEVANT, 35 RELEVANCE_IRRELEVANT,
31 - RELEVANCE_PARTIAL, 36 + RELEVANCE_LOW,
  37 + RELEVANCE_NON_IRRELEVANT,
32 VALID_LABELS, 38 VALID_LABELS,
33 ) 39 )
34 from .metrics import aggregate_metrics, compute_query_metrics, label_distribution 40 from .metrics import aggregate_metrics, compute_query_metrics, label_distribution
@@ -40,26 +46,44 @@ from .utils import ( @@ -40,26 +46,44 @@ from .utils import (
40 compact_option_values, 46 compact_option_values,
41 compact_product_payload, 47 compact_product_payload,
42 ensure_dir, 48 ensure_dir,
43 - normalize_text,  
44 - pick_text,  
45 sha1_text, 49 sha1_text,
46 utc_now_iso, 50 utc_now_iso,
47 utc_timestamp, 51 utc_timestamp,
  52 + zh_title_from_multilingual,
48 ) 53 )
49 54
50 55
  56 +def _zh_titles_from_debug_per_result(debug_info: Any) -> Dict[str, str]:
  57 + """Map ``spu_id`` -> Chinese title from ``debug_info.per_result[].title_multilingual``."""
  58 + out: Dict[str, str] = {}
  59 + if not isinstance(debug_info, dict):
  60 + return out
  61 + for entry in debug_info.get("per_result") or []:
  62 + if not isinstance(entry, dict):
  63 + continue
  64 + spu_id = str(entry.get("spu_id") or "").strip()
  65 + if not spu_id:
  66 + continue
  67 + zh = zh_title_from_multilingual(entry.get("title_multilingual"))
  68 + if zh:
  69 + out[spu_id] = zh
  70 + return out
  71 +
  72 +
51 class SearchEvaluationFramework: 73 class SearchEvaluationFramework:
52 def __init__( 74 def __init__(
53 self, 75 self,
54 tenant_id: str, 76 tenant_id: str,
55 artifact_root: Path = DEFAULT_ARTIFACT_ROOT, 77 artifact_root: Path = DEFAULT_ARTIFACT_ROOT,
56 search_base_url: str = "http://localhost:6002", 78 search_base_url: str = "http://localhost:6002",
57 - labeler_mode: str = DEFAULT_LABELER_MODE, 79 + *,
  80 + judge_model: str | None = None,
  81 + enable_thinking: bool | None = None,
  82 + use_dashscope_batch: bool | None = None,
58 ): 83 ):
59 init_service(get_app_config().infrastructure.elasticsearch.host) 84 init_service(get_app_config().infrastructure.elasticsearch.host)
60 self.tenant_id = str(tenant_id) 85 self.tenant_id = str(tenant_id)
61 self.artifact_root = ensure_dir(artifact_root) 86 self.artifact_root = ensure_dir(artifact_root)
62 - self.labeler_mode = str(labeler_mode).strip().lower() or DEFAULT_LABELER_MODE  
63 self.store = EvalStore(self.artifact_root / "search_eval.sqlite3") 87 self.store = EvalStore(self.artifact_root / "search_eval.sqlite3")
64 self.search_client = SearchServiceClient(search_base_url, self.tenant_id) 88 self.search_client = SearchServiceClient(search_base_url, self.tenant_id)
65 app_cfg = get_app_config() 89 app_cfg = get_app_config()
@@ -71,183 +95,20 @@ class SearchEvaluationFramework: @@ -71,183 +95,20 @@ class SearchEvaluationFramework:
71 api_key = app_cfg.infrastructure.secrets.dashscope_api_key 95 api_key = app_cfg.infrastructure.secrets.dashscope_api_key
72 if not api_key: 96 if not api_key:
73 raise RuntimeError("dashscope_api_key is required for search evaluation annotation") 97 raise RuntimeError("dashscope_api_key is required for search evaluation annotation")
  98 + model = str(judge_model or DEFAULT_JUDGE_MODEL)
  99 + et = DEFAULT_JUDGE_ENABLE_THINKING if enable_thinking is None else enable_thinking
  100 + use_batch = DEFAULT_JUDGE_DASHSCOPE_BATCH if use_dashscope_batch is None else use_dashscope_batch
  101 + batch_window = DEFAULT_JUDGE_BATCH_COMPLETION_WINDOW
  102 + batch_poll = float(DEFAULT_JUDGE_BATCH_POLL_INTERVAL_SEC)
74 self.label_client = DashScopeLabelClient( 103 self.label_client = DashScopeLabelClient(
75 - model=str(llm_cfg["model"]), 104 + model=model,
76 base_url=str(llm_cfg["base_url"]), 105 base_url=str(llm_cfg["base_url"]),
77 api_key=str(api_key), 106 api_key=str(api_key),
  107 + batch_completion_window=batch_window,
  108 + batch_poll_interval_sec=batch_poll,
  109 + enable_thinking=et,
  110 + use_batch=use_batch,
78 ) 111 )
79 - self.query_parser = None  
80 -  
81 - def _get_query_parser(self):  
82 - if self.query_parser is None:  
83 - self.query_parser = get_query_parser()  
84 - return self.query_parser  
85 -  
86 - def build_query_parser_hints(self, query: str) -> Dict[str, Any]:  
87 - parsed = self._get_query_parser().parse(query, generate_vector=False, target_languages=["en", "zh"])  
88 - payload = parsed.to_dict()  
89 - payload["text_for_rerank"] = parsed.text_for_rerank()  
90 - return payload  
91 -  
92 - def get_query_profile(self, query: str, force_refresh: bool = False) -> Dict[str, Any]:  
93 - if self.labeler_mode != "complex":  
94 - raise RuntimeError("query profiles are only used in complex labeler mode")  
95 - if not force_refresh:  
96 - cached = self.store.get_query_profile(self.tenant_id, query, JUDGE_PROMPT_VERSION_COMPLEX)  
97 - if cached is not None:  
98 - return cached  
99 - parser_hints = self.build_query_parser_hints(query)  
100 - profile, raw_response = self.label_client.extract_query_profile(query, parser_hints)  
101 - profile["parser_hints"] = parser_hints  
102 - self.store.upsert_query_profile(  
103 - self.tenant_id,  
104 - query,  
105 - JUDGE_PROMPT_VERSION_COMPLEX,  
106 - self.label_client.model,  
107 - profile,  
108 - raw_response,  
109 - )  
110 - return profile  
111 -  
112 - @staticmethod  
113 - def _doc_evidence_text(doc: Dict[str, Any]) -> str:  
114 - pieces: List[str] = [  
115 - build_display_title(doc),  
116 - pick_text(doc.get("vendor"), "en"),  
117 - pick_text(doc.get("category_path"), "en"),  
118 - pick_text(doc.get("category_name"), "en"),  
119 - ]  
120 - for sku in doc.get("skus") or []:  
121 - pieces.extend(  
122 - [  
123 - str(sku.get("option1_value") or ""),  
124 - str(sku.get("option2_value") or ""),  
125 - str(sku.get("option3_value") or ""),  
126 - ]  
127 - )  
128 - for tag in doc.get("tags") or []:  
129 - pieces.append(str(tag))  
130 - return normalize_text(" | ".join(piece for piece in pieces if piece))  
131 -  
132 - def _apply_rule_based_label_guardrails(  
133 - self,  
134 - label: str,  
135 - query_profile: Dict[str, Any],  
136 - doc: Dict[str, Any],  
137 - ) -> str:  
138 - if label not in VALID_LABELS:  
139 - return label  
140 - evidence = self._doc_evidence_text(doc)  
141 - category = normalize_text(query_profile.get("primary_category"))  
142 - allowed_categories = [normalize_text(item) for item in query_profile.get("allowed_categories") or [] if str(item).strip()]  
143 -  
144 - primary_category_match = True  
145 - if category:  
146 - primary_category_match = category in evidence  
147 - allowed_category_match = True  
148 - if allowed_categories:  
149 - allowed_category_match = any(signal in evidence for signal in allowed_categories)  
150 -  
151 - if label == RELEVANCE_EXACT and not primary_category_match:  
152 - if allowed_category_match:  
153 - label = RELEVANCE_PARTIAL  
154 - else:  
155 - return RELEVANCE_IRRELEVANT  
156 -  
157 - for attr in query_profile.get("required_attributes") or []:  
158 - if not isinstance(attr, dict):  
159 - continue  
160 - attr_name = normalize_text(attr.get("name"))  
161 - if attr_name not in {"color", "fit", "length", "type", "product_type", "material", "size", "gender", "style", "waist_style", "rise"}:  
162 - continue  
163 - required_terms = [normalize_text(item) for item in attr.get("required_terms") or [] if normalize_text(item)]  
164 - conflicting_terms = [normalize_text(item) for item in attr.get("conflicting_terms") or [] if normalize_text(item)]  
165 - if attr_name == "fit":  
166 - if any(term in {"oversized", "oversize"} for term in required_terms):  
167 - conflicting_terms.extend(["slim", "slimming", "fitted", "tight", "close-fitting"])  
168 - if any(term in {"fitted", "slim fit", "tight"} for term in required_terms):  
169 - conflicting_terms.extend(["oversized", "oversize", "loose", "relaxed"])  
170 - has_required = any(term in evidence for term in required_terms) if required_terms else True  
171 - has_conflict = any(term in evidence for term in conflicting_terms)  
172 -  
173 - if has_conflict:  
174 - return RELEVANCE_IRRELEVANT  
175 - if label == RELEVANCE_EXACT and not has_required:  
176 - label = RELEVANCE_PARTIAL  
177 -  
178 - if label == RELEVANCE_PARTIAL and not primary_category_match and not allowed_category_match:  
179 - return RELEVANCE_IRRELEVANT  
180 -  
181 - return label  
182 -  
183 - @staticmethod  
184 - def _result_item_to_doc(item: Dict[str, Any]) -> Dict[str, Any]:  
185 - option_values = list(item.get("option_values") or [])  
186 - while len(option_values) < 3:  
187 - option_values.append("")  
188 - product = dict(item.get("product") or {})  
189 - return {  
190 - "spu_id": item.get("spu_id"),  
191 - "title": product.get("title") or item.get("title"),  
192 - "vendor": product.get("vendor"),  
193 - "category_path": product.get("category"),  
194 - "category_name": product.get("category"),  
195 - "image_url": item.get("image_url") or product.get("image_url"),  
196 - "tags": product.get("tags") or [],  
197 - "skus": [  
198 - {  
199 - "option1_value": option_values[0],  
200 - "option2_value": option_values[1],  
201 - "option3_value": option_values[2],  
202 - }  
203 - ],  
204 - }  
205 -  
206 - def _collect_label_issues(  
207 - self,  
208 - label: str,  
209 - query_profile: Dict[str, Any],  
210 - doc: Dict[str, Any],  
211 - ) -> List[str]:  
212 - evidence = self._doc_evidence_text(doc)  
213 - issues: List[str] = []  
214 - category = normalize_text(query_profile.get("primary_category"))  
215 - allowed_categories = [  
216 - normalize_text(item)  
217 - for item in query_profile.get("allowed_categories") or []  
218 - if str(item).strip()  
219 - ]  
220 -  
221 - primary_category_match = True if not category else category in evidence  
222 - allowed_category_match = False if allowed_categories else primary_category_match  
223 - if allowed_categories:  
224 - allowed_category_match = any(signal in evidence for signal in allowed_categories)  
225 -  
226 - if label == RELEVANCE_EXACT and not primary_category_match:  
227 - if allowed_category_match:  
228 - issues.append("Exact missing primary category evidence")  
229 - else:  
230 - issues.append("Exact has category mismatch")  
231 -  
232 - if label == RELEVANCE_PARTIAL and not primary_category_match and not allowed_category_match:  
233 - issues.append("Partial has category mismatch")  
234 -  
235 - for attr in query_profile.get("required_attributes") or []:  
236 - if not isinstance(attr, dict):  
237 - continue  
238 - attr_name = normalize_text(attr.get("name"))  
239 - if attr_name not in {"color", "fit", "length", "type", "product_type", "material", "size", "gender", "style"}:  
240 - continue  
241 - required_terms = [normalize_text(item) for item in attr.get("required_terms") or [] if normalize_text(item)]  
242 - conflicting_terms = [normalize_text(item) for item in attr.get("conflicting_terms") or [] if normalize_text(item)]  
243 - has_required = any(term in evidence for term in required_terms) if required_terms else True  
244 - has_conflict = any(term in evidence for term in conflicting_terms)  
245 -  
246 - if has_conflict and label != RELEVANCE_IRRELEVANT:  
247 - issues.append(f"{label} conflicts on {attr_name}")  
248 - if label == RELEVANCE_EXACT and not has_required:  
249 - issues.append(f"Exact missing {attr_name}")  
250 - return issues  
251 112
252 def audit_live_query( 113 def audit_live_query(
253 self, 114 self,
@@ -258,42 +119,6 @@ class SearchEvaluationFramework: @@ -258,42 +119,6 @@ class SearchEvaluationFramework:
258 auto_annotate: bool = False, 119 auto_annotate: bool = False,
259 ) -> Dict[str, Any]: 120 ) -> Dict[str, Any]:
260 live = self.evaluate_live_query(query=query, top_k=top_k, auto_annotate=auto_annotate, language=language) 121 live = self.evaluate_live_query(query=query, top_k=top_k, auto_annotate=auto_annotate, language=language)
261 - if self.labeler_mode != "complex":  
262 - labels = [  
263 - item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT  
264 - for item in live["results"]  
265 - ]  
266 - return {  
267 - "query": query,  
268 - "tenant_id": self.tenant_id,  
269 - "top_k": top_k,  
270 - "metrics": live["metrics"],  
271 - "distribution": label_distribution(labels),  
272 - "query_profile": None,  
273 - "suspicious": [],  
274 - "results": live["results"],  
275 - }  
276 - query_profile = self.get_query_profile(query, force_refresh=False)  
277 - suspicious: List[Dict[str, Any]] = []  
278 -  
279 - for item in live["results"]:  
280 - doc = self._result_item_to_doc(item)  
281 - issues = self._collect_label_issues(item["label"] or "", query_profile, doc)  
282 - suggested_label = self._apply_rule_based_label_guardrails(item["label"] or "", query_profile, doc)  
283 - if suggested_label != (item["label"] or ""):  
284 - issues = list(issues) + [f"Suggested relabel: {item['label']} -> {suggested_label}"]  
285 - if issues:  
286 - suspicious.append(  
287 - {  
288 - "rank": item["rank"],  
289 - "spu_id": item["spu_id"],  
290 - "title": item["title"],  
291 - "label": item["label"],  
292 - "suggested_label": suggested_label,  
293 - "issues": issues,  
294 - }  
295 - )  
296 -  
297 labels = [ 122 labels = [
298 item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT 123 item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT
299 for item in live["results"] 124 for item in live["results"]
@@ -304,8 +129,8 @@ class SearchEvaluationFramework: @@ -304,8 +129,8 @@ class SearchEvaluationFramework:
304 "top_k": top_k, 129 "top_k": top_k,
305 "metrics": live["metrics"], 130 "metrics": live["metrics"],
306 "distribution": label_distribution(labels), 131 "distribution": label_distribution(labels),
307 - "query_profile": query_profile,  
308 - "suspicious": suspicious, 132 + "query_profile": None,
  133 + "suspicious": [],
309 "results": live["results"], 134 "results": live["results"],
310 } 135 }
311 136
@@ -485,15 +310,7 @@ class SearchEvaluationFramework: @@ -485,15 +310,7 @@ class SearchEvaluationFramework:
485 if not docs: 310 if not docs:
486 return [] 311 return []
487 try: 312 try:
488 - if self.labeler_mode == "complex":  
489 - query_profile = self.get_query_profile(query, force_refresh=force_refresh)  
490 - labels, raw_response = self.label_client.classify_batch_complex(query, query_profile, docs)  
491 - labels = [  
492 - self._apply_rule_based_label_guardrails(label, query_profile, doc)  
493 - for doc, label in zip(docs, labels)  
494 - ]  
495 - else:  
496 - labels, raw_response = self.label_client.classify_batch_simple(query, docs) 313 + labels, raw_response = self.label_client.classify_batch(query, docs)
497 return [(labels, raw_response, docs)] 314 return [(labels, raw_response, docs)]
498 except Exception: 315 except Exception:
499 if len(docs) == 1: 316 if len(docs) == 1:
@@ -510,10 +327,28 @@ class SearchEvaluationFramework: @@ -510,10 +327,28 @@ class SearchEvaluationFramework:
510 min_batches: int = DEFAULT_REBUILD_MIN_LLM_BATCHES, 327 min_batches: int = DEFAULT_REBUILD_MIN_LLM_BATCHES,
511 max_batches: int = DEFAULT_REBUILD_MAX_LLM_BATCHES, 328 max_batches: int = DEFAULT_REBUILD_MAX_LLM_BATCHES,
512 irrelevant_stop_ratio: float = DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO, 329 irrelevant_stop_ratio: float = DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO,
  330 + irrelevant_low_combined_stop_ratio: float = DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO,
513 stop_streak: int = DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK, 331 stop_streak: int = DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK,
514 force_refresh: bool = True, 332 force_refresh: bool = True,
515 ) -> Tuple[Dict[str, str], List[Dict[str, Any]]]: 333 ) -> Tuple[Dict[str, str], List[Dict[str, Any]]]:
516 - """LLM-label ``ordered_docs`` in fixed-size batches with early stop after enough irrelevant-heavy batches.""" 334 + """LLM-label ``ordered_docs`` in fixed-size batches along list order.
  335 +
  336 + **Early stop** (only after ``min_batches`` full batches have completed):
  337 +
  338 + Per batch, let *n* = batch size, and count labels among docs in that batch only.
  339 +
  340 + - *bad batch* iff there is **no** ``Exact Match`` in the batch **and** at least one of:
  341 +
  342 + - ``irrelevant_ratio = #(Irrelevant)/n >= irrelevant_stop_ratio`` (default 0.94), or
  343 + - ``( #(Irrelevant) + #(Low Relevant) ) / n >= irrelevant_low_combined_stop_ratio``
  344 + (default 0.96; weak relevance = ``RELEVANCE_LOW``).
  345 +
  346 + Maintain a streak of consecutive *bad* batches; any non-bad batch resets the streak to 0.
  347 + Stop labeling when ``streak >= stop_streak`` (default 2) or when ``max_batches`` is reached
  348 + or the ordered list is exhausted.
  349 +
  350 + Constants for defaults: ``eval_framework.constants`` (``DEFAULT_REBUILD_*``).
  351 + """
517 batch_logs: List[Dict[str, Any]] = [] 352 batch_logs: List[Dict[str, Any]] = []
518 streak = 0 353 streak = 0
519 labels: Dict[str, str] = dict(self.store.get_labels(self.tenant_id, query)) 354 labels: Dict[str, str] = dict(self.store.get_labels(self.tenant_id, query))
@@ -541,32 +376,46 @@ class SearchEvaluationFramework: @@ -541,32 +376,46 @@ class SearchEvaluationFramework:
541 n = len(batch_docs) 376 n = len(batch_docs)
542 exact_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_EXACT) 377 exact_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_EXACT)
543 irrel_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_IRRELEVANT) 378 irrel_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_IRRELEVANT)
  379 + low_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_LOW)
544 exact_ratio = exact_n / n if n else 0.0 380 exact_ratio = exact_n / n if n else 0.0
545 irrelevant_ratio = irrel_n / n if n else 0.0 381 irrelevant_ratio = irrel_n / n if n else 0.0
  382 + low_ratio = low_n / n if n else 0.0
  383 + irrel_low_ratio = (irrel_n + low_n) / n if n else 0.0
546 log_entry = { 384 log_entry = {
547 "batch_index": batch_idx + 1, 385 "batch_index": batch_idx + 1,
548 "size": n, 386 "size": n,
549 "exact_ratio": round(exact_ratio, 6), 387 "exact_ratio": round(exact_ratio, 6),
550 "irrelevant_ratio": round(irrelevant_ratio, 6), 388 "irrelevant_ratio": round(irrelevant_ratio, 6),
  389 + "low_ratio": round(low_ratio, 6),
  390 + "irrelevant_plus_low_ratio": round(irrel_low_ratio, 6),
551 "offset_start": start, 391 "offset_start": start,
552 "offset_end": min(start + n, total_ordered), 392 "offset_end": min(start + n, total_ordered),
553 } 393 }
554 batch_logs.append(log_entry) 394 batch_logs.append(log_entry)
555 print( 395 print(
556 f"[eval-rebuild] query={query!r} llm_batch={batch_idx + 1}/{max_batches} " 396 f"[eval-rebuild] query={query!r} llm_batch={batch_idx + 1}/{max_batches} "
557 - f"size={n} exact_ratio={exact_ratio:.4f} irrelevant_ratio={irrelevant_ratio:.4f}", 397 + f"size={n} exact_ratio={exact_ratio:.4f} irrelevant_ratio={irrelevant_ratio:.4f} "
  398 + f"irrel_plus_low_ratio={irrel_low_ratio:.4f}",
558 flush=True, 399 flush=True,
559 ) 400 )
560 401
  402 + # Early-stop streak: only evaluated after min_batches (warm-up before trusting tail quality).
561 if batch_idx + 1 >= min_batches: 403 if batch_idx + 1 >= min_batches:
562 - if irrelevant_ratio > irrelevant_stop_ratio: 404 + no_exact = exact_n == 0
  405 + # Branch 1: high Irrelevant share, no Exact in this batch.
  406 + heavy_irrel = irrelevant_ratio >= irrelevant_stop_ratio
  407 + # Branch 2: Irrelevant + Low Relevant combined share, still no Exact.
  408 + heavy_irrel_low = irrel_low_ratio >= irrelevant_low_combined_stop_ratio
  409 + bad_batch = no_exact and (heavy_irrel or heavy_irrel_low)
  410 + if bad_batch:
563 streak += 1 411 streak += 1
564 else: 412 else:
565 streak = 0 413 streak = 0
566 if streak >= stop_streak: 414 if streak >= stop_streak:
567 print( 415 print(
568 f"[eval-rebuild] query={query!r} early_stop after {batch_idx + 1} batches " 416 f"[eval-rebuild] query={query!r} early_stop after {batch_idx + 1} batches "
569 - f"({stop_streak} consecutive batches with irrelevant_ratio > {irrelevant_stop_ratio})", 417 + f"({stop_streak} consecutive batches: no Exact and "
  418 + f"(irrelevant>={irrelevant_stop_ratio} or irrel+low>={irrelevant_low_combined_stop_ratio}))",
570 flush=True, 419 flush=True,
571 ) 420 )
572 break 421 break
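The bad-batch / streak rule in this hunk can be sketched standalone. This is a minimal illustration, not the framework code: the constant names and default values below are assumptions mirroring `eval_framework.constants` (`DEFAULT_REBUILD_*`), and per-batch label counts are passed in as plain dicts instead of being derived from `labels` / `spu_id` lookups.

```python
from typing import Dict, List

# Assumed defaults, mirroring the DEFAULT_REBUILD_* constants described in the diff.
IRRELEVANT_STOP_RATIO = 0.94
IRREL_LOW_COMBINED_STOP_RATIO = 0.96
STOP_STREAK = 2
MIN_BATCHES = 10

def should_early_stop(batch_counts: List[Dict[str, int]]) -> bool:
    """Return True once STOP_STREAK consecutive bad batches occur after the
    warm-up of MIN_BATCHES. Each entry holds per-batch counts:
    n (batch size), exact, irrel, low."""
    streak = 0
    for idx, c in enumerate(batch_counts):
        n = c["n"] or 1
        irrel_ratio = c["irrel"] / n
        irrel_low_ratio = (c["irrel"] + c["low"]) / n
        # Bad batch: no Exact Match AND (Irrelevant-heavy OR Irrelevant+Low-heavy).
        bad = c["exact"] == 0 and (
            irrel_ratio >= IRRELEVANT_STOP_RATIO
            or irrel_low_ratio >= IRREL_LOW_COMBINED_STOP_RATIO
        )
        # The streak is only maintained once min_batches full batches have run;
        # any non-bad batch resets it to 0.
        if idx + 1 >= MIN_BATCHES:
            streak = streak + 1 if bad else 0
            if streak >= STOP_STREAK:
                return True
    return False
```

Note the two independent triggers: a batch with 48/50 Irrelevant trips the first branch, while a batch with 40 Irrelevant plus 8 Low Relevant (0.96 combined) trips the second even though its pure-Irrelevant ratio is only 0.80.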
@@ -591,8 +440,19 @@ class SearchEvaluationFramework: @@ -591,8 +440,19 @@ class SearchEvaluationFramework:
591 rebuild_min_batches: int = DEFAULT_REBUILD_MIN_LLM_BATCHES, 440 rebuild_min_batches: int = DEFAULT_REBUILD_MIN_LLM_BATCHES,
592 rebuild_max_batches: int = DEFAULT_REBUILD_MAX_LLM_BATCHES, 441 rebuild_max_batches: int = DEFAULT_REBUILD_MAX_LLM_BATCHES,
593 rebuild_irrelevant_stop_ratio: float = DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO, 442 rebuild_irrelevant_stop_ratio: float = DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO,
  443 + rebuild_irrel_low_combined_stop_ratio: float = DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO,
594 rebuild_irrelevant_stop_streak: int = DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK, 444 rebuild_irrelevant_stop_streak: int = DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK,
595 ) -> QueryBuildResult: 445 ) -> QueryBuildResult:
  446 + """Build per-query annotation pool and write ``query_builds/*.json``.
  447 +
  448 + Normal mode unions search + rerank windows and fills missing labels once.
  449 +
  450 + **Rebuild mode** (``force_refresh_labels=True``): full recall pool + corpus rerank outside
  451 + pool, optional skip for "easy" queries, then batched LLM labeling with **early stop**;
  452 + see ``_build_query_annotation_set_rebuild`` and ``_annotate_rebuild_batches`` (docstring
  453 + spells out the bad-batch / streak rule). Rebuild tuning knobs: ``rebuild_*`` and
  454 + ``search_recall_top_k`` parameters below; CLI mirrors them under ``build --force-refresh-labels``.
  455 + """
596 if force_refresh_labels: 456 if force_refresh_labels:
597 return self._build_query_annotation_set_rebuild( 457 return self._build_query_annotation_set_rebuild(
598 query=query, 458 query=query,
@@ -607,6 +467,7 @@ class SearchEvaluationFramework: @@ -607,6 +467,7 @@ class SearchEvaluationFramework:
607 rebuild_min_batches=rebuild_min_batches, 467 rebuild_min_batches=rebuild_min_batches,
608 rebuild_max_batches=rebuild_max_batches, 468 rebuild_max_batches=rebuild_max_batches,
609 rebuild_irrelevant_stop_ratio=rebuild_irrelevant_stop_ratio, 469 rebuild_irrelevant_stop_ratio=rebuild_irrelevant_stop_ratio,
  470 + rebuild_irrel_low_combined_stop_ratio=rebuild_irrel_low_combined_stop_ratio,
610 rebuild_irrelevant_stop_streak=rebuild_irrelevant_stop_streak, 471 rebuild_irrelevant_stop_streak=rebuild_irrelevant_stop_streak,
611 ) 472 )
612 473
@@ -691,8 +552,6 @@ class SearchEvaluationFramework: @@ -691,8 +552,6 @@ class SearchEvaluationFramework:
691 "annotate_rerank_top_k": annotate_rerank_top_k, 552 "annotate_rerank_top_k": annotate_rerank_top_k,
692 "pool_size": len(pool_docs), 553 "pool_size": len(pool_docs),
693 }, 554 },
694 - "labeler_mode": self.labeler_mode,  
695 - "query_profile": self.get_query_profile(query, force_refresh=force_refresh_labels) if self.labeler_mode == "complex" else None,  
696 "metrics_top100": metrics, 555 "metrics_top100": metrics,
697 "search_results": search_labeled_results, 556 "search_results": search_labeled_results,
698 "full_rerank_top": rerank_top_results, 557 "full_rerank_top": rerank_top_results,
@@ -724,6 +583,7 @@ class SearchEvaluationFramework: @@ -724,6 +583,7 @@ class SearchEvaluationFramework:
724 rebuild_min_batches: int, 583 rebuild_min_batches: int,
725 rebuild_max_batches: int, 584 rebuild_max_batches: int,
726 rebuild_irrelevant_stop_ratio: float, 585 rebuild_irrelevant_stop_ratio: float,
  586 + rebuild_irrel_low_combined_stop_ratio: float,
727 rebuild_irrelevant_stop_streak: int, 587 rebuild_irrelevant_stop_streak: int,
728 ) -> QueryBuildResult: 588 ) -> QueryBuildResult:
729 search_size = max(int(search_depth), int(search_recall_top_k)) 589 search_size = max(int(search_depth), int(search_recall_top_k))
@@ -756,6 +616,7 @@ class SearchEvaluationFramework: @@ -756,6 +616,7 @@ class SearchEvaluationFramework:
756 "rebuild_min_batches": rebuild_min_batches, 616 "rebuild_min_batches": rebuild_min_batches,
757 "rebuild_max_batches": rebuild_max_batches, 617 "rebuild_max_batches": rebuild_max_batches,
758 "rebuild_irrelevant_stop_ratio": rebuild_irrelevant_stop_ratio, 618 "rebuild_irrelevant_stop_ratio": rebuild_irrelevant_stop_ratio,
  619 + "rebuild_irrel_low_combined_stop_ratio": rebuild_irrel_low_combined_stop_ratio,
759 "rebuild_irrelevant_stop_streak": rebuild_irrelevant_stop_streak, 620 "rebuild_irrelevant_stop_streak": rebuild_irrelevant_stop_streak,
760 } 621 }
761 622
@@ -797,6 +658,7 @@ class SearchEvaluationFramework: @@ -797,6 +658,7 @@ class SearchEvaluationFramework:
797 min_batches=rebuild_min_batches, 658 min_batches=rebuild_min_batches,
798 max_batches=rebuild_max_batches, 659 max_batches=rebuild_max_batches,
799 irrelevant_stop_ratio=rebuild_irrelevant_stop_ratio, 660 irrelevant_stop_ratio=rebuild_irrelevant_stop_ratio,
  661 + irrelevant_low_combined_stop_ratio=rebuild_irrel_low_combined_stop_ratio,
800 stop_streak=rebuild_irrelevant_stop_streak, 662 stop_streak=rebuild_irrelevant_stop_streak,
801 force_refresh=True, 663 force_refresh=True,
802 ) 664 )
@@ -867,8 +729,6 @@ class SearchEvaluationFramework: @@ -867,8 +729,6 @@ class SearchEvaluationFramework:
867 "rebuild": rebuild_meta, 729 "rebuild": rebuild_meta,
868 "ordered_union_size": pool_docs_count, 730 "ordered_union_size": pool_docs_count,
869 }, 731 },
870 - "labeler_mode": self.labeler_mode,  
871 - "query_profile": self.get_query_profile(query, force_refresh=False) if self.labeler_mode == "complex" else None,  
872 "metrics_top100": metrics, 732 "metrics_top100": metrics,
873 "search_results": search_labeled_results, 733 "search_results": search_labeled_results,
874 "full_rerank_top": rerank_top_results, 734 "full_rerank_top": rerank_top_results,
@@ -893,7 +753,10 @@ class SearchEvaluationFramework: @@ -893,7 +753,10 @@ class SearchEvaluationFramework:
893 language: str = "en", 753 language: str = "en",
894 force_refresh_labels: bool = False, 754 force_refresh_labels: bool = False,
895 ) -> Dict[str, Any]: 755 ) -> Dict[str, Any]:
896 - search_payload = self.search_client.search(query=query, size=max(top_k, 100), from_=0, language=language) 756 + search_payload = self.search_client.search(
  757 + query=query, size=max(top_k, 100), from_=0, language=language, debug=True
  758 + )
  759 + zh_by_spu = _zh_titles_from_debug_per_result(search_payload.get("debug_info"))
897 results = list(search_payload.get("results") or []) 760 results = list(search_payload.get("results") or [])
898 if auto_annotate: 761 if auto_annotate:
899 self.annotate_missing_labels(query=query, docs=results[:top_k], force_refresh=force_refresh_labels) 762 self.annotate_missing_labels(query=query, docs=results[:top_k], force_refresh=force_refresh_labels)
@@ -906,11 +769,16 @@ class SearchEvaluationFramework: @@ -906,11 +769,16 @@ class SearchEvaluationFramework:
906 label = labels.get(spu_id) 769 label = labels.get(spu_id)
907 if label not in VALID_LABELS: 770 if label not in VALID_LABELS:
908 unlabeled_hits += 1 771 unlabeled_hits += 1
  772 + primary_title = build_display_title(doc)
  773 + title_zh = zh_by_spu.get(spu_id) or ""
  774 + if not title_zh and isinstance(doc.get("title"), dict):
  775 + title_zh = zh_title_from_multilingual(doc.get("title"))
909 labeled.append( 776 labeled.append(
910 { 777 {
911 "rank": rank, 778 "rank": rank,
912 "spu_id": spu_id, 779 "spu_id": spu_id,
913 - "title": build_display_title(doc), 780 + "title": primary_title,
  781 + "title_zh": title_zh if title_zh and title_zh != primary_title else "",
914 "image_url": doc.get("image_url"), 782 "image_url": doc.get("image_url"),
915 "label": label, 783 "label": label,
916 "option_values": list(compact_option_values(doc.get("skus") or [])), 784 "option_values": list(compact_option_values(doc.get("skus") or [])),
@@ -926,7 +794,7 @@ class SearchEvaluationFramework: @@ -926,7 +794,7 @@ class SearchEvaluationFramework:
926 relevant_missing_ids = [ 794 relevant_missing_ids = [
927 spu_id 795 spu_id
928 for spu_id, label in labels.items() 796 for spu_id, label in labels.items()
929 - if label in {RELEVANCE_EXACT, RELEVANCE_PARTIAL} and spu_id not in recalled_spu_ids 797 + if label in RELEVANCE_NON_IRRELEVANT and spu_id not in recalled_spu_ids
930 ] 798 ]
931 missing_docs_map = self.store.get_corpus_docs_by_spu_ids(self.tenant_id, relevant_missing_ids) 799 missing_docs_map = self.store.get_corpus_docs_by_spu_ids(self.tenant_id, relevant_missing_ids)
932 missing_relevant = [] 800 missing_relevant = []
@@ -934,18 +802,26 @@ class SearchEvaluationFramework: @@ -934,18 +802,26 @@ class SearchEvaluationFramework:
934 doc = missing_docs_map.get(spu_id) 802 doc = missing_docs_map.get(spu_id)
935 if not doc: 803 if not doc:
936 continue 804 continue
  805 + miss_title = build_display_title(doc)
  806 + miss_zh = zh_title_from_multilingual(doc.get("title")) if isinstance(doc.get("title"), dict) else ""
937 missing_relevant.append( 807 missing_relevant.append(
938 { 808 {
939 "spu_id": spu_id, 809 "spu_id": spu_id,
940 "label": labels[spu_id], 810 "label": labels[spu_id],
941 "rerank_score": rerank_scores.get(spu_id), 811 "rerank_score": rerank_scores.get(spu_id),
942 - "title": build_display_title(doc), 812 + "title": miss_title,
  813 + "title_zh": miss_zh if miss_zh and miss_zh != miss_title else "",
943 "image_url": doc.get("image_url"), 814 "image_url": doc.get("image_url"),
944 "option_values": list(compact_option_values(doc.get("skus") or [])), 815 "option_values": list(compact_option_values(doc.get("skus") or [])),
945 "product": compact_product_payload(doc), 816 "product": compact_product_payload(doc),
946 } 817 }
947 ) 818 )
948 - label_order = {RELEVANCE_EXACT: 0, RELEVANCE_PARTIAL: 1, RELEVANCE_IRRELEVANT: 2} 819 + label_order = {
  820 + RELEVANCE_EXACT: 0,
  821 + RELEVANCE_HIGH: 1,
  822 + RELEVANCE_LOW: 2,
  823 + RELEVANCE_IRRELEVANT: 3,
  824 + }
949 missing_relevant.sort( 825 missing_relevant.sort(
950 key=lambda item: ( 826 key=lambda item: (
951 label_order.get(str(item.get("label")), 9), 827 label_order.get(str(item.get("label")), 9),
@@ -963,7 +839,7 @@ class SearchEvaluationFramework: @@ -963,7 +839,7 @@ class SearchEvaluationFramework:
963 if unlabeled_hits: 839 if unlabeled_hits:
964 tips.append(f"{unlabeled_hits} recalled results were not in the annotation set and were counted as Irrelevant.") 840 tips.append(f"{unlabeled_hits} recalled results were not in the annotation set and were counted as Irrelevant.")
965 if not missing_relevant: 841 if not missing_relevant:
966 - tips.append("No cached Exact/Partial products were missed by this recall set.") 842 + tips.append("No cached non-irrelevant products were missed by this recall set.")
967 return { 843 return {
968 "query": query, 844 "query": query,
969 "tenant_id": self.tenant_id, 845 "tenant_id": self.tenant_id,
@@ -977,7 +853,8 @@ class SearchEvaluationFramework: @@ -977,7 +853,8 @@ class SearchEvaluationFramework:
977 "recalled_hits": len(labeled), 853 "recalled_hits": len(labeled),
978 "missing_relevant_count": len(missing_relevant), 854 "missing_relevant_count": len(missing_relevant),
979 "missing_exact_count": sum(1 for item in missing_relevant if item["label"] == RELEVANCE_EXACT), 855 "missing_exact_count": sum(1 for item in missing_relevant if item["label"] == RELEVANCE_EXACT),
980 - "missing_partial_count": sum(1 for item in missing_relevant if item["label"] == RELEVANCE_PARTIAL), 856 + "missing_high_count": sum(1 for item in missing_relevant if item["label"] == RELEVANCE_HIGH),
  857 + "missing_low_count": sum(1 for item in missing_relevant if item["label"] == RELEVANCE_LOW),
981 }, 858 },
982 "tips": tips, 859 "tips": tips,
983 "total": int(search_payload.get("total") or 0), 860 "total": int(search_payload.get("total") or 0),
@@ -1018,7 +895,8 @@ class SearchEvaluationFramework: @@ -1018,7 +895,8 @@ class SearchEvaluationFramework:
1018 aggregate = aggregate_metrics([item["metrics"] for item in per_query]) 895 aggregate = aggregate_metrics([item["metrics"] for item in per_query])
1019 aggregate_distribution = { 896 aggregate_distribution = {
1020 RELEVANCE_EXACT: sum(item["distribution"][RELEVANCE_EXACT] for item in per_query), 897 RELEVANCE_EXACT: sum(item["distribution"][RELEVANCE_EXACT] for item in per_query),
1021 - RELEVANCE_PARTIAL: sum(item["distribution"][RELEVANCE_PARTIAL] for item in per_query), 898 + RELEVANCE_HIGH: sum(item["distribution"][RELEVANCE_HIGH] for item in per_query),
  899 + RELEVANCE_LOW: sum(item["distribution"][RELEVANCE_LOW] for item in per_query),
1022 RELEVANCE_IRRELEVANT: sum(item["distribution"][RELEVANCE_IRRELEVANT] for item in per_query), 900 RELEVANCE_IRRELEVANT: sum(item["distribution"][RELEVANCE_IRRELEVANT] for item in per_query),
1023 } 901 }
1024 batch_id = f"batch_{utc_timestamp()}_{sha1_text(self.tenant_id + '|' + '|'.join(queries))[:10]}" 902 batch_id = f"batch_{utc_timestamp()}_{sha1_text(self.tenant_id + '|' + '|'.join(queries))[:10]}"
scripts/evaluation/eval_framework/metrics.py
@@ -4,7 +4,7 @@ from __future__ import annotations @@ -4,7 +4,7 @@ from __future__ import annotations
4 4
5 from typing import Dict, Sequence 5 from typing import Dict, Sequence
6 6
7 -from .constants import RELEVANCE_EXACT, RELEVANCE_IRRELEVANT, RELEVANCE_PARTIAL 7 +from .constants import RELEVANCE_EXACT, RELEVANCE_IRRELEVANT, RELEVANCE_HIGH, RELEVANCE_LOW, RELEVANCE_NON_IRRELEVANT
8 8
9 9
10 def precision_at_k(labels: Sequence[str], k: int, relevant: Sequence[str]) -> float: 10 def precision_at_k(labels: Sequence[str], k: int, relevant: Sequence[str]) -> float:
@@ -13,15 +13,17 @@ def precision_at_k(labels: Sequence[str], k: int, relevant: Sequence[str]) -> fl @@ -13,15 +13,17 @@ def precision_at_k(labels: Sequence[str], k: int, relevant: Sequence[str]) -> fl
13 sliced = list(labels[:k]) 13 sliced = list(labels[:k])
14 if not sliced: 14 if not sliced:
15 return 0.0 15 return 0.0
16 - hits = sum(1 for label in sliced if label in relevant) 16 + rel = set(relevant)
  17 + hits = sum(1 for label in sliced if label in rel)
17 return hits / float(min(k, len(sliced))) 18 return hits / float(min(k, len(sliced)))
18 19
19 20
20 def average_precision(labels: Sequence[str], relevant: Sequence[str]) -> float: 21 def average_precision(labels: Sequence[str], relevant: Sequence[str]) -> float:
  22 + rel = set(relevant)
21 hit_count = 0 23 hit_count = 0
22 precision_sum = 0.0 24 precision_sum = 0.0
23 for idx, label in enumerate(labels, start=1): 25 for idx, label in enumerate(labels, start=1):
24 - if label not in relevant: 26 + if label not in rel:
25 continue 27 continue
26 hit_count += 1 28 hit_count += 1
27 precision_sum += hit_count / idx 29 precision_sum += hit_count / idx
@@ -31,12 +33,14 @@ def average_precision(labels: Sequence[str], relevant: Sequence[str]) -> float: @@ -31,12 +33,14 @@ def average_precision(labels: Sequence[str], relevant: Sequence[str]) -> float:
31 33
32 34
33 def compute_query_metrics(labels: Sequence[str]) -> Dict[str, float]: 35 def compute_query_metrics(labels: Sequence[str]) -> Dict[str, float]:
  36 + """P@k / MAP_3: Exact Match only. P@k_2_3 / MAP_2_3: any non-irrelevant tier (legacy metric names)."""
34 metrics: Dict[str, float] = {} 37 metrics: Dict[str, float] = {}
  38 + non_irrel = list(RELEVANCE_NON_IRRELEVANT)
35 for k in (5, 10, 20, 50): 39 for k in (5, 10, 20, 50):
36 metrics[f"P@{k}"] = round(precision_at_k(labels, k, [RELEVANCE_EXACT]), 6) 40 metrics[f"P@{k}"] = round(precision_at_k(labels, k, [RELEVANCE_EXACT]), 6)
37 - metrics[f"P@{k}_2_3"] = round(precision_at_k(labels, k, [RELEVANCE_EXACT, RELEVANCE_PARTIAL]), 6) 41 + metrics[f"P@{k}_2_3"] = round(precision_at_k(labels, k, non_irrel), 6)
38 metrics["MAP_3"] = round(average_precision(labels, [RELEVANCE_EXACT]), 6) 42 metrics["MAP_3"] = round(average_precision(labels, [RELEVANCE_EXACT]), 6)
39 - metrics["MAP_2_3"] = round(average_precision(labels, [RELEVANCE_EXACT, RELEVANCE_PARTIAL]), 6) 43 + metrics["MAP_2_3"] = round(average_precision(labels, non_irrel), 6)
40 return metrics 44 return metrics
41 45
42 46
@@ -53,6 +57,7 @@ def aggregate_metrics(metric_items: Sequence[Dict[str, float]]) -> Dict[str, flo @@ -53,6 +57,7 @@ def aggregate_metrics(metric_items: Sequence[Dict[str, float]]) -> Dict[str, flo
53 def label_distribution(labels: Sequence[str]) -> Dict[str, int]: 57 def label_distribution(labels: Sequence[str]) -> Dict[str, int]:
54 return { 58 return {
55 RELEVANCE_EXACT: sum(1 for label in labels if label == RELEVANCE_EXACT), 59 RELEVANCE_EXACT: sum(1 for label in labels if label == RELEVANCE_EXACT),
56 - RELEVANCE_PARTIAL: sum(1 for label in labels if label == RELEVANCE_PARTIAL), 60 + RELEVANCE_HIGH: sum(1 for label in labels if label == RELEVANCE_HIGH),
  61 + RELEVANCE_LOW: sum(1 for label in labels if label == RELEVANCE_LOW),
57 RELEVANCE_IRRELEVANT: sum(1 for label in labels if label == RELEVANCE_IRRELEVANT), 62 RELEVANCE_IRRELEVANT: sum(1 for label in labels if label == RELEVANCE_IRRELEVANT),
58 } 63 }
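The metrics hunk above is easy to sanity-check in isolation. Below is a minimal, self-contained sketch of the changed helper and the two metric tiers; the label strings and the `RELEVANCE_NON_IRRELEVANT` tuple are assumptions mirroring the constants the diff references, not a copy of `constants.py`:

```python
from typing import Sequence

# Assumed label constants, mirroring what the diff suggests constants.py now defines.
RELEVANCE_EXACT = "Exact Match"
RELEVANCE_HIGH = "High Relevant"
RELEVANCE_LOW = "Low Relevant"
RELEVANCE_IRRELEVANT = "Irrelevant"
RELEVANCE_NON_IRRELEVANT = (RELEVANCE_EXACT, RELEVANCE_HIGH, RELEVANCE_LOW)

def precision_at_k(labels: Sequence[str], k: int, relevant: Sequence[str]) -> float:
    sliced = list(labels[:k])
    if not sliced:
        return 0.0
    rel = set(relevant)  # O(1) membership test instead of scanning the list per label
    hits = sum(1 for label in sliced if label in rel)
    return hits / float(min(k, len(sliced)))

labels = [RELEVANCE_EXACT, RELEVANCE_LOW, RELEVANCE_HIGH, RELEVANCE_IRRELEVANT, RELEVANCE_EXACT]
# Strict tier (P@k): only Exact Match counts.
p5_strict = precision_at_k(labels, 5, [RELEVANCE_EXACT])          # 2/5 = 0.4
# Lenient tier (P@k_2_3): any non-irrelevant label counts.
p5_lenient = precision_at_k(labels, 5, RELEVANCE_NON_IRRELEVANT)  # 4/5 = 0.8
```

The set conversion only changes asymptotic cost, not results, which is why the `P@k` values in existing reports stay comparable across this change.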
scripts/evaluation/eval_framework/prompts.py
@@ -2,84 +2,139 @@ @@ -2,84 +2,139 @@
2 2
3 from __future__ import annotations 3 from __future__ import annotations
4 4
5 -import json  
6 -from typing import Any, Dict, Sequence 5 +from typing import Sequence
7 6
8 -_CLASSIFY_BATCH_SIMPLE_TEMPLATE = """You are a relevance evaluation assistant for an apparel e-commerce search system.  
9 -Given the user query and each product's information, assign one relevance label to each product. 7 +_CLASSIFY_TEMPLATE_EN = """You are a relevance judgment assistant for a fashion e-commerce search system.
  8 +Given a user query and the information for each product, assign a relevance label to each product.
  9 +
  10 +Your goal is to judge relevance from the perspective of e-commerce search ranking.
  11 +The key question is whether the user would view the product as the intended item, or as an acceptable substitute.
10 12
11 ## Relevance Labels 13 ## Relevance Labels
12 14
13 -### Exact  
14 -The product fully satisfies the user’s search intent: the core product type matches, all explicitly stated key attributes are supported by the product information. 15 +### Exact Match
  16 +The product satisfies the user’s core shopping intent: the core product type matches, and all explicitly stated key attributes in the query are supported by the product information, with no obvious conflict.
15 17
16 Typical use cases: 18 Typical use cases:
17 - The query contains only a product type, and the product is exactly that type. 19 - The query contains only a product type, and the product is exactly that type.
18 -- The query contains product type + attributes, and the product matches both the type and all explicitly stated attributes. 20 +- The query contains “product type + attributes”, and the product matches both the type and all explicitly stated attributes.
19 21
20 -### Partial  
21 -The product satisfies the user's primary intent because the core product type matches, but some explicit requirements in the query are missing, cannot be confirmed, or deviate from the query. Despite the mismatch, the product can still be considered a non-target but acceptable substitute. 22 +### High Relevant
  23 +The product satisfies the user’s main intent: the core product type matches, but some explicitly requested attributes are missing from the product information, cannot be confirmed, or show minor / non-critical deviations. The product is still a good substitute for the user’s core need.
22 24
23 -Use Partial when:  
24 -- The core product type matches, but some requested attributes cannot be confirmed.  
25 -- The core product type matches, but some secondary requirements deviate or are inconsistent.  
26 - The product is not the ideal target, but it is still a plausible and acceptable substitute for the shopper. 25 +Use “High Relevant” in the following cases:
  26 +- The core product type matches, but some requested attributes are missing, not mentioned, or cannot be verified.
  27 +- The core product type matches, but attributes such as color, material, style, fit, or length have minor deviations, as long as the deviation does not materially undermine the user’s main shopping intent.
  28 +- The product is not the user’s ideal target, but in an e-commerce shopping context, it would still be considered an acceptable and strong substitute.
27 29
28 -Typical cases:  
29 -- Query: "red fitted t-shirt", product: "Women's T-Shirt" → color/fit cannot be confirmed.
30 -- Query: "red fitted t-shirt", product: "Blue Fitted T-Shirt" → product type and fit match, but color differs. 30 +Typical examples:
  31 +- Query: “red slim-fit T-shirt”
  32 + Product: “women’s T-shirt”
  33 + → Color and fit cannot be confirmed.
  34 +- Query: “red slim-fit T-shirt”
  35 + Product: “blue slim-fit T-shirt”
  36 + → Product type and fit match, but the color is different.
31 37
32 -Detailed example:  
33 -- Query: "cotton long sleeve shirt"  
34 -- Product: "J.VER Men's Linen Shirts Casual Button Down Long Sleeve Shirt Solid Spread Collar Summer Beach Shirts with Pocket" 38 +Detailed case:
  39 +- Query: “cotton long-sleeve shirt”
  40 +- Product: “J.VER Men's Linen Shirt Casual Button Down Long Sleeve Solid Plain Collar Summer Beach Shirt with Pocket”
35 41
36 Analysis: 42 Analysis:
37 -- Material mismatch: the query explicitly requires cotton, while the product is linen, so it cannot be Exact.  
38 -- However, the core product type still matches: both are long sleeve shirts.  
39 -- In an e-commerce setting, the shopper may still consider clicking this item because the style and use case are similar.  
40 -Therefore, it should be labeled Partial as a non-target but acceptable substitute. 43 +- Material mismatch: the query explicitly requires “cotton”, while the product is “linen”, so it cannot be labeled as “Exact Match”.
  44 +- However, the core category still matches: both are long-sleeve shirts.
  45 +- In e-commerce search, users may still click this item because the style and wearing scenario are similar.
  46 +- Therefore, it should be labeled as “High Relevant”: not the exact target, but a good substitute.
41 47
42 -### Irrelevant  
43 -The product does not satisfy the user's main shopping intent. 48 +Detailed case:
  49 +- Query: “black mid-length skirt”
  50 +- Product: “New spring autumn loose slimming full long floral skirt pleated skirt”
44 51
45 -Use Irrelevant when:  
46 -- The core product type does not match the query.  
47 -- The product belongs to a broadly related category, but not the specific product subtype requested, and shoppers would not consider them interchangeable.  
48 -- The core product type matches, but the product clearly contradicts an explicit and important requirement in the query. 52 +Analysis:
  53 +- Category match: the product is a skirt, so the category matches.
  54 +- Color mismatch: the product description does not indicate black and explicitly mentions “floral”, which is substantially different from plain black.
  55 +- Length deviation: the user asks for “mid-length”, while the product title emphasizes “long skirt”, which is somewhat longer.
  56 +- However, the core category “skirt” still matches, and style features such as “slimming” and “full skirt” may still fit some preferences of users searching for a mid-length skirt. Also, “long” versus “mid-length” is a deviation, but not a severe contradiction.
  57 +- Therefore, this should be labeled as “High Relevant”: the core type matches, but there are several non-fatal attribute deviations.
49 58
50 -Typical cases:  
51 -- Query: "pants", product: "shoes" → wrong product type.
52 -- Query: "dress", product: "skirt" → different product type.
53 -- Query: "fitted pants", product: "loose wide-leg pants" → explicit contradiction on fit.
54 -- Query: "sleeveless dress", product: "long sleeve dress" → explicit contradiction on sleeve style. 59 +### Low Relevant
  60 +The product has a noticeable gap from the user’s core target, but still shares some similarity with the query in style, scenario, function, or broader category. A small portion of users may still view it as a barely acceptable substitute. It is not the intended item, but still has some relevance.
55 61
56 -This label emphasizes clarity of user intent. When the query specifies a concrete product type or an important attribute, products that conflict with that intent should be judged Irrelevant even if they are related at a higher category level. 62 +Use “Low Relevant” in the following cases:
  63 +- The core product type does not match, but the two types are still very close in style, wearing scenario, or function, so there is still some substitutability.
  64 +- The core product type matches, but the product differs from the user’s ideal target on multiple attributes; it still has some relevance, but is no longer a strong substitute.
  65 +- An important query requirement is clearly violated, but the product still retains a limited reason to be clicked.
57 66
58 -## Decision Principles 67 +Typical cases:
  68 +- Query: “black mid-length skirt”
  69 + Product: “New high-waisted V-neck mid-length dress elegant printed black sexy dress”
  70 + → The core product type differs (“skirt” vs “dress”), but both belong to closely related apparel types and share a similar mid-length style, so it is “Low Relevant”.
59 71
60 -1. Product type is the highest-priority factor.  
61 - If the query clearly specifies a concrete product type, the result must match that product type to be Exact or Partial.  
62 - A different product type is usually Irrelevant, not Partial. 72 +- Query: “jeans”
  73 + Product: “casual pants”
  74 + → The core product type is different, but both belong to the broader pants category, and the style / wearing scenario may still be close enough to be a weak substitute.
63 75
64 -2. Similar or related product types are not interchangeable when the query is specific. 76 +### Irrelevant
  77 +The product does not satisfy the user’s main shopping intent, and the likelihood of user engagement is very low.
  78 +
  79 +Typical situations:
  80 +- The core product type does not match the query and is not a close substitute in style, scenario, or function.
  81 +- The product belongs to a roughly related broader category, but not to an interchangeable subtype explicitly requested in the query, and the style or usage scenario differs significantly.
  82 +- The core product type matches, but the product clearly violates an explicit and important requirement in the query, with little or no acceptable substitutability.
  83 +
  84 +Typical examples:
  85 +- Query: “pants”
  86 + Product: “shoes”
  87 + → Wrong product type.
  88 +- Query: “slim-fit pants”
  89 + Product: “loose wide-leg pants”
  90 + → Clear contradiction in fit, with extremely low substitutability.
  91 +- Query: “sleeveless dress”
  92 + Product: “long-sleeve dress”
  93 + → Clear contradiction in sleeve type.
  94 +- Query: “jeans”
  95 + Product: “sweatpants”
  96 + → Different core category, with significantly different style and wearing scenario.
  97 +- Query: “boots”
  98 + Product: “sneakers”
  99 + → Different core category, different function, and different usage scenario.
  100 +
  101 +## Judgment Principles
  102 +
  103 +1. **Product type is the highest-priority factor.**
  104 + If the query explicitly specifies a concrete product type, the result must match that product type in order to be labeled as “Exact Match” or “High Relevant”.
  105 + Different product types should usually be labeled as “Low Relevant” or “Irrelevant”.
  106 +
  107 + - **Low Relevant**: use only when the two product types are very close in style, scenario, or function, and the user may still treat one as a barely acceptable substitute for the other.
  108 + - **Irrelevant**: all other product type mismatch cases.
  109 +
  110 +2. **Similar or related product types are usually not directly interchangeable when the query is explicit, but their closeness should determine whether the label is “Low Relevant” or “Irrelevant”.**
65 For example: 111 For example:
66 - - dress vs skirt vs jumpsuit  
67 - - jeans vs pants  
68 - - t-shirt vs blouse  
69 - - cardigan vs sweater  
70 - - boots vs shoes  
71 - - bra vs top  
72 - - backpack vs bag  
73 - If the user explicitly searched for one of these, the others should usually be judged Irrelevant.  
74 -  
75 -3. If the core product type matches, then evaluate attributes.  
76 - If all explicit attributes match → Exact
77 - If some attributes are missing, uncertain, or partially mismatched, but the item is still an acceptable substitute → Partial
78 - If an explicit and important attribute is clearly contradicted, and the item is not a reasonable substitute → Irrelevant
79 -  
80 -4. Distinguish carefully between "not mentioned" and "contradicted".  
81 - - If an attribute is not mentioned or cannot be verified, prefer Partial.  
82 - - If an attribute is explicitly opposite to the query, use Irrelevant unless the item is still reasonably acceptable as a substitute under the shopping context. 112 + - **May be Low Relevant due to strong similarity in style / scenario**: dress vs skirt, long skirt vs mid-length skirt, jeans vs casual pants, sneakers vs skate shoes.
  113 + - **Should be Irrelevant due to substantial difference in style / scenario**: pants vs shoes, T-shirt vs hat, boots vs sneakers, jeans vs suit pants, backpack vs handbag.
  114 +
  115 +3. **Once the core product type matches, evaluate attributes.**
  116 + - All explicit attributes match → **Exact Match**
  117 + - Some attributes are missing, not mentioned, cannot be verified, or show only minor deviations → **High Relevant**
  118 + - There are multiple attribute deviations, or an important attribute is clearly violated, but the product still retains some substitutability → **Low Relevant**
  119 + - There is a clear and important hard conflict, and substitutability is extremely low → **Irrelevant**
  120 +
  121 +4. **Strictly distinguish among “not mentioned / cannot confirm”, “minor deviation”, and “explicit contradiction”.**
  122 + - If an attribute is not mentioned or cannot be verified, prefer **High Relevant**.
  123 + - If an attribute shows a minor deviation, such as different color, different material, or slightly different length, it should usually be labeled **High Relevant**.
  124 + - If an attribute is explicitly opposite to the query requirement, such as sleeveless vs long-sleeve or slim-fit vs loose wide-leg, decide between **Low Relevant** and **Irrelevant** based on the severity of the conflict and practical substitutability.
  125 + - If the conflict directly breaks the user’s main shopping goal, it should usually be labeled **Irrelevant**.
  126 +
  127 +5. **Substitutability should be judged from real shopping intent, not just surface-level textual similarity.**
  128 + The question is whether the user would realistically accept the product in a shopping scenario.
  129 + - Good substitute → **High Relevant**
  130 + - Barely acceptable substitute → **Low Relevant**
  131 + - Hardly substitutable at all → **Irrelevant**
  132 +
  133 +6. **When product information is insufficient, do not treat “cannot confirm” as “conflict”.**
  134 + If a product does not mention an attribute, that does not mean the attribute is definitely violated.
  135 + Therefore:
  136 + - If the attribute is not mentioned or cannot be confirmed, prefer **High Relevant**;
  137 + - Only treat it as a conflict when the product information clearly shows the opposite of the query requirement.
83 138
84 Query: {query} 139 Query: {query}
85 140
@@ -87,88 +142,139 @@ Products: @@ -87,88 +142,139 @@ Products:
87 {lines} 142 {lines}
88 143
89 ## Output Format 144 ## Output Format
90 -Strictly output {n} lines, each line containing exactly one of:  
91 -Exact  
92 -Partial 145 +Output exactly {n} lines.
  146 +Each line must be exactly one of the following:
  147 +Exact Match
  148 +High Relevant
  149 +Low Relevant
93 Irrelevant 150 Irrelevant
94 151
95 -The lines must correspond sequentially to the products above.  
96 -Do not output any other information. 152 +The output lines must correspond to the products above in the same order.
  153 +Do not output anything else.
97 """ 154 """
98 155
99 -_CLASSIFY_BATCH_SIMPLE_TEMPLATE__zh = _CLASSIFY_BATCH_SIMPLE_TEMPLATE_ZH = """你是一个服饰电商搜索系统中的相关性判断助手。 156 +_CLASSIFY_TEMPLATE_ZH = """你是一个服饰电商搜索系统中的相关性判断助手。
100 给定用户查询词以及每个商品的信息,请为每个商品分配一个相关性标签。 157 给定用户查询词以及每个商品的信息,请为每个商品分配一个相关性标签。
101 158
  159 +你的目标是从电商搜索排序的角度,判断商品是否满足用户的购物意图。
  160 +判断时应优先考虑“用户是否会把该商品视为目标商品,或可接受的替代品”。
  161 +
102 ## 相关性标签 162 ## 相关性标签
103 163
104 ### 完全相关 164 ### 完全相关
105 -核心产品类型匹配,所有明确提及的关键属性均有产品信息支撑。 165 +商品满足用户的核心购物意图:核心商品类型匹配,且查询中所有明确提及的关键属性均有商品信息支持。
106 166
107 典型适用场景: 167 典型适用场景:
108 -- 查询仅包含产品类型,产品即为该类型。
  168 +- 查询仅包含商品类型,商品即为该类型。
109 -- 查询包含“产品类型 + 属性”,产品在类型及所有明确属性上均符合。 169 +- 查询包含“商品类型 + 属性”,商品在类型及所有明确属性上均符合。
110 170
111 -### 部分相关
112 -产品满足用户的主要意图(核心产品类型匹配),但查询中明确的部分要求未体现,或存在偏差。虽然有不一致,但仍属于“非目标但可接受”的替代品。 171 +### 基本相关
  172 +商品满足用户的主要意图:核心商品类型匹配,但查询中明确提出的部分要求未在商品信息中体现、无法确认,或存在轻微偏差 / 非关键偏差。该商品仍是满足用户核心需求的良好替代品。
113 173
114 -在以下情况使用部分相关:
115 -- 核心产品类型匹配,但部分请求的属性在商品信息中缺失、未提及或无法确认。
116 -- 核心产品类型匹配,但材质、版型、风格等次要要求存在偏差或不一致。
117 -- 商品不是用户最理想的目标,但从电商购物角度看,仍可能被用户视为可接受的替代品。 174 +在以下情况使用“基本相关”:
  175 +- 核心商品类型匹配,但部分属性缺失、未提及或无法确认。
  176 +- 核心商品类型匹配,但颜色、材质、风格、版型、长度等属性存在轻微偏差,只要这种偏差不会明显破坏用户的主要购买意图。
  177 +- 商品不是用户最理想的目标,但在电商购物场景下仍可能被视为可接受、且较优的替代品。
118 178
119 典型情况: 179 典型情况:
120 -- 查询:“红色修身T恤”,产品:“女士T恤” → 颜色/版型无法确认。
121 -- 查询:“红色修身T恤”,产品:“蓝色修身T恤” → 产品类型和版型匹配,但颜色不同。 180 +- 查询:“红色修身T恤”,商品:“女士T恤”
  181 + → 颜色、版型无法确认。
  182 +- 查询:“红色修身T恤”,商品:“蓝色修身T恤”
  183 + → 商品类型和版型匹配,但颜色不同。
122 184
123 详细案例: 185 详细案例:
124 - 查询:“棉质长袖衬衫” 186 - 查询:“棉质长袖衬衫”
125 - 商品:“J.VER男式亚麻衬衫休闲纽扣长袖衬衫纯色平领夏季沙滩衬衫带口袋” 187 - 商品:“J.VER男式亚麻衬衫休闲纽扣长袖衬衫纯色平领夏季沙滩衬衫带口袋”
126 188
127 分析: 189 分析:
128 -- 材质不符:Query 明确指定“棉质”,而商品为“亚麻”,因此不能判为完全相关。 190 +- 材质不符:Query 明确指定“棉质”,而商品为“亚麻”,因此不能判为“完全相关”。
129 - 但核心品类仍然匹配:两者都是“长袖衬衫”。 191 - 但核心品类仍然匹配:两者都是“长袖衬衫”。
130 - 在电商搜索中,用户仍可能因为款式、穿着场景相近而点击该商品。 192 - 在电商搜索中,用户仍可能因为款式、穿着场景相近而点击该商品。
131 -- 因此应判为部分相关,即“非目标但可接受”的替代品。 193 +- 因此应判为“基本相关”,即“非精确目标,但属于良好替代品”。
132 194
133 -### 不相关
134 -产品未满足用户的主要购物意图,主要表现为以下情形之一:
135 -- 核心产品类型与查询不匹配。
136 -- 产品虽属大致相关的大类,但与查询指定的具体子类不可互换。
137 -- 核心产品类型匹配,但产品明显违背了查询中一个明确且重要的属性要求。 195 +详细案例:
  196 +- 查询:“黑色中长半身裙”
  197 +- 商品:“春秋季新款宽松显瘦大摆长裙碎花半身裙褶皱设计裙”
  198 +
  199 +分析:
  200 +- 品类匹配:商品是“半身裙”,品类符合。
  201 +- 颜色不匹配:商品描述未提及黑色,且明确包含“碎花”,与纯黑差异较大。
  202 +- 长度存在偏差:用户要求“中长”,而商品标题强调“长裙”,长度偏长。
  203 +- 但核心品类“半身裙”匹配,“显瘦”“大摆”等风格特征仍可能符合部分搜索“中长半身裙”用户的潜在偏好;同时“长裙”和“中长”虽有偏差,但不构成严重对立。
  204 +- 因此应判为“基本相关”:核心品类匹配,但存在若干非致命属性偏差。
  205 +
  206 +### 弱相关
  207 +商品与用户的核心目标存在明显差距,但仍与查询在风格、场景、功能或大类上具有一定相似性,可能被少量用户视为勉强可接受的替代品。属于“非目标,但仍有一定关联”。
  208 +
  209 +在以下情况使用“弱相关”:
  210 +- 核心商品类型不一致,但两者在风格、穿着场景或功能上非常接近,仍具有一定替代性。
  211 +- 核心商品类型匹配,但在多个属性上与用户理想目标差距较大,虽仍有一定关联性,但已不是高质量替代品。
  212 +- 查询要求中的某个重要属性被明显违背,但商品仍保留少量被点击的理由。
138 213
139 典型情况: 214 典型情况:
140 -- 查询:“裤子”,产品:“鞋子” → 产品类型错误。
141 -- 查询:“连衣裙”,产品:“半身裙” → 具体产品类型不同。
142 -- 查询:“修身裤”,产品:“宽松阔腿裤” → 与版型要求明显冲突。
143 -- 查询:“无袖连衣裙”,产品:“长袖连衣裙” → 与袖型要求明显冲突。 215 +- 查询:“黑色中长半身裙”,商品:“新款高腰V领中长款连衣裙 优雅印花黑色性感连衣裙”
  216 + → 核心商品类型“半身裙”与“连衣裙”不同,但两者同属裙装,且款式上均为“中长款”,在穿搭场景上接近,因此属于“弱相关”。
  217 +
  218 +- 查询:“牛仔裤”,商品:“休闲裤”
  219 + → 核心商品类型不同,但同属裤装大类,风格和穿着场景可能接近,可作为较弱替代品。
  220 +
  221 +### 不相关
  222 +商品未满足用户的主要购物意图,用户点击动机极低。
  223 +
  224 +主要表现为以下情形之一:
  225 +- 核心商品类型与查询不匹配,且不属于风格 / 场景 / 功能接近的可替代品。
  226 +- 商品虽属于大致相关的大类,但与查询明确指定的具体子类不可互换,且风格或场景差异大。
  227 +- 核心商品类型匹配,但商品明显违背了查询中一个明确且重要的要求,且几乎不具备可接受的替代性。
144 228
145 -该标签强调用户意图的明确性。当查询指向具体类型或关键属性时,即使产品在更高层级类别上相关,也应按不相关处理。 229 +典型情况:
  230 +- 查询:“裤子”,商品:“鞋子”
  231 + → 商品类型错误。
  232 +- 查询:“修身裤”,商品:“宽松阔腿裤”
  233 + → 与版型要求明显冲突,替代性极低。
  234 +- 查询:“无袖连衣裙”,商品:“长袖连衣裙”
  235 + → 与袖型要求明显冲突。
  236 +- 查询:“牛仔裤”,商品:“运动裤”
  237 + → 核心品类不同(牛仔裤 vs 运动裤),风格和场景差异大。
  238 +- 查询:“靴子”,商品:“运动鞋”
  239 + → 核心品类不同,功能和适用场景差异大。
146 240
147 ## 判断原则 241 ## 判断原则
148 242
149 -1. 产品类型是最高优先级因素。
150 - 如果查询明确指定了具体产品类型,那么结果必须匹配该产品类型,才可能判为“完全相关”或“部分相关”。
151 - 不同产品类型通常应判为“不相关”,而不是“部分相关”。 243 +1. **商品类型是最高优先级因素。**
  244 + 如果查询明确指定了具体商品类型,那么结果必须匹配该商品类型,才可能判为“完全相关”或“基本相关”。
  245 + 不同商品类型通常应判为“弱相关”或“不相关”。
  246 +
  247 + - **弱相关**:仅当两种商品类型在风格、场景、功能上非常接近,用户有一定概率将其视为勉强可接受的替代品时使用。
  248 + - **不相关**:其他所有商品类型不匹配的情况。
152 249
153 -2. 相似或相关的产品类型,在查询明确时通常不可互换。 250 +2. **相似或相关的商品类型,在查询明确时通常不可直接互换,但要根据接近程度区分“弱相关”与“不相关”。**
154 例如: 251 例如:
155 - - 连衣裙 vs 半身裙 vs 连体裤
156 - - 牛仔裤 vs 裤子
157 - - T恤 vs 衬衫/上衣
158 - - 开衫 vs 毛衣
159 - - 靴子 vs 鞋子
160 - - 文胸 vs 上衣
161 - - 双肩包 vs 包
162 - 如果用户明确搜索其中一种,其他类型通常应判为“不相关”。
163 -
164 -3. 当核心产品类型匹配后,再评估属性。
165 - - 所有明确属性都匹配 → 完全相关
166 - - 部分属性缺失、无法确认,或存在一定偏差,但仍是可接受替代品 → 部分相关
167 - - 明确且重要的属性被明显违背,且不能作为合理替代品 → 不相关
168 -
169 -4. 要严格区分“未提及/无法确认”和“明确冲突”。
170 - - 如果某属性没有提及,或无法验证,优先判为“部分相关”。
171 - - 如果某属性与查询要求明确相反,则判为“不相关”;除非在购物语境下它仍明显属于可接受替代品。 252 + - **风格 / 场景高度接近,可判为弱相关**:连衣裙 vs 半身裙、长裙 vs 中长裙、牛仔裤 vs 休闲裤、运动鞋 vs 板鞋。
  253 + - **风格 / 场景差异大,应判为不相关**:裤子 vs 鞋子、T恤 vs 帽子、靴子 vs 运动鞋、牛仔裤 vs 西装裤、双肩包 vs 手提包。
  254 +
  255 +3. **当核心商品类型匹配后,再评估属性。**
  256 + - 所有明确属性都匹配 → **完全相关**
  257 + - 部分属性缺失、未提及、无法确认,或存在轻微偏差 → **基本相关**
  258 + - 存在多个属性偏差,或某个重要属性被明显违背,但商品仍保留一定替代性 → **弱相关**
  259 + - 存在明确且重要的强冲突,且替代性极低 → **不相关**
  260 +
  261 +4. **要严格区分“未提及 / 无法确认”“轻微偏差”“明确冲突”。**
  262 + - 如果某属性没有提及,或无法验证,优先判为“基本相关”。
  263 + - 如果某属性存在轻微偏差,如颜色不同、材质不同、长度略有差异,通常判为“基本相关”。
  264 + - 如果某属性与查询要求明确相反,如无袖 vs 长袖、修身 vs 宽松阔腿,则要根据冲突严重性与替代性,在“弱相关”与“不相关”之间判断。
  265 + - 若该冲突会直接破坏用户的主要购买目标,通常判为“不相关”。
  266 +
  267 +5. **“是否可替代”应从真实电商购物意图出发判断。**
  268 + 不是只看字面相似,而要看用户在购物场景下是否可能接受该商品。
  269 + - 良好替代品 → **基本相关**
  270 + - 勉强替代品 → **弱相关**
  271 + - 几乎不可替代 → **不相关**
  272 +
  273 +6. **若商品信息不足,不要把“无法确认”误判为“冲突”。**
  274 + 商品未写明某属性,不等于该属性一定不符合。
  275 + 因此:
  276 + - 未提及 / 无法确认,优先按“基本相关”处理;
  277 + - 只有当商品信息明确显示与查询要求相反时,才视为属性冲突。
172 278
173 查询:{query} 279 查询:{query}
174 280
@@ -176,9 +282,10 @@ _CLASSIFY_BATCH_SIMPLE_TEMPLATE__zh = _CLASSIFY_BATCH_SIMPLE_TEMPLATE_ZH = """你 @@ -176,9 +282,10 @@ _CLASSIFY_BATCH_SIMPLE_TEMPLATE__zh = _CLASSIFY_BATCH_SIMPLE_TEMPLATE_ZH = """你
176 {lines} 282 {lines}
177 283
178 ## 输出格式 284 ## 输出格式
179 -严格输出 {n} 行,每行只能是以下三者之一: 285 +严格输出 {n} 行,每行只能是以下四者之一:
180 完全相关 286 完全相关
181 -部分相关 287 +基本相关
  288 +弱相关
182 不相关 289 不相关
183 290
184 输出行必须与上方商品顺序一一对应。 291 输出行必须与上方商品顺序一一对应。
@@ -186,77 +293,7 @@ _CLASSIFY_BATCH_SIMPLE_TEMPLATE__zh = _CLASSIFY_BATCH_SIMPLE_TEMPLATE_ZH = """你 @@ -186,77 +293,7 @@ _CLASSIFY_BATCH_SIMPLE_TEMPLATE__zh = _CLASSIFY_BATCH_SIMPLE_TEMPLATE_ZH = """你
186 """ 293 """
187 294
188 295
189 -  
190 -def classify_batch_simple_prompt(query: str, numbered_doc_lines: Sequence[str]) -> str: 296 +def classify_prompt(query: str, numbered_doc_lines: Sequence[str]) -> str:
191 lines = "\n".join(numbered_doc_lines) 297 lines = "\n".join(numbered_doc_lines)
192 n = len(numbered_doc_lines) 298 n = len(numbered_doc_lines)
193 - return _CLASSIFY_BATCH_SIMPLE_TEMPLATE.format(query=query, lines=lines, n=n)  
194 -  
195 -  
196 -_EXTRACT_QUERY_PROFILE_TEMPLATE = """You are building a structured intent profile for e-commerce relevance judging.  
197 -Use the original user query as the source of truth. Parser hints may help, but if a hint conflicts with the original query, trust the original query.  
198 -Be conservative: only mark an attribute as required if the user explicitly asked for it.  
199 -  
200 -Return JSON with this schema:  
201 -{{  
202 - "normalized_query_en": string,  
203 - "primary_category": string,  
204 - "allowed_categories": [string],  
205 - "required_attributes": [  
206 - {{"name": string, "required_terms": [string], "conflicting_terms": [string], "match_mode": "explicit"}}  
207 - ],  
208 - "notes": [string]  
209 -}}  
210 -  
211 -Guidelines:  
212 -- Exact later will require explicit evidence for all required attributes.  
213 -- allowed_categories should contain only near-synonyms of the same product type, not substitutes. For example dress can allow midi dress/cocktail dress, but not skirt, top, jumpsuit, or outfit unless the query explicitly asks for them.  
214 -- If the query asks for dress/skirt/jeans/t-shirt, near but different product types are not Exact.  
215 -- If the query includes color, fit, silhouette, or length, include them as required_attributes.  
216 -- For fit words, include conflicting terms when obvious, e.g. fitted conflicts with oversized/loose; oversized conflicts with fitted/tight.  
217 -- For color, include conflicting colors only when clear from the query.  
218 -  
219 -Original query: {query}  
220 -Parser hints JSON: {hints_json}  
221 -"""  
222 -  
223 -  
224 -def extract_query_profile_prompt(query: str, parser_hints: Dict[str, Any]) -> str:  
225 - hints_json = json.dumps(parser_hints, ensure_ascii=False)  
226 - return _EXTRACT_QUERY_PROFILE_TEMPLATE.format(query=query, hints_json=hints_json)  
227 -  
228 -  
229 -_CLASSIFY_BATCH_COMPLEX_TEMPLATE = """You are an e-commerce search relevance judge.  
230 -Judge each product against the structured query profile below.  
231 -  
232 -Relevance rules:  
233 -- Exact: product type matches the target intent, and every explicit required attribute is positively supported by the title/options/tags/category. If an attribute is missing or only guessed, it is NOT Exact.  
234 -- Partial: main product type/use case matches, but some required attribute is missing, weaker, uncertain, or only approximately matched.  
235 -- Irrelevant: product type/use case mismatched, or an explicit required attribute clearly conflicts.  
236 -- Be conservative with Exact.  
237 -- Graphic/holiday/message tees are not Exact for a plain color/style tee query unless that graphic/theme was requested.  
238 -- Jumpsuit/romper/set is not Exact for dress/skirt/jeans queries.  
239 -  
240 -Original query: {query}  
241 -Structured query profile JSON: {profile_json}  
242 -  
243 -Products:  
244 -{lines}  
245 -  
246 -Return JSON only, with schema:  
247 -{{"labels":[{{"index":1,"label":"Exact","reason":"short phrase"}}]}}  
248 -"""  
249 -  
250 -  
251 -def classify_batch_complex_prompt(  
252 - query: str,  
253 - query_profile: Dict[str, Any],  
254 - numbered_doc_lines: Sequence[str],  
255 -) -> str:  
256 - lines = "\n".join(numbered_doc_lines)  
257 - profile_json = json.dumps(query_profile, ensure_ascii=False)  
258 - return _CLASSIFY_BATCH_COMPLEX_TEMPLATE.format(  
259 - query=query,  
260 - profile_json=profile_json,  
261 - lines=lines,  
262 - ) 299 + return _CLASSIFY_TEMPLATE_EN.format(query=query, lines=lines, n=n)
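Since the new template pins the model's reply to exactly `{n}` lines, each drawn from four literal labels, a caller can validate responses strictly before computing metrics. A hypothetical parser sketch (not part of this diff; the label strings are assumed to match the template verbatim):

```python
from typing import List

# Assumed to match the four labels the template allows, verbatim.
VALID_LABELS = ("Exact Match", "High Relevant", "Low Relevant", "Irrelevant")

def parse_classify_response(text: str, expected_n: int) -> List[str]:
    """Parse one label per line, failing fast if the model drifts from the contract."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if len(lines) != expected_n:
        raise ValueError(f"expected {expected_n} labels, got {len(lines)}")
    bad = [line for line in lines if line not in VALID_LABELS]
    if bad:
        raise ValueError(f"unexpected labels: {bad!r}")
    return lines

labels = parse_classify_response("Exact Match\nLow Relevant\nIrrelevant\n", 3)
```

Failing fast here matters because the labels are consumed positionally: a missing or extra line would silently shift every subsequent product's judgment.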
scripts/evaluation/eval_framework/reports.py
@@ -4,7 +4,7 @@ from __future__ import annotations @@ -4,7 +4,7 @@ from __future__ import annotations
4 4
5 from typing import Any, Dict 5 from typing import Any, Dict
6 6
7 -from .constants import RELEVANCE_EXACT, RELEVANCE_IRRELEVANT, RELEVANCE_PARTIAL 7 +from .constants import RELEVANCE_EXACT, RELEVANCE_HIGH, RELEVANCE_IRRELEVANT, RELEVANCE_LOW
8 8
9 9
10 def render_batch_report_markdown(payload: Dict[str, Any]) -> str: 10 def render_batch_report_markdown(payload: Dict[str, Any]) -> str:
@@ -29,8 +29,9 @@ def render_batch_report_markdown(payload: Dict[str, Any]) -> str: @@ -29,8 +29,9 @@ def render_batch_report_markdown(payload: Dict[str, Any]) -> str:
29 "", 29 "",
30 "## Label Distribution", 30 "## Label Distribution",
31 "", 31 "",
32 - f"- Exact: {distribution.get(RELEVANCE_EXACT, 0)}",  
33 - f"- Partial: {distribution.get(RELEVANCE_PARTIAL, 0)}", 32 + f"- Exact Match: {distribution.get(RELEVANCE_EXACT, 0)}",
  33 + f"- High Relevant: {distribution.get(RELEVANCE_HIGH, 0)}",
  34 + f"- Low Relevant: {distribution.get(RELEVANCE_LOW, 0)}",
34 f"- Irrelevant: {distribution.get(RELEVANCE_IRRELEVANT, 0)}", 35 f"- Irrelevant: {distribution.get(RELEVANCE_IRRELEVANT, 0)}",
35 ] 36 ]
36 ) 37 )
@@ -41,8 +42,9 @@ def render_batch_report_markdown(payload: Dict[str, Any]) -> str: @@ -41,8 +42,9 @@ def render_batch_report_markdown(payload: Dict[str, Any]) -> str:
41 for key, value in sorted((item.get("metrics") or {}).items()): 42 for key, value in sorted((item.get("metrics") or {}).items()):
42 lines.append(f"- {key}: {value}") 43 lines.append(f"- {key}: {value}")
43 distribution = item.get("distribution") or {} 44 distribution = item.get("distribution") or {}
44 - lines.append(f"- Exact: {distribution.get(RELEVANCE_EXACT, 0)}")  
45 - lines.append(f"- Partial: {distribution.get(RELEVANCE_PARTIAL, 0)}") 45 + lines.append(f"- Exact Match: {distribution.get(RELEVANCE_EXACT, 0)}")
  46 + lines.append(f"- High Relevant: {distribution.get(RELEVANCE_HIGH, 0)}")
  47 + lines.append(f"- Low Relevant: {distribution.get(RELEVANCE_LOW, 0)}")
46 lines.append(f"- Irrelevant: {distribution.get(RELEVANCE_IRRELEVANT, 0)}") 48 lines.append(f"- Irrelevant: {distribution.get(RELEVANCE_IRRELEVANT, 0)}")
47 lines.append("") 49 lines.append("")
48 return "\n".join(lines) 50 return "\n".join(lines)
scripts/evaluation/eval_framework/static/eval_web.css
@@ -35,12 +35,14 @@ @@ -35,12 +35,14 @@
35 .results { display: grid; gap: 10px; } 35 .results { display: grid; gap: 10px; }
36 .result { display: grid; grid-template-columns: 110px 100px 1fr; gap: 14px; align-items: center; background: var(--panel); border: 1px solid var(--line); border-radius: 18px; padding: 12px; } 36 .result { display: grid; grid-template-columns: 110px 100px 1fr; gap: 14px; align-items: center; background: var(--panel); border: 1px solid var(--line); border-radius: 18px; padding: 12px; }
37 .badge { display: inline-block; padding: 8px 10px; border-radius: 999px; color: white; font-weight: 700; text-align: center; } 37 .badge { display: inline-block; padding: 8px 10px; border-radius: 999px; color: white; font-weight: 700; text-align: center; }
38 - .Exact { background: var(--exact); }  
39 - .Partial { background: var(--partial); }  
40 - .Irrelevant { background: var(--irrelevant); }  
41 - .Unknown { background: #637381; } 38 + .label-exact-match { background: var(--exact); }
  39 + .label-high-relevant { background: var(--partial); }
  40 + .label-low-relevant { background: #6b5b95; }
  41 + .label-irrelevant { background: var(--irrelevant); }
  42 + .badge-unknown { background: #637381; }
42 .thumb { width: 100px; height: 100px; object-fit: cover; border-radius: 14px; background: #e7e1d4; } 43 .thumb { width: 100px; height: 100px; object-fit: cover; border-radius: 14px; background: #e7e1d4; }
43 - .title { font-size: 16px; font-weight: 700; margin-bottom: 8px; } 44 + .title { font-size: 16px; font-weight: 700; margin-bottom: 4px; }
  45 + .title-zh { font-size: 14px; font-weight: 500; color: var(--muted); margin-bottom: 8px; line-height: 1.4; }
44 .options { color: var(--muted); line-height: 1.5; font-size: 14px; } 46 .options { color: var(--muted); line-height: 1.5; font-size: 14px; }
45 .section { margin-bottom: 28px; } 47 .section { margin-bottom: 28px; }
46 .history { font-size: 13px; line-height: 1.5; } 48 .history { font-size: 13px; line-height: 1.5; }
scripts/evaluation/eval_framework/static/eval_web.js
@@ -13,6 +13,10 @@ @@ -13,6 +13,10 @@
13 root.appendChild(card); 13 root.appendChild(card);
14 }); 14 });
15 } 15 }
  16 + function labelBadgeClass(label) {
  17 + if (!label || label === 'Unknown') return 'badge-unknown';
  18 + return 'label-' + String(label).toLowerCase().replace(/\s+/g, '-');
  19 + }
16 function renderResults(results, rootId='results', showRank=true) { 20 function renderResults(results, rootId='results', showRank=true) {
17 const mount = document.getElementById(rootId); 21 const mount = document.getElementById(rootId);
18 mount.innerHTML = ''; 22 mount.innerHTML = '';
@@ -21,10 +25,11 @@ @@ -21,10 +25,11 @@
21 const box = document.createElement('div'); 25 const box = document.createElement('div');
22 box.className = 'result'; 26 box.className = 'result';
23 box.innerHTML = ` 27 box.innerHTML = `
24 - <div><span class="badge ${label}">${label}</span><div class="muted" style="margin-top:8px">${showRank ? `#${item.rank || '-'}` : (item.rerank_score != null ? `rerank=${item.rerank_score.toFixed ? item.rerank_score.toFixed(4) : item.rerank_score}` : 'not recalled')}</div></div> 28 + <div><span class="badge ${labelBadgeClass(label)}">${label}</span><div class="muted" style="margin-top:8px">${showRank ? `#${item.rank || '-'}` : (item.rerank_score != null ? `rerank=${item.rerank_score.toFixed ? item.rerank_score.toFixed(4) : item.rerank_score}` : 'not recalled')}</div></div>
25 <img class="thumb" src="${item.image_url || ''}" alt="" /> 29 <img class="thumb" src="${item.image_url || ''}" alt="" />
26 <div> 30 <div>
27 <div class="title">${item.title || ''}</div> 31 <div class="title">${item.title || ''}</div>
  32 + ${item.title_zh ? `<div class="title-zh">${item.title_zh}</div>` : ''}
28 <div class="options"> 33 <div class="options">
29 <div>${(item.option_values || [])[0] || ''}</div> 34 <div>${(item.option_values || [])[0] || ''}</div>
30 <div>${(item.option_values || [])[1] || ''}</div> 35 <div>${(item.option_values || [])[1] || ''}</div>
@@ -41,7 +46,7 @@ @@ -41,7 +46,7 @@
41 const root = document.getElementById('tips'); 46 const root = document.getElementById('tips');
42 const tips = [...(data.tips || [])]; 47 const tips = [...(data.tips || [])];
43 const stats = data.label_stats || {}; 48 const stats = data.label_stats || {};
44 - tips.unshift(`Cached labels for query: ${stats.total || 0}. Recalled hits: ${stats.recalled_hits || 0}. Missed Exact: ${stats.missing_exact_count || 0}. Missed Partial: ${stats.missing_partial_count || 0}.`); 49 + tips.unshift(`Cached labels for query: ${stats.total || 0}. Recalled hits: ${stats.recalled_hits || 0}. Missed (non-irrelevant): ${stats.missing_relevant_count || 0} — Exact: ${stats.missing_exact_count || 0}, High: ${stats.missing_high_count || 0}, Low: ${stats.missing_low_count || 0}.`);
45 root.innerHTML = tips.map(text => `<div class="tip">${text}</div>`).join(''); 50 root.innerHTML = tips.map(text => `<div class="tip">${text}</div>`).join('');
46 } 51 }
47 async function loadQueries() { 52 async function loadQueries() {
scripts/evaluation/eval_framework/static/index.html
@@ -37,7 +37,7 @@ @@ -37,7 +37,7 @@
37 <div id="results" class="results"></div> 37 <div id="results" class="results"></div>
38 </section> 38 </section>
39 <section class="section"> 39 <section class="section">
40 - <h2>Missed Exact / Partial</h2> 40 + <h2>Missed non-irrelevant (cached)</h2>
41 <div id="missingRelevant" class="results"></div> 41 <div id="missingRelevant" class="results"></div>
42 </section> 42 </section>
43 <section class="section"> 43 <section class="section">
scripts/evaluation/eval_framework/store.py
@@ -8,7 +8,7 @@ from dataclasses import dataclass @@ -8,7 +8,7 @@ from dataclasses import dataclass
8 from pathlib import Path 8 from pathlib import Path
9 from typing import Any, Dict, List, Optional, Sequence 9 from typing import Any, Dict, List, Optional, Sequence
10 10
11 -from .constants import VALID_LABELS 11 +from .constants import VALID_LABELS, normalize_stored_label
12 from .utils import ensure_dir, safe_json_dumps, utc_now_iso 12 from .utils import ensure_dir, safe_json_dumps, utc_now_iso
13 13
14 14
@@ -220,7 +220,7 @@ class EvalStore: @@ -220,7 +220,7 @@ class EvalStore:
220 """, 220 """,
221 (tenant_id, query_text), 221 (tenant_id, query_text),
222 ).fetchall() 222 ).fetchall()
223 - return {str(row["spu_id"]): str(row["label"]) for row in rows} 223 + return {str(row["spu_id"]): normalize_stored_label(str(row["label"])) for row in rows}
224 224
225 def upsert_labels( 225 def upsert_labels(
226 self, 226 self,
@@ -379,8 +379,9 @@ class EvalStore: @@ -379,8 +379,9 @@ class EvalStore:
379 SELECT 379 SELECT
380 query_text, 380 query_text,
381 COUNT(*) AS total, 381 COUNT(*) AS total,
382 - SUM(CASE WHEN label='Exact' THEN 1 ELSE 0 END) AS exact_count,  
383 - SUM(CASE WHEN label='Partial' THEN 1 ELSE 0 END) AS partial_count, 382 + SUM(CASE WHEN label IN ('Exact Match','Exact') THEN 1 ELSE 0 END) AS exact_count,
  383 + SUM(CASE WHEN label IN ('High Relevant','Partial') THEN 1 ELSE 0 END) AS high_relevant_count,
  384 + SUM(CASE WHEN label='Low Relevant' THEN 1 ELSE 0 END) AS low_relevant_count,
384 SUM(CASE WHEN label='Irrelevant' THEN 1 ELSE 0 END) AS irrelevant_count, 385 SUM(CASE WHEN label='Irrelevant' THEN 1 ELSE 0 END) AS irrelevant_count,
385 MAX(updated_at) AS updated_at 386 MAX(updated_at) AS updated_at
386 FROM relevance_labels 387 FROM relevance_labels
@@ -395,7 +396,8 @@ class EvalStore: @@ -395,7 +396,8 @@ class EvalStore:
395 "query": str(row["query_text"]), 396 "query": str(row["query_text"]),
396 "total": int(row["total"]), 397 "total": int(row["total"]),
397 "exact_count": int(row["exact_count"] or 0), 398 "exact_count": int(row["exact_count"] or 0),
398 - "partial_count": int(row["partial_count"] or 0), 399 + "high_relevant_count": int(row["high_relevant_count"] or 0),
  400 + "low_relevant_count": int(row["low_relevant_count"] or 0),
399 "irrelevant_count": int(row["irrelevant_count"] or 0), 401 "irrelevant_count": int(row["irrelevant_count"] or 0),
400 "updated_at": row["updated_at"], 402 "updated_at": row["updated_at"],
401 } 403 }
@@ -407,8 +409,9 @@ class EvalStore: @@ -407,8 +409,9 @@ class EvalStore:
407 """ 409 """
408 SELECT 410 SELECT
409 COUNT(*) AS total, 411 COUNT(*) AS total,
410 - SUM(CASE WHEN label='Exact' THEN 1 ELSE 0 END) AS exact_count,  
411 - SUM(CASE WHEN label='Partial' THEN 1 ELSE 0 END) AS partial_count, 412 + SUM(CASE WHEN label IN ('Exact Match','Exact') THEN 1 ELSE 0 END) AS exact_count,
  413 + SUM(CASE WHEN label IN ('High Relevant','Partial') THEN 1 ELSE 0 END) AS high_relevant_count,
  414 + SUM(CASE WHEN label='Low Relevant' THEN 1 ELSE 0 END) AS low_relevant_count,
412 SUM(CASE WHEN label='Irrelevant' THEN 1 ELSE 0 END) AS irrelevant_count, 415 SUM(CASE WHEN label='Irrelevant' THEN 1 ELSE 0 END) AS irrelevant_count,
413 MAX(updated_at) AS updated_at 416 MAX(updated_at) AS updated_at
414 FROM relevance_labels 417 FROM relevance_labels
@@ -420,7 +423,8 @@ class EvalStore: @@ -420,7 +423,8 @@ class EvalStore:
420 "query": query_text, 423 "query": query_text,
421 "total": int((row["total"] or 0) if row else 0), 424 "total": int((row["total"] or 0) if row else 0),
422 "exact_count": int((row["exact_count"] or 0) if row else 0), 425 "exact_count": int((row["exact_count"] or 0) if row else 0),
423 - "partial_count": int((row["partial_count"] or 0) if row else 0), 426 + "high_relevant_count": int((row["high_relevant_count"] or 0) if row else 0),
  427 + "low_relevant_count": int((row["low_relevant_count"] or 0) if row else 0),
424 "irrelevant_count": int((row["irrelevant_count"] or 0) if row else 0), 428 "irrelevant_count": int((row["irrelevant_count"] or 0) if row else 0),
425 "updated_at": row["updated_at"] if row else None, 429 "updated_at": row["updated_at"] if row else None,
426 } 430 }
scripts/evaluation/eval_framework/utils.py
@@ -42,6 +42,14 @@ def pick_text(value: Any, preferred_lang: str = "en") -> str: @@ -42,6 +42,14 @@ def pick_text(value: Any, preferred_lang: str = "en") -> str:
42 return str(value).strip() 42 return str(value).strip()
43 43
44 44
  45 +def zh_title_from_multilingual(title_multilingual: Any) -> str:
  46 + """Chinese title string from API debug ``title_multilingual`` (ES-style dict)."""
  47 + if not isinstance(title_multilingual, dict):
  48 + return ""
  49 + zh = str(title_multilingual.get("zh") or "").strip()
  50 + return zh
  51 +
  52 +
45 def safe_json_dumps(data: Any) -> str: 53 def safe_json_dumps(data: Any) -> str:
46 return json.dumps(data, ensure_ascii=False, separators=(",", ":")) 54 return json.dumps(data, ensure_ascii=False, separators=(",", ":"))
47 55
scripts/evaluation/quick_start_eval.sh renamed to scripts/evaluation/start_eval.sh
@@ -10,7 +10,7 @@ QUERIES="${REPO_EVAL_QUERIES:-scripts/evaluation/queries/queries.txt}" @@ -10,7 +10,7 @@ QUERIES="${REPO_EVAL_QUERIES:-scripts/evaluation/queries/queries.txt}"
10 10
11 usage() { 11 usage() {
12 echo "Usage: $0 batch|batch-rebuild|serve" 12 echo "Usage: $0 batch|batch-rebuild|serve"
13 - echo " batch — batch eval: live search every query, LLM only for missing labels (top_k=50, simple)" 13 + echo " batch — batch eval: live search every query, LLM only for missing labels (top_k=50)"
14 echo " batch-rebuild — deep rebuild: build --force-refresh-labels (search recall pool + full-corpus rerank + batched LLM; expensive)" 14 echo " batch-rebuild — deep rebuild: build --force-refresh-labels (search recall pool + full-corpus rerank + batched LLM; expensive)"
15 echo " serve — eval UI (default http://0.0.0.0:\${EVAL_WEB_PORT:-6010}/; also: ./scripts/start_eval_web.sh)" 15 echo " serve — eval UI (default http://0.0.0.0:\${EVAL_WEB_PORT:-6010}/; also: ./scripts/start_eval_web.sh)"
16 echo "Env: TENANT_ID (default 163), REPO_EVAL_QUERIES, EVAL_WEB_HOST, EVAL_WEB_PORT (default 6010)" 16 echo "Env: TENANT_ID (default 163), REPO_EVAL_QUERIES, EVAL_WEB_HOST, EVAL_WEB_PORT (default 6010)"
@@ -22,8 +22,7 @@ case "${1:-}" in @@ -22,8 +22,7 @@ case "${1:-}" in
22 --tenant-id "$TENANT_ID" \ 22 --tenant-id "$TENANT_ID" \
23 --queries-file "$QUERIES" \ 23 --queries-file "$QUERIES" \
24 --top-k 50 \ 24 --top-k 50 \
25 - --language en \  
26 - --labeler-mode simple 25 + --language en
27 ;; 26 ;;
28 batch-rebuild) 27 batch-rebuild)
29 exec "$PY" scripts/evaluation/build_annotation_set.py build \ 28 exec "$PY" scripts/evaluation/build_annotation_set.py build \
@@ -33,8 +32,7 @@ case "${1:-}" in @@ -33,8 +32,7 @@ case "${1:-}" in
33 --rerank-depth 10000 \ 32 --rerank-depth 10000 \
34 --force-refresh-rerank \ 33 --force-refresh-rerank \
35 --force-refresh-labels \ 34 --force-refresh-labels \
36 - --language en \  
37 - --labeler-mode simple 35 + --language en
38 ;; 36 ;;
39 serve) 37 serve)
40 EVAL_WEB_PORT="${EVAL_WEB_PORT:-6010}" 38 EVAL_WEB_PORT="${EVAL_WEB_PORT:-6010}"
tests/test_product_enrich_partial_mode.py
@@ -322,6 +322,109 @@ def test_analyze_products_uses_product_level_cache_across_batch_requests(): @@ -322,6 +322,109 @@ def test_analyze_products_uses_product_level_cache_across_batch_requests():
322 assert third[1]["anchor_text"] == "anchor:shirt" 322 assert third[1]["anchor_text"] == "anchor:shirt"
323 323
324 324
  325 +def test_analyze_products_reuses_cached_content_with_current_product_identity():
  326 + cached_result = {
  327 + "id": "1165",
  328 + "lang": "zh",
  329 + "title_input": "old-title",
  330 + "title": "法式连衣裙",
  331 + "category_path": "女装>连衣裙",
  332 + "enriched_tags": "法式,收腰",
  333 + "target_audience": "年轻女性",
  334 + "usage_scene": "通勤,约会",
  335 + "season": "春季,夏季",
  336 + "key_attributes": "中长款",
  337 + "material": "聚酯纤维",
  338 + "features": "透气",
  339 + "anchor_text": "法式收腰连衣裙",
  340 + }
  341 + products = [{"id": "69960", "title": "dress"}]
  342 +
  343 + with mock.patch.object(product_enrich, "API_KEY", "fake-key"), mock.patch.object(
  344 + product_enrich,
  345 + "_get_cached_anchor_result",
  346 + wraps=lambda product, target_lang: product_enrich._normalize_analysis_result(
  347 + cached_result,
  348 + product=product,
  349 + target_lang=target_lang,
  350 + ),
  351 + ), mock.patch.object(
  352 + product_enrich,
  353 + "process_batch",
  354 + side_effect=AssertionError("process_batch should not be called on cache hit"),
  355 + ):
  356 + result = product_enrich.analyze_products(
  357 + products,
  358 + target_lang="zh",
  359 + tenant_id="170",
  360 + )
  361 +
  362 + assert result == [
  363 + {
  364 + "id": "69960",
  365 + "lang": "zh",
  366 + "title_input": "dress",
  367 + "title": "法式连衣裙",
  368 + "category_path": "女装>连衣裙",
  369 + "tags": "法式,收腰",
  370 + "target_audience": "年轻女性",
  371 + "usage_scene": "通勤,约会",
  372 + "season": "春季,夏季",
  373 + "key_attributes": "中长款",
  374 + "material": "聚酯纤维",
  375 + "features": "透气",
  376 + "anchor_text": "法式收腰连衣裙",
  377 + }
  378 + ]
  379 +
  380 +
  381 +def test_build_index_content_fields_maps_internal_tags_to_enriched_tags_output():
  382 + def fake_analyze_products(products, target_lang="zh", batch_size=None, tenant_id=None):
  383 + return [
  384 + {
  385 + "id": products[0]["id"],
  386 + "lang": target_lang,
  387 + "title_input": products[0]["title"],
  388 + "title": products[0]["title"],
  389 + "category_path": "玩具>滑行玩具",
  390 + "tags": f"{target_lang}-tag1,{target_lang}-tag2",
  391 + "target_audience": f"{target_lang}-audience",
  392 + "usage_scene": "",
  393 + "season": "",
  394 + "key_attributes": "",
  395 + "material": "",
  396 + "features": "",
  397 + "anchor_text": f"{target_lang}-anchor",
  398 + }
  399 + ]
  400 +
  401 + with mock.patch.object(
  402 + product_enrich,
  403 + "analyze_products",
  404 + side_effect=fake_analyze_products,
  405 + ):
  406 + result = product_enrich.build_index_content_fields(
  407 + items=[{"spu_id": "69960", "title": "dress"}],
  408 + tenant_id="170",
  409 + )
  410 +
  411 + assert result == [
  412 + {
  413 + "id": "69960",
  414 + "qanchors": {"zh": ["zh-anchor"], "en": ["en-anchor"]},
  415 + "enriched_tags": {"zh": ["zh-tag1", "zh-tag2"], "en": ["en-tag1", "en-tag2"]},
  416 + "enriched_attributes": [
  417 + {"name": "enriched_tags", "value": {"zh": "zh-tag1"}},
  418 + {"name": "enriched_tags", "value": {"zh": "zh-tag2"}},
  419 + {"name": "target_audience", "value": {"zh": "zh-audience"}},
  420 + {"name": "enriched_tags", "value": {"en": "en-tag1"}},
  421 + {"name": "enriched_tags", "value": {"en": "en-tag2"}},
  422 + {"name": "target_audience", "value": {"en": "en-audience"}},
  423 + ],
  424 + }
  425 + ]
  426 +
  427 +
325 def test_anchor_cache_key_depends_on_product_input_not_identifiers(): 428 def test_anchor_cache_key_depends_on_product_input_not_identifiers():
326 product_a = { 429 product_a = {
327 "id": "1", 430 "id": "1",