Commit 5c9baf9193552091e5714709721f6ddfa6c3d4a8

Authored by tangwang
1 parent d6c29734

feat(search): unified SKU selection under style intent (option/taxonomy/image) and enhanced attribute-value matching

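## Main capabilities
- Pre-decide the SKU for each hit inside the rerank window: style intent (multi-source synonyms) plus image KNN inner_hits URLs aligned with SKU.image_src, resolved in a single unified decision with no cascading fallback.
- Distinguish the strength of text evidence: final_source ∈ {option, taxonomy, image, none}; matched_sources records option or taxonomy per intent; selected_text / rerank_suffix are backfilled with the actual matched fragment (the original SKU option text or the original taxonomy value).
- Authority rule: when a SKU has a non-empty option value on a resolved dimension, only that value takes part in matching; the SPU-level enriched_taxonomy_attributes never overrides a contradicting SKU-level value (fixes "taxonomy matched a white SKU as khaki").
- Image: nested image KNN / exact rescore now carry inner_hits (url), used as a visual tie-break when promoting a SKU (only within the text-matched set), or for pure image-based promotion when there is no intent.
- Query side: DetectedStyleIntent gains all_terms (the union of zh + en + attribute terms), so attribute-value matching uses the same vocabulary as the intent term list.
- API: SpuResult exposes enriched_attributes / enriched_taxonomy_attributes (so Pydantic no longer drops these ES fields).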
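
The single-pass decision over `final_source` can be sketched as follows. This is a minimal illustration, not the commit's implementation: the function name `select_sku` and the shape of `matched_sources_by_sku` are assumptions made for the example; only the `final_source` values and the priority rules come from the commit message and the module docstring.

```python
from typing import Dict, List, Optional, Tuple


def select_sku(
    sku_ids: List[str],
    matched_sources_by_sku: Dict[str, Dict[str, str]],
    image_pick: Optional[str],
) -> Tuple[Optional[str], str]:
    """Return (selected_sku_id, final_source) for one hit.

    Hypothetical input shape: matched_sources_by_sku maps each text-matched
    SKU id to its per-intent source, e.g. {"color": "option"}. SKUs that
    failed any active intent are absent from the mapping.
    """
    text_matched = [s for s in sku_ids if s in matched_sources_by_sku]
    if text_matched:
        # The image pick is only a tie-break inside the text-matched set.
        chosen = image_pick if image_pick in text_matched else text_matched[0]
        sources = matched_sources_by_sku[chosen].values()
        # Any intent satisfied only via taxonomy degrades the overall source.
        final_source = "taxonomy" if "taxonomy" in sources else "option"
        return chosen, final_source
    if image_pick is not None:
        # No text evidence: fall through to the image-nearest SKU.
        return image_pick, "image"
    return None, "none"
```

Because every branch returns immediately, there is no second stage to fall back to, which is the "no cascading fallback" property the first bullet describes.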

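## Attribute-value matching (brackets and separators)
- Before tokenization, the normalized option/taxonomy string goes through _with_segment_boundaries_for_matching: full-/half-width brackets, slashes, the enumeration comma (、), Chinese and English punctuation, middle dots, and the various dashes are replaced with spaces, followed by simple_tokenize_query + a sliding window; a run of Chinese characters with no separators still uses the pure-Chinese substring fallback (e.g. 卡其色棉).
- Parameterized tests cover a range of bracket styles and common e-commerce separator conventions.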
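
A reduced sketch of this matching flow is shown below. It is an assumption-laden stand-in: `segment` and `matches` are hypothetical names, `str.lower()` plus whitespace splitting replaces the repository's `normalize_query_text` / `simple_tokenize_query`, and the separator class is a small subset of the commit's pattern.

```python
import re
from typing import Tuple

# Illustrative subset of separators; the commit's pattern covers more characters.
_BOUNDARY_RE = re.compile(
    r"[\s()\[\]{}（）【】/\\|、,，;；:：·•\u2010-\u2015\u2212]"
)
_NON_HAN_RE = re.compile(r"[^\u4e00-\u9fff]")


def segment(value: str) -> Tuple[str, ...]:
    """Replace separators with spaces, then split into tokens."""
    return tuple(_BOUNDARY_RE.sub(" ", value.lower()).split())


def matches(term: str, value: str) -> bool:
    term_tokens = segment(term)
    value_tokens = segment(value)
    if not term_tokens or not value_tokens:
        return False
    n = len(term_tokens)
    # Sliding window: the term's tokens must appear as a contiguous run.
    for i in range(len(value_tokens) - n + 1):
        if value_tokens[i:i + n] == term_tokens:
            return True
    # Pure-Han terms fall back to substring containment, since an undivided
    # CJK run such as 卡其色棉 stays a single token.
    if not _NON_HAN_RE.search(term):
        return term in value
    return False
```

Restricting the substring fallback to pure-Han terms keeps a Latin term such as `l` from matching inside `xl`.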

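## Orchestration and configuration
- searcher: _should_run_sku_selection = style intent active OR an image_query_vector is present (with image_embedding_field configured); the prefetch _source includes skus, the option names, and, when a style intent is active, enriched_taxonomy_attributes.
- es_query_builder: the nested clauses of image knn / exact image rescore carry inner_hits.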
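
On the consuming side, the winning image URL can be read from a hit's `inner_hits` and aligned with a SKU. The sketch below assumes the standard Elasticsearch response layout for nested inner hits; the SKU field name `sku_id`, the exact string comparison of URLs, and the function names are assumptions for illustration (the commit imports `urlsplit`, so the real comparison may normalize URLs first).

```python
from typing import Any, Dict, Optional, Tuple

# Preference order taken from the module docstring in this commit.
_INNER_HITS_KEYS = ("exact_image_knn_query_hits", "image_knn_query_hits")


def should_run_sku_selection(style_intent_active: bool, has_image_vector: bool) -> bool:
    """Either signal alone is enough to trigger SKU selection."""
    return style_intent_active or has_image_vector


def pick_sku_by_image(hit: Dict[str, Any]) -> Optional[Tuple[str, str, float]]:
    """Return (sku_id, url, score) for the SKU whose image_src equals the top url."""
    inner_hits = hit.get("inner_hits") or {}
    skus = (hit.get("_source") or {}).get("skus") or []
    for key in _INNER_HITS_KEYS:
        entries = ((inner_hits.get(key) or {}).get("hits") or {}).get("hits") or []
        if not entries:
            continue
        top = entries[0]  # inner_hits size is 1, so this is the best image
        url = (top.get("_source") or {}).get("url")
        if not url:
            continue
        for sku in skus:
            # Assumption: exact string equality between url and image_src.
            if sku.get("image_src") == url:
                return str(sku.get("sku_id")), url, float(top.get("_score") or 0.0)
    return None
```

Reading the URL from `inner_hits` avoids a second Elasticsearch round-trip, which is the rationale given in the query-builder comments.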

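## Tests and repository
- Updated tests/test_sku_intent_selector.py and tests/test_search_rerank_window.py; removed the obsolete embedding-fallback integration assertions.
- .gitignore: ignore artifacts/search_evaluation/datasets/ (large local evaluation datasets, to avoid accidental commits).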

Made-with: Cursor
@@ -82,3 +82,4 @@ model_cache/ @@ -82,3 +82,4 @@ model_cache/
82 artifacts/search_evaluation/*.sqlite3 82 artifacts/search_evaluation/*.sqlite3
83 artifacts/search_evaluation/batch_reports/ 83 artifacts/search_evaluation/batch_reports/
84 artifacts/search_evaluation/tuning_runs/ 84 artifacts/search_evaluation/tuning_runs/
  85 +artifacts/search_evaluation/datasets/
@@ -276,6 +276,14 @@ class SpuResult(BaseModel): @@ -276,6 +276,14 @@ class SpuResult(BaseModel):
276 None, 276 None,
277 description="规格列表(与 ES specifications 字段对应)" 277 description="规格列表(与 ES specifications 字段对应)"
278 ) 278 )
  279 + enriched_attributes: Optional[List[Dict[str, Any]]] = Field(
  280 + None,
  281 + description="LLM 富化属性(ES enriched_attributes 字段)"
  282 + )
  283 + enriched_taxonomy_attributes: Optional[List[Dict[str, Any]]] = Field(
  284 + None,
  285 + description="类目体系化属性(ES enriched_taxonomy_attributes 字段,例如 Color/Material)"
  286 + )
279 skus: List[SkuResult] = Field(default_factory=list, description="SKU列表") 287 skus: List[SkuResult] = Field(default_factory=list, description="SKU列表")
280 relevance_score: float = Field(..., ge=0.0, description="相关性分数(ES原始分数)") 288 relevance_score: float = Field(..., ge=0.0, description="相关性分数(ES原始分数)")
281 289
api/result_formatter.py
@@ -150,6 +150,8 @@ class ResultFormatter: @@ -150,6 +150,8 @@ class ResultFormatter:
150 option2_name=source.get('option2_name'), 150 option2_name=source.get('option2_name'),
151 option3_name=source.get('option3_name'), 151 option3_name=source.get('option3_name'),
152 specifications=source.get('specifications'), 152 specifications=source.get('specifications'),
  153 + enriched_attributes=source.get('enriched_attributes'),
  154 + enriched_taxonomy_attributes=source.get('enriched_taxonomy_attributes'),
153 skus=skus, 155 skus=skus,
154 relevance_score=relevance_score 156 relevance_score=relevance_score
155 ) 157 )
config/config.yaml
@@ -224,8 +224,8 @@ query_config: @@ -224,8 +224,8 @@ query_config:
224 # - keywords 224 # - keywords
225 # - qanchors 225 # - qanchors
226 # - enriched_tags 226 # - enriched_tags
227 - # - enriched_attributes  
228 - # - # enriched_taxonomy_attributes.value 227 + - enriched_attributes
  228 + - enriched_taxonomy_attributes
229 229
230 - min_price 230 - min_price
231 - compare_at_price 231 - compare_at_price
docs/搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md
@@ -275,7 +275,7 @@ curl "http://localhost:6007/health" @@ -275,7 +275,7 @@ curl "http://localhost:6007/health"
275 | `target_lang` | string | Y | 目标语言:`zh`、`en`、`ru` 等 | 275 | `target_lang` | string | Y | 目标语言:`zh`、`en`、`ru` 等 |
276 | `source_lang` | string | N | 源语言。云端模型可不传;`nllb-200-distilled-600m` 建议显式传入 | 276 | `source_lang` | string | N | 源语言。云端模型可不传;`nllb-200-distilled-600m` 建议显式传入 |
277 | `model` | string | N | 已启用 capability 名称,如 `qwen-mt`、`llm`、`deepl`、`nllb-200-distilled-600m`、`opus-mt-zh-en`、`opus-mt-en-zh` | 277 | `model` | string | N | 已启用 capability 名称,如 `qwen-mt`、`llm`、`deepl`、`nllb-200-distilled-600m`、`opus-mt-zh-en`、`opus-mt-en-zh` |
278 -| `scene` | string | N | 翻译场景参数,与 `model` 配套使用;当前标准值为 `sku_name`、`ecommerce_search_query`、`general` | 278 +| `scene` | string | N | 翻译场景参数,与 `model` 配套使用;当前标准值为 `sku_name`、`ecommerce_search_query`、`sku_attribute`、`general` |
279 279
280 说明: 280 说明:
281 - 外部接口不接受 `prompt`;LLM prompt 由服务端按 `scene` 自动生成。 281 - 外部接口不接受 `prompt`;LLM prompt 由服务端按 `scene` 自动生成。
@@ -287,7 +287,7 @@ curl "http://localhost:6007/health" @@ -287,7 +287,7 @@ curl "http://localhost:6007/health"
287 - 如果是en-zh互译、期待更高的速度,可以考虑`opus-mt-zh-en` / `opus-mt-en-zh`。(质量未详细评测,一些文章说比blib-200-600m更好,但是我看了些case感觉要差不少) 287 - 如果是en-zh互译、期待更高的速度,可以考虑`opus-mt-zh-en` / `opus-mt-en-zh`。(质量未详细评测,一些文章说比blib-200-600m更好,但是我看了些case感觉要差不少)
288 288
289 **实时翻译选型建议**: 289 **实时翻译选型建议**:
290 -- 在线 query 翻译如果只是 `en/zh` 互译,极致要求耗时使用 `opus-mt-zh-en / opus-mt-en-zh`,`nllb-200-distilled-600m`支持多语言,效果略好一点,但是耗时长很多(70-150ms之间 290 +- 在线 query 翻译如果只是 `en/zh` 互译,极致要求耗时使用 `opus-mt-zh-en / opus-mt-en-zh`,`nllb-200-distilled-600m`支持多语言,效果略好一点,但是耗时长很多(120-190ms左右
291 - 如果涉及其他语言,或对质量要求高于本地轻量模型,优先考虑 `deepl`。 291 - 如果涉及其他语言,或对质量要求高于本地轻量模型,优先考虑 `deepl`。
292 292
293 **Batch Size / 调用方式建议**: 293 **Batch Size / 调用方式建议**:
query/style_intent.py
@@ -134,6 +134,10 @@ class DetectedStyleIntent: @@ -134,6 +134,10 @@ class DetectedStyleIntent:
134 matched_query_text: str 134 matched_query_text: str
135 attribute_terms: Tuple[str, ...] 135 attribute_terms: Tuple[str, ...]
136 dimension_aliases: Tuple[str, ...] 136 dimension_aliases: Tuple[str, ...]
  137 + # Union of zh_terms + en_terms + attribute_terms for the matched term definition.
  138 + # Downstream SKU-selection treats every entry as a valid attribute-value match candidate
  139 + # so a Chinese user query like "卡其色" can match a Chinese option value "卡其色裙".
  140 + all_terms: Tuple[str, ...] = ()
137 141
138 def to_dict(self) -> Dict[str, Any]: 142 def to_dict(self) -> Dict[str, Any]:
139 return { 143 return {
@@ -143,8 +147,14 @@ class DetectedStyleIntent: @@ -143,8 +147,14 @@ class DetectedStyleIntent:
143 "matched_query_text": self.matched_query_text, 147 "matched_query_text": self.matched_query_text,
144 "attribute_terms": list(self.attribute_terms), 148 "attribute_terms": list(self.attribute_terms),
145 "dimension_aliases": list(self.dimension_aliases), 149 "dimension_aliases": list(self.dimension_aliases),
  150 + "all_terms": list(self.all_terms),
146 } 151 }
147 152
  153 + @property
  154 + def matching_terms(self) -> Tuple[str, ...]:
  155 + """Terms usable for attribute-value matching; falls back to attribute_terms for old callers."""
  156 + return self.all_terms or self.attribute_terms
  157 +
148 158
149 @dataclass(frozen=True) 159 @dataclass(frozen=True)
150 class StyleIntentProfile: 160 class StyleIntentProfile:
@@ -370,6 +380,15 @@ class StyleIntentDetector: @@ -370,6 +380,15 @@ class StyleIntentDetector:
370 if pair in seen_pairs: 380 if pair in seen_pairs:
371 continue 381 continue
372 seen_pairs.add(pair) 382 seen_pairs.add(pair)
  383 + all_terms = tuple(
  384 + dict.fromkeys(
  385 + (
  386 + *term_definition.zh_terms,
  387 + *term_definition.en_terms,
  388 + *term_definition.attribute_terms,
  389 + )
  390 + )
  391 + )
373 detected.append( 392 detected.append(
374 DetectedStyleIntent( 393 DetectedStyleIntent(
375 intent_type=intent_type, 394 intent_type=intent_type,
@@ -378,6 +397,7 @@ class StyleIntentDetector: @@ -378,6 +397,7 @@ class StyleIntentDetector:
378 matched_query_text=variant.text, 397 matched_query_text=variant.text,
379 attribute_terms=term_definition.attribute_terms, 398 attribute_terms=term_definition.attribute_terms,
380 dimension_aliases=definition.dimension_aliases, 399 dimension_aliases=definition.dimension_aliases,
  400 + all_terms=all_terms,
381 ) 401 )
382 ) 402 )
383 break 403 break
search/es_query_builder.py
@@ -213,6 +213,13 @@ class ESQueryBuilder: @@ -213,6 +213,13 @@ class ESQueryBuilder:
213 "_name": query_name, 213 "_name": query_name,
214 "query": {"knn": image_knn_query}, 214 "query": {"knn": image_knn_query},
215 "score_mode": "max", 215 "score_mode": "max",
  216 + # Expose the best-matching image entry (url, score) so SKU selection
  217 + # can promote the SKU whose image_src matches the winning url.
  218 + "inner_hits": {
  219 + "name": f"{query_name}_hits",
  220 + "size": 1,
  221 + "_source": ["url"],
  222 + },
216 } 223 }
217 } 224 }
218 return { 225 return {
@@ -276,6 +283,13 @@ class ESQueryBuilder: @@ -276,6 +283,13 @@ class ESQueryBuilder:
276 "_name": query_name, 283 "_name": query_name,
277 "score_mode": "max", 284 "score_mode": "max",
278 "query": {"script_score": script_score_query}, 285 "query": {"script_score": script_score_query},
  286 + # Same rationale as build_image_knn_clause: carry the winning url + score
  287 + # so downstream SKU selection can consume it without a second ES round-trip.
  288 + "inner_hits": {
  289 + "name": f"{query_name}_hits",
  290 + "size": 1,
  291 + "_source": ["url"],
  292 + },
279 } 293 }
280 } 294 }
281 return {"script_score": {"_name": query_name, **script_score_query}} 295 return {"script_score": {"_name": query_name, **script_score_query}}
search/searcher.py
@@ -354,7 +354,9 @@ class Searcher: @@ -354,7 +354,9 @@ class Searcher:
354 if not includes: 354 if not includes:
355 includes.add("title") 355 includes.add("title")
356 356
357 - if self._has_style_intent(parsed_query): 357 + if self._should_run_sku_selection(parsed_query):
  358 + # SKU-level fields are needed both by text matching (optionN_value) and
  359 + # by the image pick (image_src) of the unified SKU selector.
358 includes.update( 360 includes.update(
359 { 361 {
360 "skus", 362 "skus",
@@ -363,6 +365,10 @@ class Searcher: @@ -363,6 +365,10 @@ class Searcher:
363 "option3_name", 365 "option3_name",
364 } 366 }
365 ) 367 )
  368 + if self._has_style_intent(parsed_query):
  369 + # Treated as an additional value source for attribute matching
  370 + # (on the same dimension as optionN).
  371 + includes.add("enriched_taxonomy_attributes")
366 372
367 return {"includes": sorted(includes)} 373 return {"includes": sorted(includes)}
368 374
@@ -435,6 +441,23 @@ class Searcher: @@ -435,6 +441,23 @@ class Searcher:
435 profile = getattr(parsed_query, "style_intent_profile", None) 441 profile = getattr(parsed_query, "style_intent_profile", None)
436 return bool(getattr(profile, "is_active", False)) 442 return bool(getattr(profile, "is_active", False))
437 443
  444 + def _has_image_signal(self, parsed_query: Optional[ParsedQuery]) -> bool:
  445 + """True when the query carries an image vector that can drive an image-based SKU pick."""
  446 + if parsed_query is None:
  447 + return False
  448 + if not getattr(self.config.query_config, "image_embedding_field", None):
  449 + return False
  450 + return getattr(parsed_query, "image_query_vector", None) is not None
  451 +
  452 + def _should_run_sku_selection(self, parsed_query: Optional[ParsedQuery]) -> bool:
  453 + """Trigger unified SKU selection when either signal is present.
  454 +
  455 + Text-intent alone drives attribute-value matching; image signal alone drives
  456 + image-nearest SKU promotion; together, image is a visual tie-breaker inside
  457 + the text-matched set.
  458 + """
  459 + return self._has_style_intent(parsed_query) or self._has_image_signal(parsed_query)
  460 +
438 def _apply_style_intent_to_hits( 461 def _apply_style_intent_to_hits(
439 self, 462 self,
440 es_hits: List[Dict[str, Any]], 463 es_hits: List[Dict[str, Any]],
@@ -1067,7 +1090,7 @@ class Searcher: @@ -1067,7 +1090,7 @@ class Searcher:
1067 if fill_took: 1090 if fill_took:
1068 es_response["took"] = int((es_response.get("took", 0) or 0) + fill_took) 1091 es_response["took"] = int((es_response.get("took", 0) or 0) + fill_took)
1069 1092
1070 - if self._has_style_intent(parsed_query): 1093 + if self._should_run_sku_selection(parsed_query):
1071 style_intent_decisions = self._apply_style_intent_to_hits( 1094 style_intent_decisions = self._apply_style_intent_to_hits(
1072 es_response.get("hits", {}).get("hits") or [], 1095 es_response.get("hits", {}).get("hits") or [],
1073 parsed_query, 1096 parsed_query,
@@ -1075,7 +1098,7 @@ class Searcher: @@ -1075,7 +1098,7 @@ class Searcher:
1075 ) 1098 )
1076 if style_intent_decisions: 1099 if style_intent_decisions:
1077 context.logger.info( 1100 context.logger.info(
1078 - "款式意图 SKU 预筛选完成 | hits=%s", 1101 + "SKU 选择预处理完成 | hits=%s",
1079 len(style_intent_decisions), 1102 len(style_intent_decisions),
1080 extra={'reqid': context.reqid, 'uid': context.uid} 1103 extra={'reqid': context.reqid, 'uid': context.uid}
1081 ) 1104 )
@@ -1221,8 +1244,8 @@ class Searcher: @@ -1221,8 +1244,8 @@ class Searcher:
1221 extra={'reqid': context.reqid, 'uid': context.uid} 1244 extra={'reqid': context.reqid, 'uid': context.uid}
1222 ) 1245 )
1223 1246
1224 - # 非重排窗口:款式意图在 result_processing 之前执行,便于单独计时且与 ES 召回阶段衔接  
1225 - if self._has_style_intent(parsed_query) and not in_rank_window: 1247 + # 非重排窗口:SKU 选择(款式意图 OR 图像信号)在 result_processing 之前执行,便于单独计时
  1248 + if self._should_run_sku_selection(parsed_query) and not in_rank_window:
1226 es_hits_pre = es_response.get("hits", {}).get("hits") or [] 1249 es_hits_pre = es_response.get("hits", {}).get("hits") or []
1227 style_intent_decisions = self._apply_style_intent_to_hits( 1250 style_intent_decisions = self._apply_style_intent_to_hits(
1228 es_hits_pre, 1251 es_hits_pre,
@@ -1251,12 +1274,11 @@ class Searcher: @@ -1251,12 +1274,11 @@ class Searcher:
1251 coarse_debug_by_doc = _index_debug_rows_by_doc(context.get_intermediate_result('coarse_rank_scores', None)) 1274 coarse_debug_by_doc = _index_debug_rows_by_doc(context.get_intermediate_result('coarse_rank_scores', None))
1252 fine_debug_by_doc = _index_debug_rows_by_doc(context.get_intermediate_result('fine_rank_scores', None)) 1275 fine_debug_by_doc = _index_debug_rows_by_doc(context.get_intermediate_result('fine_rank_scores', None))
1253 1276
1254 - if self._has_style_intent(parsed_query):  
1255 - if style_intent_decisions:  
1256 - self.style_sku_selector.apply_precomputed_decisions(  
1257 - es_hits,  
1258 - style_intent_decisions,  
1259 - ) 1277 + if self._should_run_sku_selection(parsed_query) and style_intent_decisions:
  1278 + self.style_sku_selector.apply_precomputed_decisions(
  1279 + es_hits,
  1280 + style_intent_decisions,
  1281 + )
1260 1282
1261 # Format results using ResultFormatter 1283 # Format results using ResultFormatter
1262 formatted_results = ResultFormatter.format_search_results( 1284 formatted_results = ResultFormatter.format_search_results(
search/sku_intent_selector.py
1 """ 1 """
2 -SKU selection for style-intent-aware search results. 2 +SKU selection for style-intent-aware and image-aware search results.
  3 +
  4 +Unified algorithm (one pass per hit, no cascading fallback stages):
  5 +
  6 +1. Per active style intent, a SKU's attribute value for that dimension comes
  7 + from ONE of two sources, in priority order:
  8 + - ``option``: the SKU's own ``optionN_value`` on the slot resolved by the
  9 + intent's dimension aliases — authoritative whenever non-empty.
  10 + - ``taxonomy``: the SPU-level ``enriched_taxonomy_attributes`` value on the
  11 + same dimension — used only when the SKU has no own value (slot unresolved
  12 + or value empty). Never overrides a contradicting SKU-level value.
  13 +2. A SKU is "text-matched" iff every active intent finds a match on its
  14 + selected value source (tokens of zh/en/attribute synonyms; values are first
  15 + passed through ``_with_segment_boundaries_for_matching`` so brackets and
  16 + common separators split segments; pure-CJK terms still use a substring
  17 + fallback when the value is one undivided CJK run, e.g. ``卡其色棉``). We
  18 + remember the matching source and the raw matched
  19 + text per intent so the final decision can surface it.
  20 +3. The image-pick comes straight from the nested ``image_embedding`` inner_hits
  21 + (``exact_image_knn_query_hits`` preferred, ``image_knn_query_hits``
  22 + otherwise): the SKU whose ``image_src`` equals the top-scoring url.
  23 +4. Unified selection:
  24 + - if the text-matched set is non-empty → pick image_pick when it lies in
  25 + that set (visual tie-break among text-matched), otherwise the first
  26 + text-matched SKU;
  27 + - else → pick image_pick if any;
  28 + - else → no decision (``final_source == "none"``).
  29 +
  30 +``final_source`` values (strongest → weakest text evidence):
  31 + ``option`` > ``taxonomy`` > ``image`` > ``none``. If any intent was satisfied
  32 + only via taxonomy the overall source degrades to ``taxonomy`` so downstream
  33 + callers can decide whether to differentiate the SPU-level signal from a
  34 + true SKU-level option match.
  35 +
  36 +No embedding fallback, no stage cascade, no score thresholds.
3 """ 37 """
4 38
5 from __future__ import annotations 39 from __future__ import annotations
6 40
7 from dataclasses import dataclass, field 41 from dataclasses import dataclass, field
8 from typing import Any, Callable, Dict, List, Optional, Tuple 42 from typing import Any, Callable, Dict, List, Optional, Tuple
  43 +from urllib.parse import urlsplit
  44 +
  45 +from query.style_intent import (
  46 + DetectedStyleIntent,
  47 + StyleIntentProfile,
  48 + StyleIntentRegistry,
  49 +)
  50 +from query.tokenization import (
  51 + contains_han_text,
  52 + normalize_query_text,
  53 + simple_tokenize_query,
  54 +)
  55 +
  56 +import re
  57 +
  58 +_NON_HAN_RE = re.compile(r"[^\u4e00-\u9fff]")
  59 +# Zero-width / BOM (often pasted from Excel or CMS).
  60 +_ZW_AND_BOM_RE = re.compile(r"[\u200b-\u200d\ufeff\u2060]")
  61 +# Brackets, slashes, and common commerce/list punctuation → segment boundaries so
  62 +# tokenization can align intent terms (e.g. 卡其色) with the leading segment of
  63 +# 卡其色(无内衬) / 卡其色/常规 / 卡其色·麻, etc., without relying only on substring.
  64 +_ATTRIBUTE_BOUNDARY_RE = re.compile(
  65 + r"[\s\u3000]" # ASCII / ideographic space
  66 + r"|[\(\)\[\]\{\}（）【】｛｝〈〉《》「」『』［］｢｣]"
  67 + r"|[/\\|｜／＼︱丨]"
  68 + r"|[,，、;；:：.。]"
  69 + r"|[·•・]"
  70 + r"|[~～]"
  71 + r"|[+\=#%&*×※]"
  72 + r"|[\u2010-\u2015\u2212]" # hyphen, en dash, minus, etc.
  73 +)
  74 +
  75 +
  76 +def _is_pure_han(value: str) -> bool:
  77 + """True if the string is non-empty and contains only CJK Unified Ideographs."""
  78 + return bool(value) and not _NON_HAN_RE.search(value)
  79 +
  80 +
  81 +def _with_segment_boundaries_for_matching(normalized_value: str) -> str:
  82 + """Normalize commerce-style option/taxonomy strings for token matching.
  83 +
  84 + Inserts word boundaries at brackets and typical separators so
  85 + ``simple_tokenize_query`` yields segments like ``['卡其色', '无内衬']`` instead
  86 + of one undifferentiated CJK blob when unusual punctuation appears.
  87 + """
  88 + if not normalized_value:
  89 + return ""
  90 + s = _ZW_AND_BOM_RE.sub("", normalized_value)
  91 + s = _ATTRIBUTE_BOUNDARY_RE.sub(" ", s)
  92 + return " ".join(s.split())
  93 +
  94 +
  95 +_IMAGE_INNER_HITS_KEYS: Tuple[str, ...] = (
  96 + "exact_image_knn_query_hits",
  97 + "image_knn_query_hits",
  98 +)
9 99
10 -from query.style_intent import StyleIntentProfile, StyleIntentRegistry  
11 -from query.tokenization import normalize_query_text, simple_tokenize_query 100 +
  101 +@dataclass(frozen=True)
  102 +class ImagePick:
  103 + sku_id: str
  104 + url: str
  105 + score: float
12 106
13 107
14 @dataclass(frozen=True) 108 @dataclass(frozen=True)
@@ -16,31 +110,52 @@ class SkuSelectionDecision: @@ -16,31 +110,52 @@ class SkuSelectionDecision:
16 selected_sku_id: Optional[str] 110 selected_sku_id: Optional[str]
17 rerank_suffix: str 111 rerank_suffix: str
18 selected_text: str 112 selected_text: str
19 - matched_stage: str  
20 - similarity_score: Optional[float] = None 113 + # "option" | "taxonomy" | "image" | "none"
  114 + final_source: str
21 resolved_dimensions: Dict[str, Optional[str]] = field(default_factory=dict) 115 resolved_dimensions: Dict[str, Optional[str]] = field(default_factory=dict)
  116 + # Per-intent matching-source breakdown, e.g. {"color": "option", "size": "taxonomy"}.
  117 + matched_sources: Dict[str, str] = field(default_factory=dict)
  118 + image_pick_sku_id: Optional[str] = None
  119 + image_pick_url: Optional[str] = None
  120 + image_pick_score: Optional[float] = None
  121 +
  122 + # Backward-compat alias; some older callers/tests look at ``matched_stage``.
  123 + @property
  124 + def matched_stage(self) -> str:
  125 + return self.final_source
22 126
23 def to_dict(self) -> Dict[str, Any]: 127 def to_dict(self) -> Dict[str, Any]:
24 return { 128 return {
25 "selected_sku_id": self.selected_sku_id, 129 "selected_sku_id": self.selected_sku_id,
26 "rerank_suffix": self.rerank_suffix, 130 "rerank_suffix": self.rerank_suffix,
27 "selected_text": self.selected_text, 131 "selected_text": self.selected_text,
28 - "matched_stage": self.matched_stage,  
29 - "similarity_score": self.similarity_score, 132 + "final_source": self.final_source,
  133 + "matched_sources": dict(self.matched_sources),
30 "resolved_dimensions": dict(self.resolved_dimensions), 134 "resolved_dimensions": dict(self.resolved_dimensions),
  135 + "image_pick": (
  136 + {
  137 + "sku_id": self.image_pick_sku_id,
  138 + "url": self.image_pick_url,
  139 + "score": self.image_pick_score,
  140 + }
  141 + if self.image_pick_sku_id or self.image_pick_url
  142 + else None
  143 + ),
31 } 144 }
32 145
33 146
34 @dataclass 147 @dataclass
35 class _SelectionContext: 148 class _SelectionContext:
36 - attribute_terms_by_intent: Dict[str, Tuple[str, ...]] 149 + """Request-scoped memo for term tokenization and substring match probes."""
  150 +
  151 + terms_by_intent: Dict[str, Tuple[str, ...]]
37 normalized_text_cache: Dict[str, str] = field(default_factory=dict) 152 normalized_text_cache: Dict[str, str] = field(default_factory=dict)
38 tokenized_text_cache: Dict[str, Tuple[str, ...]] = field(default_factory=dict) 153 tokenized_text_cache: Dict[str, Tuple[str, ...]] = field(default_factory=dict)
39 text_match_cache: Dict[Tuple[str, str], bool] = field(default_factory=dict) 154 text_match_cache: Dict[Tuple[str, str], bool] = field(default_factory=dict)
40 155
41 156
42 class StyleSkuSelector: 157 class StyleSkuSelector:
43 - """Selects the best SKU for an SPU based on detected style intent.""" 158 + """Selects the best SKU per hit from style-intent text match + image KNN."""
44 159
45 def __init__( 160 def __init__(
46 self, 161 self,
@@ -49,29 +164,47 @@ class StyleSkuSelector: @@ -49,29 +164,47 @@ class StyleSkuSelector:
49 text_encoder_getter: Optional[Callable[[], Any]] = None, 164 text_encoder_getter: Optional[Callable[[], Any]] = None,
50 ) -> None: 165 ) -> None:
51 self.registry = registry 166 self.registry = registry
  167 + # Retained for API back-compat; no longer used now that embedding fallback is gone.
52 self._text_encoder_getter = text_encoder_getter 168 self._text_encoder_getter = text_encoder_getter
53 169
  170 + # ------------------------------------------------------------------
  171 + # Public entry points
  172 + # ------------------------------------------------------------------
54 def prepare_hits( 173 def prepare_hits(
55 self, 174 self,
56 es_hits: List[Dict[str, Any]], 175 es_hits: List[Dict[str, Any]],
57 parsed_query: Any, 176 parsed_query: Any,
58 ) -> Dict[str, SkuSelectionDecision]: 177 ) -> Dict[str, SkuSelectionDecision]:
  178 + """Compute selection decisions (without mutating ``_source``).
  179 +
  180 + Runs if either a style intent is active OR any hit carries image
  181 + inner_hits. Decisions are keyed by ES ``_id`` and meant to be applied
  182 + later via :meth:`apply_precomputed_decisions` (after page fill).
  183 + """
59 decisions: Dict[str, SkuSelectionDecision] = {} 184 decisions: Dict[str, SkuSelectionDecision] = {}
60 style_profile = getattr(parsed_query, "style_intent_profile", None) 185 style_profile = getattr(parsed_query, "style_intent_profile", None)
61 - if not isinstance(style_profile, StyleIntentProfile) or not style_profile.is_active:  
62 - return decisions  
63 -  
64 - selection_context = self._build_selection_context(style_profile) 186 + style_active = (
  187 + isinstance(style_profile, StyleIntentProfile) and style_profile.is_active
  188 + )
  189 + selection_context = (
  190 + self._build_selection_context(style_profile) if style_active else None
  191 + )
65 192
66 for hit in es_hits: 193 for hit in es_hits:
67 source = hit.get("_source") 194 source = hit.get("_source")
68 if not isinstance(source, dict): 195 if not isinstance(source, dict):
69 continue 196 continue
70 197
71 - decision = self._select_for_source(  
72 - source,  
73 - style_profile=style_profile, 198 + image_pick = self._pick_sku_by_image(hit, source)
  199 + if not style_active and image_pick is None:
  200 + # Nothing to do for this hit.
  201 + continue
  202 +
  203 + decision = self._select(
  204 + source=source,
  205 + style_profile=style_profile if style_active else None,
74 selection_context=selection_context, 206 selection_context=selection_context,
  207 + image_pick=image_pick,
75 ) 208 )
76 if decision is None: 209 if decision is None:
77 continue 210 continue
@@ -94,7 +227,6 @@ class StyleSkuSelector: @@ -94,7 +227,6 @@ class StyleSkuSelector:
94 ) -> None: 227 ) -> None:
95 if not es_hits or not decisions: 228 if not es_hits or not decisions:
96 return 229 return
97 -  
98 for hit in es_hits: 230 for hit in es_hits:
99 doc_id = hit.get("_id") 231 doc_id = hit.get("_id")
100 if doc_id is None: 232 if doc_id is None:
@@ -111,122 +243,90 @@ class StyleSkuSelector: @@ -111,122 +243,90 @@ class StyleSkuSelector:
111 else: 243 else:
112 hit.pop("_style_rerank_suffix", None) 244 hit.pop("_style_rerank_suffix", None)
113 245
  246 + # ------------------------------------------------------------------
  247 + # Selection context & text matching
  248 + # ------------------------------------------------------------------
114 def _build_selection_context( 249 def _build_selection_context(
115 self, 250 self,
116 style_profile: StyleIntentProfile, 251 style_profile: StyleIntentProfile,
117 ) -> _SelectionContext: 252 ) -> _SelectionContext:
118 - attribute_terms_by_intent: Dict[str, List[str]] = {} 253 + terms_by_intent: Dict[str, List[str]] = {}
119 for intent in style_profile.intents: 254 for intent in style_profile.intents:
120 - terms = attribute_terms_by_intent.setdefault(intent.intent_type, [])  
121 - for raw_term in intent.attribute_terms: 255 + terms = terms_by_intent.setdefault(intent.intent_type, [])
  256 + for raw_term in intent.matching_terms:
122 normalized_term = normalize_query_text(raw_term) 257 normalized_term = normalize_query_text(raw_term)
123 - if not normalized_term or normalized_term in terms:  
124 - continue  
125 - terms.append(normalized_term)  
126 - 258 + if normalized_term and normalized_term not in terms:
  259 + terms.append(normalized_term)
127 return _SelectionContext( 260 return _SelectionContext(
128 - attribute_terms_by_intent={ 261 + terms_by_intent={
129 intent_type: tuple(terms) 262 intent_type: tuple(terms)
130 - for intent_type, terms in attribute_terms_by_intent.items() 263 + for intent_type, terms in terms_by_intent.items()
131 }, 264 },
132 ) 265 )
133 266
134 - @staticmethod  
135 - def _normalize_cached(selection_context: _SelectionContext, value: Any) -> str: 267 + def _normalize_cached(self, ctx: _SelectionContext, value: Any) -> str:
136 raw = str(value or "").strip() 268 raw = str(value or "").strip()
137 if not raw: 269 if not raw:
138 return "" 270 return ""
139 - cached = selection_context.normalized_text_cache.get(raw) 271 + cached = ctx.normalized_text_cache.get(raw)
140 if cached is not None: 272 if cached is not None:
141 return cached 273 return cached
142 normalized = normalize_query_text(raw) 274 normalized = normalize_query_text(raw)
143 - selection_context.normalized_text_cache[raw] = normalized 275 + ctx.normalized_text_cache[raw] = normalized
144 return normalized 276 return normalized
145 277
146 - def _resolve_dimensions(  
147 - self,  
148 - source: Dict[str, Any],  
149 - style_profile: StyleIntentProfile,  
150 - ) -> Dict[str, Optional[str]]:  
151 - option_names = {  
152 - "option1_value": normalize_query_text(source.get("option1_name")),  
153 - "option2_value": normalize_query_text(source.get("option2_name")),  
154 - "option3_value": normalize_query_text(source.get("option3_name")),  
155 - }  
156 - resolved: Dict[str, Optional[str]] = {}  
157 - for intent in style_profile.intents:  
158 - if intent.intent_type in resolved:  
159 - continue  
160 - aliases = set(intent.dimension_aliases or self.registry.get_dimension_aliases(intent.intent_type))  
161 - matched_field = None  
162 - for field_name, option_name in option_names.items():  
163 - if option_name and option_name in aliases:  
164 - matched_field = field_name  
165 - break  
166 - resolved[intent.intent_type] = matched_field  
167 - return resolved  
168 -  
169 - @staticmethod  
170 - def _empty_decision(  
171 - resolved_dimensions: Dict[str, Optional[str]],  
172 - matched_stage: str,  
173 - ) -> SkuSelectionDecision:  
174 - return SkuSelectionDecision(  
175 - selected_sku_id=None,  
176 - rerank_suffix="",  
177 - selected_text="",  
178 - matched_stage=matched_stage,  
179 - resolved_dimensions=dict(resolved_dimensions), 278 + def _tokenize_cached(self, ctx: _SelectionContext, value: str) -> Tuple[str, ...]:
  279 + normalized_value = normalize_query_text(value)
  280 + if not normalized_value:
  281 + return ()
  282 + cached = ctx.tokenized_text_cache.get(normalized_value)
  283 + if cached is not None:
  284 + return cached
  285 + tokens = tuple(
  286 + normalize_query_text(token)
  287 + for token in simple_tokenize_query(normalized_value)
  288 + if token
180 ) 289 )
  290 + ctx.tokenized_text_cache[normalized_value] = tokens
  291 + return tokens
181 292
182 def _is_text_match( 293 def _is_text_match(
183 self, 294 self,
184 intent_type: str, 295 intent_type: str,
185 - selection_context: _SelectionContext, 296 + ctx: _SelectionContext,
186 *, 297 *,
187 normalized_value: str, 298 normalized_value: str,
188 ) -> bool: 299 ) -> bool:
  300 + """True iff any intent term token-boundary matches the given value."""
189 if not normalized_value: 301 if not normalized_value:
190 return False 302 return False
191 -  
192 cache_key = (intent_type, normalized_value) 303 cache_key = (intent_type, normalized_value)
193 - cached = selection_context.text_match_cache.get(cache_key) 304 + cached = ctx.text_match_cache.get(cache_key)
194 if cached is not None: 305 if cached is not None:
195 return cached 306 return cached
196 307
197 - attribute_terms = selection_context.attribute_terms_by_intent.get(intent_type, ())  
198 - value_tokens = self._tokenize_cached(selection_context, normalized_value) 308 + terms = ctx.terms_by_intent.get(intent_type, ())
  309 + segmented = _with_segment_boundaries_for_matching(normalized_value)
  310 + value_tokens = self._tokenize_cached(ctx, segmented)
199 matched = any( 311 matched = any(
200 self._matches_term_tokens( 312 self._matches_term_tokens(
201 term=term, 313 term=term,
202 value_tokens=value_tokens, 314 value_tokens=value_tokens,
203 - selection_context=selection_context, 315 + ctx=ctx,
204 normalized_value=normalized_value, 316 normalized_value=normalized_value,
205 ) 317 )
206 - for term in attribute_terms 318 + for term in terms
207 if term 319 if term
208 ) 320 )
209 - selection_context.text_match_cache[cache_key] = matched 321 + ctx.text_match_cache[cache_key] = matched
210 return matched 322 return matched
211 323
212 - @staticmethod  
213 - def _tokenize_cached(selection_context: _SelectionContext, value: str) -> Tuple[str, ...]:  
214 - normalized_value = normalize_query_text(value)  
215 - if not normalized_value:  
216 - return ()  
217 - cached = selection_context.tokenized_text_cache.get(normalized_value)  
218 - if cached is not None:  
219 - return cached  
220 - tokens = tuple(normalize_query_text(token) for token in simple_tokenize_query(normalized_value) if token)  
221 - selection_context.tokenized_text_cache[normalized_value] = tokens  
222 - return tokens  
223 -  
224 def _matches_term_tokens( 324 def _matches_term_tokens(
225 self, 325 self,
226 *, 326 *,
227 term: str, 327 term: str,
228 value_tokens: Tuple[str, ...], 328 value_tokens: Tuple[str, ...],
229 - selection_context: _SelectionContext, 329 + ctx: _SelectionContext,
230 normalized_value: str, 330 normalized_value: str,
231 ) -> bool: 331 ) -> bool:
232 normalized_term = normalize_query_text(term) 332 normalized_term = normalize_query_text(term)
@@ -234,8 +334,13 @@ class StyleSkuSelector: @@ -234,8 +334,13 @@ class StyleSkuSelector:
234 return False 334 return False
235 if normalized_term == normalized_value: 335 if normalized_term == normalized_value:
236 return True 336 return True
237 -  
238 - term_tokens = self._tokenize_cached(selection_context, normalized_term) 337 + # Pure-CJK terms can't be split further by the whitespace/regex tokenizer
  338 + # ("卡其色棉" is one token), so sliding-window token match would miss the prefix.
  339 + # Fall back to normalized substring containment — safe because this branch
  340 + # never triggers for Latin tokens where substring would cause "l" ⊂ "xl" issues.
  341 + if _is_pure_han(normalized_term) and contains_han_text(normalized_value):
  342 + return normalized_term in normalized_value
  343 + term_tokens = self._tokenize_cached(ctx, normalized_term)
239 if not term_tokens or not value_tokens: 344 if not term_tokens or not value_tokens:
240 return normalized_term in normalized_value 345 return normalized_term in normalized_value
241 346
@@ -243,106 +348,383 @@ class StyleSkuSelector: @@ -243,106 +348,383 @@ class StyleSkuSelector:
243 value_length = len(value_tokens) 348 value_length = len(value_tokens)
244 if term_length > value_length: 349 if term_length > value_length:
245 return False 350 return False
246 -  
247 for start in range(value_length - term_length + 1): 351 for start in range(value_length - term_length + 1):
248 - if value_tokens[start:start + term_length] == term_tokens: 352 + if value_tokens[start : start + term_length] == term_tokens:
249 return True 353 return True
250 return False 354 return False
251 355
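Review note: the matching strategy in `_matches_term_tokens` can be sketched in isolation as below. This is a minimal sketch, not the repository's implementation: `tokenize`, `is_pure_han`, `matches` and the separator set are hypothetical stand-ins for `simple_tokenize_query`, `_is_pure_han` and `_with_segment_boundaries_for_matching`, whose real behavior may differ.

```python
import re
from typing import Tuple

# Hypothetical separator set (assumption): half/full-width brackets, slash,
# pipe, commas, hyphen, middle dot and long dash are treated as boundaries.
_SEPARATORS = re.compile(
    r"[()\[\]/|,;\-\u3001\u3010\u3011\uff08\uff09\uff0c\u00b7\u2014]+"
)
_HAN = re.compile(r"[\u4e00-\u9fff]")


def tokenize(text: str) -> Tuple[str, ...]:
    """Lowercase, replace separators with spaces, split on whitespace."""
    return tuple(_SEPARATORS.sub(" ", text.lower()).split())


def is_pure_han(text: str) -> bool:
    """True when every character is a CJK unified ideograph."""
    return bool(text) and all(_HAN.fullmatch(ch) for ch in text)


def matches(term: str, value: str) -> bool:
    term, value = term.lower().strip(), value.lower().strip()
    if not term or not value:
        return False
    if term == value:
        return True
    # Han terms fall back to substring containment, because an unbroken
    # run of Han characters stays a single token.
    if is_pure_han(term) and _HAN.search(value):
        return term in value
    term_tokens, value_tokens = tokenize(term), tokenize(value)
    if not term_tokens or not value_tokens:
        return term in value
    n = len(term_tokens)
    # Sliding window: the term's tokens must appear as a contiguous run.
    return any(
        value_tokens[i : i + n] == term_tokens
        for i in range(len(value_tokens) - n + 1)
    )
```

With this sketch, `matches("l", "xl")` is rejected because whole tokens are compared, while `matches("卡其色", "卡其色棉")` is accepted through the Han substring branch.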
252 - def _find_first_text_match( 356 + # ------------------------------------------------------------------
  357 + # Dimension resolution (option slot + taxonomy values)
  358 + # ------------------------------------------------------------------
  359 + def _resolve_dimensions(
253 self, 360 self,
254 - skus: List[Dict[str, Any]],  
255 - resolved_dimensions: Dict[str, Optional[str]],  
256 - selection_context: _SelectionContext,  
257 - ) -> Optional[Tuple[str, str]]:  
258 - for sku in skus:  
259 - selection_parts: List[str] = []  
260 - seen_parts: set[str] = set()  
261 - matched = True  
262 -  
263 - for intent_type, field_name in resolved_dimensions.items():  
264 - if not field_name:  
265 - matched = False  
266 - break  
267 -  
268 - raw_value = str(sku.get(field_name) or "").strip()  
269 - normalized_value = self._normalize_cached(selection_context, raw_value)  
270 - if not self._is_text_match(  
271 - intent_type,  
272 - selection_context,  
273 - normalized_value=normalized_value,  
274 - ):  
275 - matched = False 361 + source: Dict[str, Any],
  362 + style_profile: StyleIntentProfile,
  363 + ) -> Dict[str, Optional[str]]:
  364 + option_fields = (
  365 + ("option1_value", source.get("option1_name")),
  366 + ("option2_value", source.get("option2_name")),
  367 + ("option3_value", source.get("option3_name")),
  368 + )
  369 + option_aliases = [
  370 + (field_name, normalize_query_text(name))
  371 + for field_name, name in option_fields
  372 + ]
  373 + resolved: Dict[str, Optional[str]] = {}
  374 + for intent in style_profile.intents:
  375 + if intent.intent_type in resolved:
  376 + continue
  377 + aliases = set(
  378 + intent.dimension_aliases
  379 + or self.registry.get_dimension_aliases(intent.intent_type)
  380 + )
  381 + matched_field: Optional[str] = None
  382 + for field_name, option_name in option_aliases:
  383 + if option_name and option_name in aliases:
  384 + matched_field = field_name
276 break 385 break
  386 + resolved[intent.intent_type] = matched_field
  387 + return resolved
277 388
278 - if raw_value and normalized_value not in seen_parts:  
279 - seen_parts.add(normalized_value)  
280 - selection_parts.append(raw_value) 389 + def _collect_taxonomy_values(
  390 + self,
  391 + source: Dict[str, Any],
  392 + style_profile: StyleIntentProfile,
  393 + ) -> Dict[str, Tuple[Tuple[str, str], ...]]:
  394 + """Extract SPU-level enriched_taxonomy_attributes values per intent dimension.
  395 +
  396 + Returns a mapping ``intent_type -> ((normalized, raw), ...)`` so the
  397 + selection layer can (a) match against ``normalized`` and (b) surface
  398 + the human-readable ``raw`` form in ``selected_text``.
  399 + """
  400 + attrs = source.get("enriched_taxonomy_attributes")
  401 + if not isinstance(attrs, list) or not attrs:
  402 + return {}
  403 + aliases_by_intent = {
  404 + intent.intent_type: set(
  405 + intent.dimension_aliases
  406 + or self.registry.get_dimension_aliases(intent.intent_type)
  407 + )
  408 + for intent in style_profile.intents
  409 + }
  410 + values_by_intent: Dict[str, List[Tuple[str, str]]] = {
  411 + t: [] for t in aliases_by_intent
  412 + }
  413 + for attr in attrs:
  414 + if not isinstance(attr, dict):
  415 + continue
  416 + attr_name = normalize_query_text(attr.get("name"))
  417 + if not attr_name:
  418 + continue
  419 + matching_intents = [
  420 + t for t, aliases in aliases_by_intent.items() if attr_name in aliases
  421 + ]
  422 + if not matching_intents:
  423 + continue
  424 + for raw_text in _iter_multilingual_texts(attr.get("value")):
  425 + raw = str(raw_text).strip()
  426 + if not raw:
  427 + continue
  428 + normalized = normalize_query_text(raw)
  429 + if not normalized:
  430 + continue
  431 + for intent_type in matching_intents:
  432 + bucket = values_by_intent[intent_type]
  433 + if not any(existing_norm == normalized for existing_norm, _ in bucket):
  434 + bucket.append((normalized, raw))
  435 + return {t: tuple(v) for t, v in values_by_intent.items() if v}
  436 +
  437 + # ------------------------------------------------------------------
  438 + # Image pick
  439 + # ------------------------------------------------------------------
  440 + @staticmethod
  441 + def _normalize_url(url: Any) -> str:
  442 + raw = str(url or "").strip()
  443 + if not raw:
  444 + return ""
  445 + # Accept protocol-relative URLs like "//cdn/..." or full URLs.
  446 + if raw.startswith("//"):
  447 + raw = "https:" + raw
  448 + try:
  449 + parts = urlsplit(raw)
  450 + except ValueError:
  451 + return raw.casefold()
  452 + host = (parts.netloc or "").casefold()
  453 + path = parts.path or ""
  454 + return f"{host}{path}".casefold()
  455 +
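Review note: the effect of `_normalize_url` is easier to see on concrete inputs. The standalone function below mirrors the method shown in the diff; the function name `normalize_url` and the example URLs are made up for illustration.

```python
from urllib.parse import urlsplit


def normalize_url(url: object) -> str:
    """Reduce a URL to a case-folded host + path for comparison."""
    raw = str(url or "").strip()
    if not raw:
        return ""
    # Protocol-relative URLs get a scheme so urlsplit can find the host.
    if raw.startswith("//"):
        raw = "https:" + raw
    try:
        parts = urlsplit(raw)
    except ValueError:
        return raw.casefold()
    # Scheme, query string and fragment are dropped on purpose.
    return f"{parts.netloc}{parts.path}".casefold()


# Three spellings of the same image compare equal after normalization.
a = normalize_url("//CDN.example.com/img/Red.jpg?v=2")
b = normalize_url("http://cdn.example.com/img/red.jpg#top")
c = normalize_url("https://cdn.example.com/img/red.jpg")
```

One consequence worth keeping in mind: because the path is case-folded as well, two images whose paths differ only by letter case would be treated as the same URL.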
  456 + def _pick_sku_by_image(
  457 + self,
  458 + hit: Dict[str, Any],
  459 + source: Dict[str, Any],
  460 + ) -> Optional[ImagePick]:
  461 + inner_hits = hit.get("inner_hits")
  462 + if not isinstance(inner_hits, dict):
  463 + return None
  464 + top_url: Optional[str] = None
  465 + top_score: Optional[float] = None
  466 + for key in _IMAGE_INNER_HITS_KEYS:
  467 + payload = inner_hits.get(key)
  468 + if not isinstance(payload, dict):
  469 + continue
  470 + hits_block = payload.get("hits")
  471 + inner_list = hits_block.get("hits") if isinstance(hits_block, dict) else None
  472 + if not isinstance(inner_list, list) or not inner_list:
  473 + continue
  474 + for entry in inner_list:
  475 + if not isinstance(entry, dict):
  476 + continue
  477 + url = (entry.get("_source") or {}).get("url")
  478 + if not url:
  479 + continue
  480 + try:
  481 + score = float(entry.get("_score") or 0.0)
  482 + except (TypeError, ValueError):
  483 + score = 0.0
  484 + if top_score is None or score > top_score:
  485 + top_url = str(url)
  486 + top_score = score
  487 + if top_url is not None:
  488 + break # Prefer the first listed inner_hits source (exact > approx).
  489 + if top_url is None:
  490 + return None
281 491
282 - if matched:  
283 - return str(sku.get("sku_id") or ""), " ".join(selection_parts).strip() 492 + skus = source.get("skus")
  493 + if not isinstance(skus, list):
  494 + return None
  495 + target = self._normalize_url(top_url)
  496 + for sku in skus:
  497 + sku_url = self._normalize_url(sku.get("image_src") or sku.get("imageSrc"))
  498 + if sku_url and sku_url == target:
  499 + return ImagePick(
  500 + sku_id=str(sku.get("sku_id") or ""),
  501 + url=top_url,
  502 + score=float(top_score or 0.0),
  503 + )
284 return None 504 return None
285 505
286 - def _select_for_source( 506 + # ------------------------------------------------------------------
  507 + # Unified per-hit selection
  508 + # ------------------------------------------------------------------
  509 + def _select(
287 self, 510 self,
288 - source: Dict[str, Any],  
289 *, 511 *,
290 - style_profile: StyleIntentProfile,  
291 - selection_context: _SelectionContext, 512 + source: Dict[str, Any],
  513 + style_profile: Optional[StyleIntentProfile],
  514 + selection_context: Optional[_SelectionContext],
  515 + image_pick: Optional[ImagePick],
292 ) -> Optional[SkuSelectionDecision]: 516 ) -> Optional[SkuSelectionDecision]:
293 skus = source.get("skus") 517 skus = source.get("skus")
294 if not isinstance(skus, list) or not skus: 518 if not isinstance(skus, list) or not skus:
295 return None 519 return None
296 520
297 - resolved_dimensions = self._resolve_dimensions(source, style_profile)  
298 - if not resolved_dimensions or any(not field_name for field_name in resolved_dimensions.values()):  
299 - return self._empty_decision(resolved_dimensions, matched_stage="unresolved") 521 + resolved_dimensions: Dict[str, Optional[str]] = {}
  522 + text_matched: List[Tuple[Dict[str, Any], Dict[str, Tuple[str, str]]]] = []
  523 +
  524 + if style_profile is not None and selection_context is not None:
  525 + resolved_dimensions = self._resolve_dimensions(source, style_profile)
  526 + taxonomy_values = self._collect_taxonomy_values(source, style_profile)
  527 + # Only attempt text match when there is at least one value source
  528 + # per intent (SKU option or SPU taxonomy).
  529 + if all(
  530 + resolved_dimensions.get(intent.intent_type) is not None
  531 + or taxonomy_values.get(intent.intent_type)
  532 + for intent in style_profile.intents
  533 + ):
  534 + text_matched = self._find_text_matched_skus(
  535 + skus=skus,
  536 + style_profile=style_profile,
  537 + resolved_dimensions=resolved_dimensions,
  538 + taxonomy_values=taxonomy_values,
  539 + ctx=selection_context,
  540 + )
  541 +
  542 + selected_sku_id: Optional[str] = None
  543 + selected_text = ""
  544 + final_source = "none"
  545 + matched_sources: Dict[str, str] = {}
  546 +
  547 + if text_matched:
  548 + chosen_sku, per_intent = self._choose_among_text_matched(
  549 + text_matched, image_pick
  550 + )
  551 + selected_sku_id = str(chosen_sku.get("sku_id") or "") or None
  552 + selected_text = self._text_from_matches(per_intent)
  553 + matched_sources = {
  554 + intent_type: src for intent_type, (src, _) in per_intent.items()
  555 + }
  556 + final_source = (
  557 + "taxonomy" if "taxonomy" in matched_sources.values() else "option"
  558 + )
  559 + elif image_pick is not None:
  560 + image_sku = self._find_sku_by_id(skus, image_pick.sku_id)
  561 + if image_sku is not None:
  562 + selected_sku_id = image_pick.sku_id or None
  563 + selected_text = self._build_selected_text(image_sku, resolved_dimensions)
  564 + final_source = "image"
300 565
301 - text_match = self._find_first_text_match(skus, resolved_dimensions, selection_context)  
302 - if text_match is None:  
303 - return self._empty_decision(resolved_dimensions, matched_stage="no_match")  
304 - return self._build_decision(  
305 - selected_sku_id=text_match[0],  
306 - selected_text=text_match[1], 566 + return SkuSelectionDecision(
  567 + selected_sku_id=selected_sku_id,
  568 + rerank_suffix=selected_text,
  569 + selected_text=selected_text,
  570 + final_source=final_source,
307 resolved_dimensions=resolved_dimensions, 571 resolved_dimensions=resolved_dimensions,
308 - matched_stage="text", 572 + matched_sources=matched_sources,
  573 + image_pick_sku_id=(image_pick.sku_id or None) if image_pick else None,
  574 + image_pick_url=image_pick.url if image_pick else None,
  575 + image_pick_score=image_pick.score if image_pick else None,
309 ) 576 )
310 577
311 - @staticmethod  
312 - def _build_decision(  
313 - selected_sku_id: str,  
314 - selected_text: str,  
315 - resolved_dimensions: Dict[str, Optional[str]], 578 + def _find_text_matched_skus(
  579 + self,
316 *, 580 *,
317 - matched_stage: str,  
318 - similarity_score: Optional[float] = None,  
319 - ) -> SkuSelectionDecision:  
320 - return SkuSelectionDecision(  
321 - selected_sku_id=selected_sku_id or None,  
322 - rerank_suffix=str(selected_text or "").strip(),  
323 - selected_text=str(selected_text or "").strip(),  
324 - matched_stage=matched_stage,  
325 - similarity_score=similarity_score,  
326 - resolved_dimensions=dict(resolved_dimensions),  
327 - ) 581 + skus: List[Dict[str, Any]],
  582 + style_profile: StyleIntentProfile,
  583 + resolved_dimensions: Dict[str, Optional[str]],
  584 + taxonomy_values: Dict[str, Tuple[Tuple[str, str], ...]],
  585 + ctx: _SelectionContext,
  586 + ) -> List[Tuple[Dict[str, Any], Dict[str, Tuple[str, str]]]]:
  587 + """Return every SKU that satisfies every active intent, with match meta.
  588 +
  589 + Authority rule per intent:
  590 + - If the SKU has a non-empty value on the resolved option slot, that
  591 + value ALONE decides the match (source = ``option``). Taxonomy cannot
  592 + override a contradicting SKU-level value.
  593 + - Only when the SKU has no own value on the dimension (slot unresolved
  594 + or value empty) does the SPU-level taxonomy serve as the fallback
  595 + value source (source = ``taxonomy``).
  596 +
  597 + For each matched SKU we also return a per-intent dict mapping
  598 + ``intent_type -> (source, raw_matched_text)`` so the final decision can
  599 + surface the genuinely matched string in ``selected_text`` /
  600 + ``rerank_suffix`` rather than, e.g., a SKU's unrelated option value.
  601 + """
  602 + matched: List[Tuple[Dict[str, Any], Dict[str, Tuple[str, str]]]] = []
  603 + for sku in skus:
  604 + per_intent: Dict[str, Tuple[str, str]] = {}
  605 + all_ok = True
  606 + for intent in style_profile.intents:
  607 + slot = resolved_dimensions.get(intent.intent_type)
  608 + sku_raw = str(sku.get(slot) or "").strip() if slot else ""
  609 + sku_norm = normalize_query_text(sku_raw) if sku_raw else ""
  610 +
  611 + if sku_norm:
  612 + if self._is_text_match(
  613 + intent.intent_type, ctx, normalized_value=sku_norm
  614 + ):
  615 + per_intent[intent.intent_type] = ("option", sku_raw)
  616 + else:
  617 + all_ok = False
  618 + break
  619 + else:
  620 + matched_raw: Optional[str] = None
  621 + for tax_norm, tax_raw in taxonomy_values.get(
  622 + intent.intent_type, ()
  623 + ):
  624 + if self._is_text_match(
  625 + intent.intent_type, ctx, normalized_value=tax_norm
  626 + ):
  627 + matched_raw = tax_raw
  628 + break
  629 + if matched_raw is None:
  630 + all_ok = False
  631 + break
  632 + per_intent[intent.intent_type] = ("taxonomy", matched_raw)
  633 + if all_ok:
  634 + matched.append((sku, per_intent))
  635 + return matched
  636 +
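Review note: the authority rule in the docstring above can be reduced to a single-intent sketch. This is an illustration under simplifying assumptions: `match_dimension` is a hypothetical name, and the matcher is plain case-insensitive containment rather than the token-boundary match used by `_is_text_match`.

```python
from typing import Dict, Optional, Sequence, Tuple


def match_dimension(
    sku: Dict[str, str],
    slot: Optional[str],
    taxonomy_values: Sequence[str],
    terms: Sequence[str],
) -> Optional[Tuple[str, str]]:
    """Return (source, matched_text) for one intent, or None if the SKU fails."""

    def hit(value: str) -> bool:
        # Simplified matcher (assumption): case-insensitive containment.
        return any(t.lower() in value.lower() for t in terms)

    own = str(sku.get(slot) or "").strip() if slot else ""
    if own:
        # The SKU's own option value alone decides; taxonomy is not consulted.
        return ("option", own) if hit(own) else None
    # No own value on this dimension: fall back to SPU-level taxonomy.
    for value in taxonomy_values:
        if hit(value):
            return ("taxonomy", value)
    return None
```

A white SKU under an SPU whose taxonomy says 卡其色 returns `None` here, which is the behavior the commit message describes as the fix for taxonomy matching a white SKU as khaki.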
  637 + @staticmethod
  638 + def _choose_among_text_matched(
  639 + text_matched: List[Tuple[Dict[str, Any], Dict[str, Tuple[str, str]]]],
  640 + image_pick: Optional[ImagePick],
  641 + ) -> Tuple[Dict[str, Any], Dict[str, Tuple[str, str]]]:
  642 + """Image-visual tie-break inside the text-matched set; else first match."""
  643 + if image_pick and image_pick.sku_id:
  644 + for sku, per_intent in text_matched:
  645 + if str(sku.get("sku_id") or "") == image_pick.sku_id:
  646 + return sku, per_intent
  647 + return text_matched[0]
  648 +
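Review note: the tie-break in `_choose_among_text_matched` only ever picks from SKUs that already passed text matching. A minimal sketch with simplified types (the function name `choose` and the sample SKU ids are hypothetical):

```python
from typing import Dict, List, Optional


def choose(text_matched: List[Dict[str, str]], image_sku_id: Optional[str]) -> Dict[str, str]:
    """Prefer the image-picked SKU when it is text-matched; else the first match.

    Assumes text_matched is non-empty, as the caller checks before calling.
    """
    if image_sku_id:
        for sku in text_matched:
            if str(sku.get("sku_id") or "") == image_sku_id:
                return sku
    return text_matched[0]


candidates = [{"sku_id": "a"}, {"sku_id": "b"}]
picked_by_image = choose(candidates, "b")        # image pick is in the set
picked_outside_set = choose(candidates, "zzz")   # image pick is not text-matched
picked_without_image = choose(candidates, None)  # no image signal at all
```

An image pick that points at a SKU outside the text-matched set is ignored, so the visual signal can reorder text matches but cannot override them.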
  649 + @staticmethod
  650 + def _text_from_matches(per_intent: Dict[str, Tuple[str, str]]) -> str:
  651 + """Join the genuinely matched raw strings in intent declaration order."""
  652 + parts: List[str] = []
  653 + seen: set[str] = set()
  654 + for _, raw in per_intent.values():
  655 + if raw and raw not in seen:
  656 + seen.add(raw)
  657 + parts.append(raw)
  658 + return " ".join(parts).strip()
328 659
329 @staticmethod 660 @staticmethod
330 - def _apply_decision_to_source(source: Dict[str, Any], decision: SkuSelectionDecision) -> None: 661 + def _find_sku_by_id(
  662 + skus: List[Dict[str, Any]], sku_id: Optional[str]
  663 + ) -> Optional[Dict[str, Any]]:
  664 + if not sku_id:
  665 + return None
  666 + for sku in skus:
  667 + if str(sku.get("sku_id") or "") == sku_id:
  668 + return sku
  669 + return None
  670 +
  671 + @staticmethod
  672 + def _build_selected_text(
  673 + sku: Dict[str, Any],
  674 + resolved_dimensions: Dict[str, Optional[str]],
  675 + ) -> str:
  676 + """Text carried into rerank doc suffix: joined raw values on the resolved slots."""
  677 + parts: List[str] = []
  678 + seen: set[str] = set()
  679 + for slot in resolved_dimensions.values():
  680 + if not slot:
  681 + continue
  682 + raw = str(sku.get(slot) or "").strip()
  683 + if raw and raw not in seen:
  684 + seen.add(raw)
  685 + parts.append(raw)
  686 + return " ".join(parts).strip()
  687 +
  688 + # ------------------------------------------------------------------
  689 + # Source mutation (applied after page fill)
  690 + # ------------------------------------------------------------------
  691 + @staticmethod
  692 + def _apply_decision_to_source(
  693 + source: Dict[str, Any], decision: SkuSelectionDecision
  694 + ) -> None:
  695 + if not decision.selected_sku_id:
  696 + return
331 skus = source.get("skus") 697 skus = source.get("skus")
332 - if not isinstance(skus, list) or not skus or not decision.selected_sku_id: 698 + if not isinstance(skus, list) or not skus:
333 return 699 return
334 -  
335 - selected_index = None 700 + selected_index: Optional[int] = None
336 for index, sku in enumerate(skus): 701 for index, sku in enumerate(skus):
337 if str(sku.get("sku_id") or "") == decision.selected_sku_id: 702 if str(sku.get("sku_id") or "") == decision.selected_sku_id:
338 selected_index = index 703 selected_index = index
339 break 704 break
340 if selected_index is None: 705 if selected_index is None:
341 return 706 return
342 -  
343 selected_sku = skus.pop(selected_index) 707 selected_sku = skus.pop(selected_index)
344 skus.insert(0, selected_sku) 708 skus.insert(0, selected_sku)
345 -  
346 image_src = selected_sku.get("image_src") or selected_sku.get("imageSrc") 709 image_src = selected_sku.get("image_src") or selected_sku.get("imageSrc")
347 if image_src: 710 if image_src:
348 source["image_url"] = image_src 711 source["image_url"] = image_src
  712 +
  713 +
  714 +def _iter_multilingual_texts(value: Any) -> List[str]:
  715 + """Flatten a value that may be str, list, or multilingual dict {zh, en, ...}."""
  716 + if value is None:
  717 + return []
  718 + if isinstance(value, str):
  719 + return [value] if value.strip() else []
  720 + if isinstance(value, dict):
  721 + out: List[str] = []
  722 + for v in value.values():
  723 + out.extend(_iter_multilingual_texts(v))
  724 + return out
  725 + if isinstance(value, (list, tuple)):
  726 + out = []
  727 + for v in value:
  728 + out.extend(_iter_multilingual_texts(v))
  729 + return out
  730 + return []
tests/test_search_rerank_window.py
@@ -231,19 +231,6 @@ def _build_searcher(config: SearchConfig, es_client: _FakeESClient) -> Searcher: @@ -231,19 +231,6 @@ def _build_searcher(config: SearchConfig, es_client: _FakeESClient) -> Searcher:
231 return searcher 231 return searcher
232 232
233 233
234 -class _FakeTextEncoder:  
235 - def __init__(self, vectors: Dict[str, List[float]]):  
236 - self.vectors = {  
237 - key: np.array(value, dtype=np.float32)  
238 - for key, value in vectors.items()  
239 - }  
240 -  
241 - def encode(self, sentences, priority: int = 0, **kwargs):  
242 - if isinstance(sentences, str):  
243 - sentences = [sentences]  
244 - return np.array([self.vectors[text] for text in sentences], dtype=object)  
245 -  
246 -  
247 def test_config_loader_rerank_enabled_defaults_true(tmp_path: Path): 234 def test_config_loader_rerank_enabled_defaults_true(tmp_path: Path):
248 config_data = { 235 config_data = {
249 "es_index_name": "test_products", 236 "es_index_name": "test_products",
@@ -611,7 +598,14 @@ def test_searcher_rerank_prefetch_source_includes_sku_fields_when_style_intent_a @@ -611,7 +598,14 @@ def test_searcher_rerank_prefetch_source_includes_sku_fields_when_style_intent_a
611 598
612 assert es_client.calls[0]["body"]["_source"] is False 599 assert es_client.calls[0]["body"]["_source"] is False
613 assert es_client.calls[1]["body"]["_source"] == { 600 assert es_client.calls[1]["body"]["_source"] == {
614 - "includes": ["option1_name", "option2_name", "option3_name", "skus", "title"] 601 + "includes": [
  602 + "enriched_taxonomy_attributes",
  603 + "option1_name",
  604 + "option2_name",
  605 + "option3_name",
  606 + "skus",
  607 + "title",
  608 + ]
615 } 609 }
616 610
617 611
@@ -944,78 +938,6 @@ def test_searcher_skips_sku_selection_when_option_name_does_not_match_dimension_ @@ -944,78 +938,6 @@ def test_searcher_skips_sku_selection_when_option_name_does_not_match_dimension_
944 assert result.results[0].image_url == "https://img/default.jpg" 938 assert result.results[0].image_url == "https://img/default.jpg"
945 939
946 940
947 -def test_searcher_promotes_sku_by_embedding_when_query_has_no_direct_option_match(monkeypatch):  
948 - es_client = _FakeESClient(total_hits=1)  
949 - searcher = _build_searcher(_build_search_config(rerank_enabled=False), es_client)  
950 - context = create_request_context(reqid="sku-embed", uid="u-sku-embed")  
951 -  
952 - monkeypatch.setattr(  
953 - "search.searcher.get_tenant_config_loader",  
954 - lambda: SimpleNamespace(get_tenant_config=lambda tenant_id: {"index_languages": ["en"]}),  
955 - )  
956 -  
957 - encoder = _FakeTextEncoder(  
958 - {  
959 - "linen summer dress": [0.8, 0.2],  
960 - "red": [1.0, 0.0],  
961 - "blue": [0.0, 1.0],  
962 - }  
963 - )  
964 -  
965 - class _EmbeddingQueryParser:  
966 - text_encoder = encoder  
967 -  
968 - def parse(  
969 - self,  
970 - query: str,  
971 - tenant_id: str,  
972 - generate_vector: bool,  
973 - context: Any,  
974 - target_languages: Any = None,  
975 - ):  
976 - return _FakeParsedQuery(  
977 - original_query=query,  
978 - query_normalized=query,  
979 - rewritten_query=query,  
980 - translations={},  
981 - query_vector=np.array([0.0, 1.0], dtype=np.float32),  
982 - style_intent_profile=_build_style_intent_profile(  
983 - "color", "blue", "color", "colors", "颜色"  
984 - ),  
985 - )  
986 -  
987 - searcher.query_parser = _EmbeddingQueryParser()  
988 -  
989 - def _full_source_with_skus(doc_id: str) -> Dict[str, Any]:  
990 - return {  
991 - "spu_id": doc_id,  
992 - "title": {"en": f"product-{doc_id}"},  
993 - "brief": {"en": f"brief-{doc_id}"},  
994 - "vendor": {"en": f"vendor-{doc_id}"},  
995 - "option1_name": "Color",  
996 - "image_url": "https://img/default.jpg",  
997 - "skus": [  
998 - {"sku_id": "sku-red", "option1_value": "Red", "image_src": "https://img/red.jpg"},  
999 - {"sku_id": "sku-blue", "option1_value": "Blue", "image_src": "https://img/blue.jpg"},  
1000 - ],  
1001 - }  
1002 -  
1003 - monkeypatch.setattr(_FakeESClient, "_full_source", staticmethod(_full_source_with_skus))  
1004 -  
1005 - result = searcher.search(  
1006 - query="linen summer dress",  
1007 - tenant_id="162",  
1008 - from_=0,  
1009 - size=1,  
1010 - context=context,  
1011 - enable_rerank=False,  
1012 - )  
1013 -  
1014 - assert len(result.results) == 1  
1015 - assert result.results[0].skus[0].sku_id == "sku-blue"  
1016 - assert result.results[0].image_url == "https://img/blue.jpg"  
1017 -  
1018 -  
1019 def test_searcher_debug_info_uses_initial_es_max_score_for_normalization(monkeypatch): 941 def test_searcher_debug_info_uses_initial_es_max_score_for_normalization(monkeypatch):
1020 es_client = _FakeESClient(total_hits=3) 942 es_client = _FakeESClient(total_hits=3)
1021 cfg = _build_search_config(rerank_enabled=False) 943 cfg = _build_search_config(rerank_enabled=False)
tests/test_sku_intent_selector.py
1 from types import SimpleNamespace 1 from types import SimpleNamespace
2 2
  3 +import pytest
  4 +
3 from config import QueryConfig 5 from config import QueryConfig
4 from query.style_intent import DetectedStyleIntent, StyleIntentProfile, StyleIntentRegistry 6 from query.style_intent import DetectedStyleIntent, StyleIntentProfile, StyleIntentRegistry
5 from search.sku_intent_selector import StyleSkuSelector 7 from search.sku_intent_selector import StyleSkuSelector
@@ -57,7 +59,9 @@ def test_style_sku_selector_matches_first_sku_by_attribute_terms(): @@ -57,7 +59,9 @@ def test_style_sku_selector_matches_first_sku_by_attribute_terms():
57 59
58 assert decision.selected_sku_id == "2" 60 assert decision.selected_sku_id == "2"
59 assert decision.selected_text == "Navy Blue X-Large" 61 assert decision.selected_text == "Navy Blue X-Large"
60 - assert decision.matched_stage == "text" 62 + assert decision.final_source == "option"
  63 + assert decision.matched_sources == {"color": "option", "size": "option"}
  64 + assert decision.matched_stage == "option" # back-compat alias
61 65
62 selector.apply_precomputed_decisions(hits, decisions) 66 selector.apply_precomputed_decisions(hits, decisions)
63 67
@@ -103,7 +107,7 @@ def test_style_sku_selector_returns_no_match_without_attribute_contains(): @@ -103,7 +107,7 @@ def test_style_sku_selector_returns_no_match_without_attribute_contains():
103 decisions = selector.prepare_hits(hits, parsed_query) 107 decisions = selector.prepare_hits(hits, parsed_query)
104 108
105 assert decisions["spu-1"].selected_sku_id is None 109 assert decisions["spu-1"].selected_sku_id is None
106 - assert decisions["spu-1"].matched_stage == "no_match" 110 + assert decisions["spu-1"].final_source == "none"
107 111
108 112
109 def test_is_text_match_uses_token_boundaries_for_sizes(): 113 def test_is_text_match_uses_token_boundaries_for_sizes():
@@ -195,3 +199,341 @@ def test_is_text_match_handles_punctuation_and_descriptive_attribute_values(): @@ -195,3 +199,341 @@ def test_is_text_match_handles_punctuation_and_descriptive_attribute_values():
195 assert selector._is_text_match("style", selection_context, normalized_value="off-white/lined") 199 assert selector._is_text_match("style", selection_context, normalized_value="off-white/lined")
196 assert selector._is_text_match("accessory", selection_context, normalized_value="army green + headscarf") 200 assert selector._is_text_match("accessory", selection_context, normalized_value="army green + headscarf")
197 assert selector._is_text_match("size", selection_context, normalized_value="2xl recommended 65-70kg") 201 assert selector._is_text_match("size", selection_context, normalized_value="2xl recommended 65-70kg")
  202 +
  203 +
  204 +def _khaki_intent() -> DetectedStyleIntent:
  205 + """Mirrors what StyleIntentDetector now emits (all_terms union of zh/en/attribute)."""
  206 + return DetectedStyleIntent(
  207 + intent_type="color",
  208 + canonical_value="beige",
  209 + matched_term="卡其色",
  210 + matched_query_text="卡其色",
  211 + attribute_terms=("beige", "khaki"),
  212 + dimension_aliases=("color", "颜色"),
  213 + all_terms=("米色", "卡其色", "beige", "khaki"),
  214 + )
  215 +
  216 +
  217 +def _color_registry() -> StyleIntentRegistry:
  218 + return StyleIntentRegistry.from_query_config(
  219 + QueryConfig(
  220 + style_intent_terms={
  221 + "color": [
  222 + {
  223 + "en_terms": ["beige", "khaki"],
  224 + "zh_terms": ["米色", "卡其色"],
  225 + "attribute_terms": ["beige", "khaki"],
  226 + }
  227 + ],
  228 + },
  229 + style_intent_dimension_aliases={"color": ["color", "颜色"]},
  230 + )
  231 + )
  232 +
  233 +
  234 +def test_zh_color_intent_matches_noisy_option_value():
  235 + """Query 卡其色裙子: a SKU whose option1_value starts with "卡其色" but carries a suffix such as V-neck should still match."""
  236 + selector = StyleSkuSelector(_color_registry())
  237 + parsed_query = SimpleNamespace(
  238 + style_intent_profile=StyleIntentProfile(intents=(_khaki_intent(),))
  239 + )
  240 + hits = [
  241 + {
  242 + "_id": "spu-1",
  243 + "_source": {
  244 + "option1_name": "颜色",
  245 + "skus": [
  246 + {"sku_id": "1", "option1_value": "黑色长裙"},
  247 + {"sku_id": "2", "option1_value": "卡其色v领收腰长裙【常规款】"},
  248 + ],
  249 + },
  250 + }
  251 + ]
  252 + decisions = selector.prepare_hits(hits, parsed_query)
  253 + assert decisions["spu-1"].selected_sku_id == "2"
  254 + assert decisions["spu-1"].final_source == "option"
  255 +
  256 +
  257 +@pytest.mark.parametrize(
  258 + "option_value",
  259 + [
  260 + "卡其色(无内衬)",
  261 + "卡其色(无内衬)",
  262 + "卡其色【常规款】",
  263 + "卡其色/常规款",
  264 + "卡其色·无内衬",
  265 + "卡其色 - 常规",
  266 + "卡其色,常规",
  267 + "卡其色|常规",
  268 + "卡其色—加厚",
  269 + ],
  270 +)
  271 +def test_zh_color_intent_matches_various_brackets_and_separators(option_value: str):
  272 + selector = StyleSkuSelector(_color_registry())
  273 + parsed_query = SimpleNamespace(
  274 + style_intent_profile=StyleIntentProfile(intents=(_khaki_intent(),))
  275 + )
  276 + hits = [
  277 + {
  278 + "_id": "spu-1",
  279 + "_source": {
  280 + "option1_name": "颜色",
  281 + "skus": [
  282 + {"sku_id": "441670", "option1_value": "白色(无内衬)"},
  283 + {"sku_id": "441679", "option1_value": option_value},
  284 + ],
  285 + },
  286 + }
  287 + ]
  288 + assert selector.prepare_hits(hits, parsed_query)["spu-1"].selected_sku_id == "441679"
  289 +
  290 +
  291 +def test_zh_color_intent_matches_noisy_option_value_with_fullwidth_parens():
  292 + """卡其色(无内衬) is a real-world reproduction of the earlier taxonomy-override bug; the option branch must now match."""
  293 + selector = StyleSkuSelector(_color_registry())
  294 + parsed_query = SimpleNamespace(
  295 + style_intent_profile=StyleIntentProfile(intents=(_khaki_intent(),))
  296 + )
  297 + hits = [
  298 + {
  299 + "_id": "spu-1",
  300 + "_source": {
  301 + "option1_name": "颜色",
  302 + # Even if SPU-level taxonomy existed, white SKU must NOT leak in.
  303 + "enriched_taxonomy_attributes": [
  304 + {"name": "Color", "value": {"zh": "卡其色"}}
  305 + ],
  306 + "skus": [
  307 + {"sku_id": "441670", "option1_value": "白色(无内衬)"},
  308 + {"sku_id": "441679", "option1_value": "卡其色(无内衬)"},
  309 + ],
  310 + },
  311 + }
  312 + ]
  313 + decisions = selector.prepare_hits(hits, parsed_query)
  314 + d = decisions["spu-1"]
  315 + assert d.selected_sku_id == "441679"
  316 + assert d.selected_text == "卡其色(无内衬)"
  317 + assert d.final_source == "option"
  318 + assert d.matched_sources == {"color": "option"}
  319 +
  320 +
  321 +def test_taxonomy_attribute_extends_text_matching_source():
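  322 + """Even when optionN cannot distinguish SKUs, the Color entry in enriched_taxonomy_attributes lets every SKU of the SPU pass text
  323 + matching; the image pick (if any) then decides the SKU, otherwise the first one is taken."""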
  324 + selector = StyleSkuSelector(_color_registry())
  325 + parsed_query = SimpleNamespace(
  326 + style_intent_profile=StyleIntentProfile(intents=(_khaki_intent(),))
  327 + )
  328 + hits = [
  329 + {
  330 + "_id": "spu-1",
  331 + "_source": {
  332 + "option1_name": "Style", # unrelated dimension → slot unresolved
  333 + "enriched_taxonomy_attributes": [
  334 + {"name": "Color", "value": {"zh": "卡其色", "en": "khaki"}}
  335 + ],
  336 + "skus": [
  337 + {"sku_id": "a", "option1_value": "A"},
  338 + {"sku_id": "b", "option1_value": "B"},
  339 + ],
  340 + },
  341 + }
  342 + ]
  343 + decisions = selector.prepare_hits(hits, parsed_query)
  344 + # Taxonomy matches → both SKUs text-matched; no image pick → first one wins.
  345 + d = decisions["spu-1"]
  346 + assert d.selected_sku_id == "a"
  347 + assert d.final_source == "taxonomy"
  348 + # selected_text reflects the real matched taxonomy value, not SKU's unrelated option.
  349 + assert d.selected_text == "卡其色"
  350 + assert d.matched_sources == {"color": "taxonomy"}
  351 +
  352 +
  353 +def test_taxonomy_does_not_override_contradicting_sku_option_value():
  354 + """SPU-level taxonomy says "卡其色", but the SKU's own option1_value is "白色(无内衬)";
  355 + that SKU must not count as a text match, so an SPU-level signal cannot promote a wrong-colored SKU."""
  356 + selector = StyleSkuSelector(_color_registry())
  357 + parsed_query = SimpleNamespace(
  358 + style_intent_profile=StyleIntentProfile(intents=(_khaki_intent(),))
  359 + )
  360 + hits = [
  361 + {
  362 + "_id": "spu-1",
  363 + "_source": {
  364 + "option1_name": "颜色",
  365 + "enriched_taxonomy_attributes": [
  366 + {"name": "Color", "value": {"zh": "卡其色", "en": "khaki"}}
  367 + ],
  368 + "skus": [
  369 + {"sku_id": "white", "option1_value": "白色(无内衬)"},
  370 + {"sku_id": "khaki", "option1_value": "卡其色棉"},
  371 + ],
  372 + },
  373 + }
  374 + ]
  375 + decisions = selector.prepare_hits(hits, parsed_query)
  376 + # Only khaki matches on its own option value; taxonomy does not promote white.
  377 + assert decisions["spu-1"].selected_sku_id == "khaki"
  378 + assert decisions["spu-1"].final_source == "option"
  379 +
  380 +
  381 +def test_taxonomy_fills_in_only_when_sku_self_value_is_empty():
  382 + """Mixed case: SKU 1 has no option1_value → taxonomy takes over; SKU 2 carries its own 白色 → no match."""
  383 + selector = StyleSkuSelector(_color_registry())
  384 + parsed_query = SimpleNamespace(
  385 + style_intent_profile=StyleIntentProfile(intents=(_khaki_intent(),))
  386 + )
  387 + hits = [
  388 + {
  389 + "_id": "spu-1",
  390 + "_source": {
  391 + "option1_name": "颜色",
  392 + "enriched_taxonomy_attributes": [
  393 + {"name": "Color", "value": {"zh": "卡其色"}}
  394 + ],
  395 + "skus": [
  396 + {"sku_id": "no-value", "option1_value": ""},
  397 + {"sku_id": "white", "option1_value": "白色"},
  398 + ],
  399 + },
  400 + }
  401 + ]
  402 + decisions = selector.prepare_hits(hits, parsed_query)
  403 + d = decisions["spu-1"]
  404 + assert d.selected_sku_id == "no-value"
  405 + assert d.final_source == "taxonomy"
  406 + assert d.selected_text == "卡其色"
  407 +
  408 +
  409 +def test_image_pick_serves_as_visual_tiebreak_within_text_matched():
  410 + selector = StyleSkuSelector(_color_registry())
  411 + parsed_query = SimpleNamespace(
  412 + style_intent_profile=StyleIntentProfile(intents=(_khaki_intent(),))
  413 + )
  414 + hits = [
  415 + {
  416 + "_id": "spu-1",
  417 + "_source": {
  418 + "option1_name": "颜色",
  419 + "skus": [
  420 + {
  421 + "sku_id": "khaki-cotton",
  422 + "option1_value": "卡其色棉",
  423 + "image_src": "https://cdn/x/khaki-cotton.jpg",
  424 + },
  425 + {
  426 + "sku_id": "khaki-linen",
  427 + "option1_value": "卡其色麻",
  428 + "image_src": "https://cdn/x/khaki-linen.jpg",
  429 + },
  430 + ],
  431 + },
  432 + "inner_hits": {
  433 + "exact_image_knn_query_hits": {
  434 + "hits": {
  435 + "hits": [
  436 + {
  437 + "_score": 0.87,
  438 + "_source": {"url": "https://cdn/x/khaki-linen.jpg"},
  439 + }
  440 + ]
  441 + }
  442 + }
  443 + },
  444 + }
  445 + ]
  446 + decisions = selector.prepare_hits(hits, parsed_query)
  447 + decision = decisions["spu-1"]
  448 + assert decision.selected_sku_id == "khaki-linen"
  449 + assert decision.final_source == "option"
  450 + assert decision.image_pick_sku_id == "khaki-linen"
  451 + assert decision.image_pick_score == 0.87
  452 +
  453 +
  454 +def test_image_only_selection_when_no_style_intent():
  455 + """No style intent: the SKU nearest by image_embedding alone is promoted straight to the top."""
  456 + selector = StyleSkuSelector(_color_registry())
  457 + parsed_query = SimpleNamespace(style_intent_profile=None)
  458 + hits = [
  459 + {
  460 + "_id": "spu-1",
  461 + "_source": {
  462 + "option1_name": "Color",
  463 + "image_url": "https://cdn/x/default.jpg",
  464 + "skus": [
  465 + {
  466 + "sku_id": "red",
  467 + "option1_value": "Red",
  468 + "image_src": "https://cdn/x/red.jpg",
  469 + },
  470 + {
  471 + "sku_id": "blue",
  472 + "option1_value": "Blue",
  473 + "image_src": "https://cdn/x/blue.jpg",
  474 + },
  475 + ],
  476 + },
  477 + "inner_hits": {
  478 + "image_knn_query_hits": {
  479 + "hits": {
  480 + "hits": [
  481 + {"_score": 0.74, "_source": {"url": "https://cdn/x/blue.jpg"}}
  482 + ]
  483 + }
  484 + }
  485 + },
  486 + }
  487 + ]
  488 + decisions = selector.prepare_hits(hits, parsed_query)
  489 + decision = decisions["spu-1"]
  490 + assert decision.selected_sku_id == "blue"
  491 + assert decision.final_source == "image"
  492 +
  493 + selector.apply_precomputed_decisions(hits, decisions)
  494 + source = hits[0]["_source"]
  495 + assert source["skus"][0]["sku_id"] == "blue"
  496 + assert source["image_url"] == "https://cdn/x/blue.jpg"
  497 +
  498 +
  499 +def test_image_pick_ignored_when_text_matches_but_visual_url_not_in_text_set():
  500 + """Text matches take priority: an image pick that lands on a SKU outside the text-matched set does not take over."""
  501 + selector = StyleSkuSelector(_color_registry())
  502 + parsed_query = SimpleNamespace(
  503 + style_intent_profile=StyleIntentProfile(intents=(_khaki_intent(),))
  504 + )
  505 + hits = [
  506 + {
  507 + "_id": "spu-1",
  508 + "_source": {
  509 + "option1_name": "颜色",
  510 + "skus": [
  511 + {
  512 + "sku_id": "khaki",
  513 + "option1_value": "卡其色",
  514 + "image_src": "https://cdn/x/khaki.jpg",
  515 + },
  516 + {
  517 + "sku_id": "black",
  518 + "option1_value": "黑色",
  519 + "image_src": "https://cdn/x/black.jpg",
  520 + },
  521 + ],
  522 + },
  523 + "inner_hits": {
  524 + "exact_image_knn_query_hits": {
  525 + "hits": {
  526 + "hits": [
  527 + {"_score": 0.9, "_source": {"url": "https://cdn/x/black.jpg"}}
  528 + ]
  529 + }
  530 + }
  531 + },
  532 + }
  533 + ]
  534 + decisions = selector.prepare_hits(hits, parsed_query)
  535 + decision = decisions["spu-1"]
  536 + # Hard text-first: khaki stays selected even though the image pick pointed at black.
  537 + assert decision.selected_sku_id == "khaki"
  538 + assert decision.final_source == "option"
  539 + assert decision.image_pick_sku_id == "black"