Commit 5c9baf9193552091e5714709721f6ddfa6c3d4a8 (1 parent: d6c29734)
feat(search): unified SKU selection under style intent (option/taxonomy/image) and stronger attribute-value matching
## Key capabilities
- Pre-compute SKU decisions for hits inside the rerank window: style intent (multi-source synonyms) plus image KNN inner_hits URLs aligned with SKU.image_src; one unified decision per hit, no cascading fallback.
- Distinguish text-evidence strength: final_source ∈ {option, taxonomy, image, none}; matched_sources records option or taxonomy per intent; selected_text / rerank_suffix carry the actual matched snippet (the raw SKU option text or the raw taxonomy value).
- Authority rule: when a SKU has a non-empty option value on a resolved dimension, only that value participates in matching; SPU-level enriched_taxonomy_attributes never override a contradicting SKU-level value (fixes "taxonomy matched a white SKU as khaki").
- Image: the nested image KNN / exact rescore clauses now request inner_hits (url), used as a visual tie-break when promoting a SKU (only within the text-matched set), or as a pure image-based pick when no intent is active.
- Query side: DetectedStyleIntent gains all_terms (union of zh + en + attribute terms), so attribute-value matching uses the same vocabulary as the intent lexicon.
- API: SpuResult now exposes enriched_attributes / enriched_taxonomy_attributes (previously these ES fields were dropped by Pydantic).
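The unified decision rule described above can be sketched as follows. This is a simplified illustration with hypothetical names and labels; the shipped selector additionally distinguishes option vs. taxonomy text evidence when setting final_source.

```python
from typing import Optional, Tuple


def decide(text_matched: Tuple[str, ...], image_pick: Optional[str]) -> Tuple[Optional[str], str]:
    """One-pass SKU decision, no cascading fallback stages.

    text_matched: SKU ids that satisfied every active style intent (via option/taxonomy).
    image_pick:   SKU id whose image_src equals the top-scoring inner_hits url, if any.
    Returns (selected_sku_id, illustrative source label).
    """
    if text_matched:
        # Image acts only as a visual tie-breaker inside the text-matched set.
        if image_pick in text_matched:
            return image_pick, "text+image"
        return text_matched[0], "text"
    if image_pick is not None:
        # No text evidence at all: pure image-based promotion.
        return image_pick, "image"
    return None, "none"
```

Note that the image pick never overrides the text-matched set; it only breaks ties within it.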
## Attribute-value matching (brackets and separators)
- Before tokenization, the normalized option/taxonomy string is passed through _with_segment_boundaries_for_matching: full- and half-width brackets, slashes, the enumeration comma (、), Chinese and ASCII punctuation, middle dots, and the common dash variants are replaced with spaces, followed by simple_tokenize and a sliding window; an undivided run of Chinese characters still falls back to pure-CJK substring matching (e.g. 卡其色棉).
- Parametrized tests cover a range of bracket styles and common e-commerce separator conventions.
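A minimal sketch of the boundary-splitting match, assuming a reduced separator set (the shipped _ATTRIBUTE_BOUNDARY_RE covers more punctuation classes, and the real code tokenizes with simple_tokenize plus a sliding window rather than splitting on spaces):

```python
import re

# Reduced boundary set for illustration; the real pattern also handles CJK
# book-title marks, wave dashes, zero-width characters, etc.
_BOUNDARY_RE = re.compile(r"[\s/\\()()\[\]【】、,,;;·\-]+")


def with_segment_boundaries(value: str) -> str:
    """Replace brackets/separators with spaces and collapse whitespace."""
    return " ".join(_BOUNDARY_RE.sub(" ", value).split())


def matches_term(term: str, value: str) -> bool:
    """Term matches if it equals a segment, or by substring for a pure-CJK run."""
    segments = with_segment_boundaries(value).split(" ")
    if term in segments:
        return True
    # Pure-CJK fallback: an undivided run like 卡其色棉 still matches by substring.
    return bool(re.fullmatch(r"[\u4e00-\u9fff]+", value)) and term in value
```

With this, 卡其色 matches 卡其色(无内衬), 卡其色/常规, and 卡其色棉, while 白色 does not.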
## Orchestration and configuration
- searcher: _should_run_sku_selection = style intent active OR image_query_vector present; the prefetch _source now includes skus, the option names, and enriched_taxonomy_attributes.
- es_query_builder: the nested clauses for image knn / exact image rescore carry inner_hits.
## Tests and repository
- tests/test_sku_intent_selector.py and tests/test_search_rerank_window.py updated; removed the now-obsolete embedding-fallback integration assertions.
- .gitignore: ignore artifacts/search_evaluation/datasets/ (large local evaluation datasets; avoids accidental commits).
Made-with: Cursor
11 changed files with 969 additions and 256 deletions
.gitignore
```diff
@@ -82,3 +82,4 @@ model_cache/
 artifacts/search_evaluation/*.sqlite3
 artifacts/search_evaluation/batch_reports/
 artifacts/search_evaluation/tuning_runs/
+artifacts/search_evaluation/datasets/
```
api/models.py
```diff
@@ -276,6 +276,14 @@ class SpuResult(BaseModel):
         None,
         description="规格列表(与 ES specifications 字段对应)"
     )
+    enriched_attributes: Optional[List[Dict[str, Any]]] = Field(
+        None,
+        description="LLM 富化属性(ES enriched_attributes 字段)"
+    )
+    enriched_taxonomy_attributes: Optional[List[Dict[str, Any]]] = Field(
+        None,
+        description="类目体系化属性(ES enriched_taxonomy_attributes 字段,例如 Color/Material)"
+    )
     skus: List[SkuResult] = Field(default_factory=list, description="SKU列表")
     relevance_score: float = Field(..., ge=0.0, description="相关性分数(ES原始分数)")

```
api/result_formatter.py
```diff
@@ -150,6 +150,8 @@ class ResultFormatter:
             option2_name=source.get('option2_name'),
             option3_name=source.get('option3_name'),
             specifications=source.get('specifications'),
+            enriched_attributes=source.get('enriched_attributes'),
+            enriched_taxonomy_attributes=source.get('enriched_taxonomy_attributes'),
             skus=skus,
             relevance_score=relevance_score
         )
```
config/config.yaml
```diff
@@ -224,8 +224,8 @@ query_config:
       # - keywords
       # - qanchors
       # - enriched_tags
-      # - enriched_attributes
-      # - # enriched_taxonomy_attributes.value
+      - enriched_attributes
+      - enriched_taxonomy_attributes

       - min_price
       - compare_at_price
```
docs/搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md
```diff
@@ -275,7 +275,7 @@ curl "http://localhost:6007/health"
 | `target_lang` | string | Y | 目标语言:`zh`、`en`、`ru` 等 |
 | `source_lang` | string | N | 源语言。云端模型可不传;`nllb-200-distilled-600m` 建议显式传入 |
 | `model` | string | N | 已启用 capability 名称,如 `qwen-mt`、`llm`、`deepl`、`nllb-200-distilled-600m`、`opus-mt-zh-en`、`opus-mt-en-zh` |
-| `scene` | string | N | 翻译场景参数,与 `model` 配套使用;当前标准值为 `sku_name`、`ecommerce_search_query`、`general` |
+| `scene` | string | N | 翻译场景参数,与 `model` 配套使用;当前标准值为 `sku_name`、`ecommerce_search_query`、`sku_attribute`、`general` |

 说明:
 - 外部接口不接受 `prompt`;LLM prompt 由服务端按 `scene` 自动生成。
@@ -287,7 +287,7 @@ curl "http://localhost:6007/health"
 - 如果是en-zh互译、期待更高的速度,可以考虑`opus-mt-zh-en` / `opus-mt-en-zh`。(质量未详细评测,一些文章说比blib-200-600m更好,但是我看了些case感觉要差不少)

 **实时翻译选型建议**:
-- 在线 query 翻译如果只是 `en/zh` 互译,极致要求耗时使用 `opus-mt-zh-en / opus-mt-en-zh`,`nllb-200-distilled-600m`支持多语言,效果略好一点,但是耗时长很多(70-150ms之间)
+- 在线 query 翻译如果只是 `en/zh` 互译,极致要求耗时使用 `opus-mt-zh-en / opus-mt-en-zh`,`nllb-200-distilled-600m`支持多语言,效果略好一点,但是耗时长很多(120-190ms左右)
 - 如果涉及其他语言,或对质量要求高于本地轻量模型,优先考虑 `deepl`。

 **Batch Size / 调用方式建议**:
```
query/style_intent.py
```diff
@@ -134,6 +134,10 @@ class DetectedStyleIntent:
     matched_query_text: str
     attribute_terms: Tuple[str, ...]
     dimension_aliases: Tuple[str, ...]
+    # Union of zh_terms + en_terms + attribute_terms for the matched term definition.
+    # Downstream SKU-selection treats every entry as a valid attribute-value match candidate
+    # so a Chinese user query like "卡其色" can match a Chinese option value "卡其色裙".
+    all_terms: Tuple[str, ...] = ()

     def to_dict(self) -> Dict[str, Any]:
         return {
@@ -143,8 +147,14 @@ class DetectedStyleIntent:
             "matched_query_text": self.matched_query_text,
             "attribute_terms": list(self.attribute_terms),
             "dimension_aliases": list(self.dimension_aliases),
+            "all_terms": list(self.all_terms),
         }

+    @property
+    def matching_terms(self) -> Tuple[str, ...]:
+        """Terms usable for attribute-value matching; falls back to attribute_terms for old callers."""
+        return self.all_terms or self.attribute_terms
+

 @dataclass(frozen=True)
 class StyleIntentProfile:
@@ -370,6 +380,15 @@ class StyleIntentDetector:
             if pair in seen_pairs:
                 continue
             seen_pairs.add(pair)
+            all_terms = tuple(
+                dict.fromkeys(
+                    (
+                        *term_definition.zh_terms,
+                        *term_definition.en_terms,
+                        *term_definition.attribute_terms,
+                    )
+                )
+            )
             detected.append(
                 DetectedStyleIntent(
                     intent_type=intent_type,
@@ -378,6 +397,7 @@ class StyleIntentDetector:
                     matched_query_text=variant.text,
                     attribute_terms=term_definition.attribute_terms,
                     dimension_aliases=definition.dimension_aliases,
+                    all_terms=all_terms,
                 )
             )
             break
```
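The all_terms construction in query/style_intent.py relies on dict.fromkeys preserving insertion order (guaranteed since Python 3.7), which yields a first-seen-order union of the three term groups. A standalone sketch of the same idiom (group values here are made up):

```python
from typing import Tuple


def dedup_preserving_order(*groups: Tuple[str, ...]) -> Tuple[str, ...]:
    """First-seen-order union, mirroring the all_terms construction."""
    return tuple(dict.fromkeys(item for group in groups for item in group))


zh_terms = ("卡其色", "卡其")
en_terms = ("khaki",)
attribute_terms = ("卡其", "khaki")  # overlaps are collapsed, order kept
all_terms = dedup_preserving_order(zh_terms, en_terms, attribute_terms)
```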
search/es_query_builder.py
```diff
@@ -213,6 +213,13 @@ class ESQueryBuilder:
                 "_name": query_name,
                 "query": {"knn": image_knn_query},
                 "score_mode": "max",
+                # Expose the best-matching image entry (url, score) so SKU selection
+                # can promote the SKU whose image_src matches the winning url.
+                "inner_hits": {
+                    "name": f"{query_name}_hits",
+                    "size": 1,
+                    "_source": ["url"],
+                },
             }
         }
         return {
@@ -276,6 +283,13 @@ class ESQueryBuilder:
                 "_name": query_name,
                 "score_mode": "max",
                 "query": {"script_score": script_score_query},
+                # Same rationale as build_image_knn_clause: carry the winning url + score
+                # so downstream SKU selection can consume it without a second ES round-trip.
+                "inner_hits": {
+                    "name": f"{query_name}_hits",
+                    "size": 1,
+                    "_source": ["url"],
+                },
             }
         }
         return {"script_score": {"_name": query_name, **script_score_query}}
```
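For reference, the clause the builder now emits has roughly this shape. The knn body shown here (field, query_vector, num_candidates) is illustrative and not copied from the builder; only the nested/inner_hits structure reflects this change.

```python
query_name = "image_knn_query"

# Hypothetical image_knn_query body; the real one is built elsewhere in ESQueryBuilder.
image_knn_query = {
    "field": "image_embedding.vector",
    "query_vector": [0.1, 0.2, 0.3],
    "num_candidates": 100,
}

clause = {
    "nested": {
        "path": "image_embedding",
        "_name": query_name,
        "query": {"knn": image_knn_query},
        "score_mode": "max",
        # Only the single best-scoring nested image entry comes back, url only.
        "inner_hits": {"name": f"{query_name}_hits", "size": 1, "_source": ["url"]},
    }
}
```

The SKU selector later reads `inner_hits["image_knn_query_hits"]` from each hit and aligns the winning url against SKU.image_src.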
search/searcher.py
```diff
@@ -354,7 +354,9 @@ class Searcher:
         if not includes:
             includes.add("title")

-        if self._has_style_intent(parsed_query):
+        if self._should_run_sku_selection(parsed_query):
+            # SKU-level fields are needed both by text matching (optionN_value) and
+            # by the image pick (image_src) of the unified SKU selector.
             includes.update(
                 {
                     "skus",
@@ -363,6 +365,10 @@ class Searcher:
                     "option3_name",
                 }
             )
+            if self._has_style_intent(parsed_query):
+                # Treated as an additional value source for attribute matching
+                # (on the same dimension as optionN).
+                includes.add("enriched_taxonomy_attributes")

         return {"includes": sorted(includes)}

@@ -435,6 +441,23 @@ class Searcher:
         profile = getattr(parsed_query, "style_intent_profile", None)
         return bool(getattr(profile, "is_active", False))

+    def _has_image_signal(self, parsed_query: Optional[ParsedQuery]) -> bool:
+        """True when the query carries an image vector that can drive an image-based SKU pick."""
+        if parsed_query is None:
+            return False
+        if not getattr(self.config.query_config, "image_embedding_field", None):
+            return False
+        return getattr(parsed_query, "image_query_vector", None) is not None
+
+    def _should_run_sku_selection(self, parsed_query: Optional[ParsedQuery]) -> bool:
+        """Trigger unified SKU selection when either signal is present.
+
+        Text-intent alone drives attribute-value matching; image signal alone drives
+        image-nearest SKU promotion; together, image is a visual tie-breaker inside
+        the text-matched set.
+        """
+        return self._has_style_intent(parsed_query) or self._has_image_signal(parsed_query)
+
     def _apply_style_intent_to_hits(
         self,
         es_hits: List[Dict[str, Any]],
@@ -1067,7 +1090,7 @@ class Searcher:
         if fill_took:
             es_response["took"] = int((es_response.get("took", 0) or 0) + fill_took)

-        if self._has_style_intent(parsed_query):
+        if self._should_run_sku_selection(parsed_query):
             style_intent_decisions = self._apply_style_intent_to_hits(
                 es_response.get("hits", {}).get("hits") or [],
                 parsed_query,
@@ -1075,7 +1098,7 @@ class Searcher:
             )
             if style_intent_decisions:
                 context.logger.info(
-                    "款式意图 SKU 预筛选完成 | hits=%s",
+                    "SKU 选择预处理完成 | hits=%s",
                     len(style_intent_decisions),
                     extra={'reqid': context.reqid, 'uid': context.uid}
                 )
@@ -1221,8 +1244,8 @@ class Searcher:
                 extra={'reqid': context.reqid, 'uid': context.uid}
             )

-        # 非重排窗口:款式意图在 result_processing 之前执行,便于单独计时且与 ES 召回阶段衔接
-        if self._has_style_intent(parsed_query) and not in_rank_window:
+        # 非重排窗口:SKU 选择(款式意图 OR 图像信号)在 result_processing 之前执行,便于单独计时
+        if self._should_run_sku_selection(parsed_query) and not in_rank_window:
             es_hits_pre = es_response.get("hits", {}).get("hits") or []
             style_intent_decisions = self._apply_style_intent_to_hits(
                 es_hits_pre,
@@ -1251,12 +1274,11 @@ class Searcher:
             coarse_debug_by_doc = _index_debug_rows_by_doc(context.get_intermediate_result('coarse_rank_scores', None))
             fine_debug_by_doc = _index_debug_rows_by_doc(context.get_intermediate_result('fine_rank_scores', None))

-            if self._has_style_intent(parsed_query):
-                if style_intent_decisions:
-                    self.style_sku_selector.apply_precomputed_decisions(
-                        es_hits,
-                        style_intent_decisions,
-                    )
+            if self._should_run_sku_selection(parsed_query) and style_intent_decisions:
+                self.style_sku_selector.apply_precomputed_decisions(
+                    es_hits,
+                    style_intent_decisions,
+                )

             # Format results using ResultFormatter
             formatted_results = ResultFormatter.format_search_results(
```
search/sku_intent_selector.py
```diff
@@ -1,14 +1,108 @@
 """
-SKU selection for style-intent-aware search results.
+SKU selection for style-intent-aware and image-aware search results.
+
+Unified algorithm (one pass per hit, no cascading fallback stages):
+
+1. Per active style intent, a SKU's attribute value for that dimension comes
+   from ONE of two sources, in priority order:
+   - ``option``: the SKU's own ``optionN_value`` on the slot resolved by the
+     intent's dimension aliases — authoritative whenever non-empty.
+   - ``taxonomy``: the SPU-level ``enriched_taxonomy_attributes`` value on the
+     same dimension — used only when the SKU has no own value (slot unresolved
+     or value empty). Never overrides a contradicting SKU-level value.
+2. A SKU is "text-matched" iff every active intent finds a match on its
+   selected value source (tokens of zh/en/attribute synonyms; values are first
+   passed through ``_with_segment_boundaries_for_matching`` so brackets and
+   common separators split segments; pure-CJK terms still use a substring
+   fallback when the value is one undivided CJK run, e.g. ``卡其色棉``). We
+   remember the matching source and the raw matched text per intent so the
+   final decision can surface it.
+3. The image-pick comes straight from the nested ``image_embedding`` inner_hits
+   (``exact_image_knn_query_hits`` preferred, ``image_knn_query_hits``
+   otherwise): the SKU whose ``image_src`` equals the top-scoring url.
+4. Unified selection:
+   - if the text-matched set is non-empty → pick image_pick when it lies in
+     that set (visual tie-break among text-matched), otherwise the first
+     text-matched SKU;
+   - else → pick image_pick if any;
+   - else → no decision (``final_source == "none"``).
+
+``final_source`` values (strongest → weakest text evidence):
+   ``option`` > ``taxonomy`` > ``image`` > ``none``. If any intent was satisfied
+   only via taxonomy the overall source degrades to ``taxonomy`` so downstream
+   callers can decide whether to differentiate the SPU-level signal from a
+   true SKU-level option match.
+
+No embedding fallback, no stage cascade, no score thresholds.
 """

 from __future__ import annotations

 from dataclasses import dataclass, field
 from typing import Any, Callable, Dict, List, Optional, Tuple
+from urllib.parse import urlsplit
+
+from query.style_intent import (
+    DetectedStyleIntent,
+    StyleIntentProfile,
+    StyleIntentRegistry,
+)
+from query.tokenization import (
+    contains_han_text,
+    normalize_query_text,
+    simple_tokenize_query,
+)
+
+import re
+
+_NON_HAN_RE = re.compile(r"[^\u4e00-\u9fff]")
+# Zero-width / BOM (often pasted from Excel or CMS).
+_ZW_AND_BOM_RE = re.compile(r"[\u200b-\u200d\ufeff\u2060]")
+# Brackets, slashes, and common commerce/list punctuation → segment boundaries so
+# tokenization can align intent terms (e.g. 卡其色) with the leading segment of
+# 卡其色(无内衬) / 卡其色/常规 / 卡其色·麻 etc., without relying only on substring.
+_ATTRIBUTE_BOUNDARY_RE = re.compile(
+    r"[\s\u3000]"  # ASCII / ideographic space
+    r"|[\(\)\[\]\{\}()【】{}〈〉《》「」『』[]「」]"
+    r"|[/\\||/\︱丨]"
+    r"|[,,、;;::.。]"
+    r"|[·•・]"
+    r"|[~~]"
+    r"|[+\=#%&*×※]"
+    r"|[\u2010-\u2015\u2212]"  # hyphen, en dash, minus, etc.
+)
+
+
+def _is_pure_han(value: str) -> bool:
+    """True if the string is non-empty and contains only CJK Unified Ideographs."""
+    return bool(value) and not _NON_HAN_RE.search(value)
+
+
+def _with_segment_boundaries_for_matching(normalized_value: str) -> str:
+    """Normalize commerce-style option/taxonomy strings for token matching.
+
+    Inserts word boundaries at brackets and typical separators so
+    ``simple_tokenize_query`` yields segments like ``['卡其色', '无内衬']`` instead
+    of one undifferentiated CJK blob when unusual punctuation appears.
+    """
+    if not normalized_value:
+        return ""
+    s = _ZW_AND_BOM_RE.sub("", normalized_value)
+    s = _ATTRIBUTE_BOUNDARY_RE.sub(" ", s)
+    return " ".join(s.split())
+
+
+_IMAGE_INNER_HITS_KEYS: Tuple[str, ...] = (
+    "exact_image_knn_query_hits",
+    "image_knn_query_hits",
+)

-from query.style_intent import StyleIntentProfile, StyleIntentRegistry
-from query.tokenization import normalize_query_text, simple_tokenize_query
+
+@dataclass(frozen=True)
+class ImagePick:
+    sku_id: str
+    url: str
+    score: float


 @dataclass(frozen=True)
@@ -16,31 +110,52 @@ class SkuSelectionDecision:
     selected_sku_id: Optional[str]
     rerank_suffix: str
     selected_text: str
-    matched_stage: str
-    similarity_score: Optional[float] = None
+    # "option" | "taxonomy" | "image" | "none"
+    final_source: str
     resolved_dimensions: Dict[str, Optional[str]] = field(default_factory=dict)
+    # Per-intent matching-source breakdown, e.g. {"color": "option", "size": "taxonomy"}.
+    matched_sources: Dict[str, str] = field(default_factory=dict)
+    image_pick_sku_id: Optional[str] = None
+    image_pick_url: Optional[str] = None
+    image_pick_score: Optional[float] = None
+
+    # Backward-compat alias; some older callers/tests look at ``matched_stage``.
+    @property
+    def matched_stage(self) -> str:
+        return self.final_source

     def to_dict(self) -> Dict[str, Any]:
         return {
             "selected_sku_id": self.selected_sku_id,
             "rerank_suffix": self.rerank_suffix,
             "selected_text": self.selected_text,
-            "matched_stage": self.matched_stage,
-            "similarity_score": self.similarity_score,
+            "final_source": self.final_source,
+            "matched_sources": dict(self.matched_sources),
             "resolved_dimensions": dict(self.resolved_dimensions),
+            "image_pick": (
+                {
+                    "sku_id": self.image_pick_sku_id,
+                    "url": self.image_pick_url,
+                    "score": self.image_pick_score,
+                }
+                if self.image_pick_sku_id or self.image_pick_url
+                else None
+            ),
         }


 @dataclass
 class _SelectionContext:
-    attribute_terms_by_intent: Dict[str, Tuple[str, ...]]
+    """Request-scoped memo for term tokenization and substring match probes."""
+
+    terms_by_intent: Dict[str, Tuple[str, ...]]
     normalized_text_cache: Dict[str, str] = field(default_factory=dict)
     tokenized_text_cache: Dict[str, Tuple[str, ...]] = field(default_factory=dict)
     text_match_cache: Dict[Tuple[str, str], bool] = field(default_factory=dict)


 class StyleSkuSelector:
-    """Selects the best SKU for an SPU based on detected style intent."""
+    """Selects the best SKU per hit from style-intent text match + image KNN."""

     def __init__(
         self,
@@ -49,29 +164,47 @@ class StyleSkuSelector:
         text_encoder_getter: Optional[Callable[[], Any]] = None,
     ) -> None:
         self.registry = registry
+        # Retained for API back-compat; no longer used now that embedding fallback is gone.
         self._text_encoder_getter = text_encoder_getter

+    # ------------------------------------------------------------------
+    # Public entry points
+    # ------------------------------------------------------------------
     def prepare_hits(
         self,
         es_hits: List[Dict[str, Any]],
         parsed_query: Any,
     ) -> Dict[str, SkuSelectionDecision]:
+        """Compute selection decisions (without mutating ``_source``).
+
+        Runs if either a style intent is active OR any hit carries image
+        inner_hits. Decisions are keyed by ES ``_id`` and meant to be applied
+        later via :meth:`apply_precomputed_decisions` (after page fill).
+        """
         decisions: Dict[str, SkuSelectionDecision] = {}
         style_profile = getattr(parsed_query, "style_intent_profile", None)
-        if not isinstance(style_profile, StyleIntentProfile) or not style_profile.is_active:
-            return decisions
-
-        selection_context = self._build_selection_context(style_profile)
+        style_active = (
+            isinstance(style_profile, StyleIntentProfile) and style_profile.is_active
+        )
+        selection_context = (
+            self._build_selection_context(style_profile) if style_active else None
+        )

         for hit in es_hits:
             source = hit.get("_source")
             if not isinstance(source, dict):
                 continue

-            decision = self._select_for_source(
-                source,
-                style_profile=style_profile,
+            image_pick = self._pick_sku_by_image(hit, source)
+            if not style_active and image_pick is None:
+                # Nothing to do for this hit.
+                continue
+
+            decision = self._select(
+                source=source,
+                style_profile=style_profile if style_active else None,
                 selection_context=selection_context,
+                image_pick=image_pick,
             )
             if decision is None:
                 continue
@@ -94,7 +227,6 @@ class StyleSkuSelector:
     ) -> None:
         if not es_hits or not decisions:
             return
-
         for hit in es_hits:
             doc_id = hit.get("_id")
             if doc_id is None:
@@ -111,122 +243,90 @@ class StyleSkuSelector:
             else:
                 hit.pop("_style_rerank_suffix", None)

+    # ------------------------------------------------------------------
+    # Selection context & text matching
+    # ------------------------------------------------------------------
     def _build_selection_context(
         self,
         style_profile: StyleIntentProfile,
     ) -> _SelectionContext:
-        attribute_terms_by_intent: Dict[str, List[str]] = {}
+        terms_by_intent: Dict[str, List[str]] = {}
         for intent in style_profile.intents:
-            terms = attribute_terms_by_intent.setdefault(intent.intent_type, [])
-            for raw_term in intent.attribute_terms:
+            terms = terms_by_intent.setdefault(intent.intent_type, [])
+            for raw_term in intent.matching_terms:
                 normalized_term = normalize_query_text(raw_term)
-                if not normalized_term or normalized_term in terms:
-                    continue
-                terms.append(normalized_term)
-
+                if normalized_term and normalized_term not in terms:
+                    terms.append(normalized_term)
         return _SelectionContext(
-            attribute_terms_by_intent={
+            terms_by_intent={
                 intent_type: tuple(terms)
-                for intent_type, terms in attribute_terms_by_intent.items()
+                for intent_type, terms in terms_by_intent.items()
             },
         )

-    @staticmethod
-    def _normalize_cached(selection_context: _SelectionContext, value: Any) -> str:
+    def _normalize_cached(self, ctx: _SelectionContext, value: Any) -> str:
         raw = str(value or "").strip()
         if not raw:
             return ""
-        cached = selection_context.normalized_text_cache.get(raw)
+        cached = ctx.normalized_text_cache.get(raw)
         if cached is not None:
             return cached
         normalized = normalize_query_text(raw)
-        selection_context.normalized_text_cache[raw] = normalized
+        ctx.normalized_text_cache[raw] = normalized
         return normalized

-    def _resolve_dimensions(
-        self,
-        source: Dict[str, Any],
-        style_profile: StyleIntentProfile,
-    ) -> Dict[str, Optional[str]]:
-        option_names = {
-            "option1_value": normalize_query_text(source.get("option1_name")),
-            "option2_value": normalize_query_text(source.get("option2_name")),
-            "option3_value": normalize_query_text(source.get("option3_name")),
-        }
-        resolved: Dict[str, Optional[str]] = {}
-        for intent in style_profile.intents:
-            if intent.intent_type in resolved:
-                continue
-            aliases = set(intent.dimension_aliases or self.registry.get_dimension_aliases(intent.intent_type))
-            matched_field = None
-            for field_name, option_name in option_names.items():
-                if option_name and option_name in aliases:
-                    matched_field = field_name
-                    break
-            resolved[intent.intent_type] = matched_field
-        return resolved
-
-    @staticmethod
-    def _empty_decision(
-        resolved_dimensions: Dict[str, Optional[str]],
-        matched_stage: str,
-    ) -> SkuSelectionDecision:
```
| 174 | - return SkuSelectionDecision( | ||
| 175 | - selected_sku_id=None, | ||
| 176 | - rerank_suffix="", | ||
| 177 | - selected_text="", | ||
| 178 | - matched_stage=matched_stage, | ||
| 179 | - resolved_dimensions=dict(resolved_dimensions), | 278 | + def _tokenize_cached(self, ctx: _SelectionContext, value: str) -> Tuple[str, ...]: |
| 279 | + normalized_value = normalize_query_text(value) | ||
| 280 | + if not normalized_value: | ||
| 281 | + return () | ||
| 282 | + cached = ctx.tokenized_text_cache.get(normalized_value) | ||
| 283 | + if cached is not None: | ||
| 284 | + return cached | ||
| 285 | + tokens = tuple( | ||
| 286 | + normalize_query_text(token) | ||
| 287 | + for token in simple_tokenize_query(normalized_value) | ||
| 288 | + if token | ||
| 180 | ) | 289 | ) |
| 290 | + ctx.tokenized_text_cache[normalized_value] = tokens | ||
| 291 | + return tokens | ||
| 181 | 292 | ||
| 182 | def _is_text_match( | 293 | def _is_text_match( |
| 183 | self, | 294 | self, |
| 184 | intent_type: str, | 295 | intent_type: str, |
| 185 | - selection_context: _SelectionContext, | 296 | + ctx: _SelectionContext, |
| 186 | *, | 297 | *, |
| 187 | normalized_value: str, | 298 | normalized_value: str, |
| 188 | ) -> bool: | 299 | ) -> bool: |
| 300 | + """True iff any intent term token-boundary matches the given value.""" | ||
| 189 | if not normalized_value: | 301 | if not normalized_value: |
| 190 | return False | 302 | return False |
| 191 | - | ||
| 192 | cache_key = (intent_type, normalized_value) | 303 | cache_key = (intent_type, normalized_value) |
| 193 | - cached = selection_context.text_match_cache.get(cache_key) | 304 | + cached = ctx.text_match_cache.get(cache_key) |
| 194 | if cached is not None: | 305 | if cached is not None: |
| 195 | return cached | 306 | return cached |
| 196 | 307 | ||
| 197 | - attribute_terms = selection_context.attribute_terms_by_intent.get(intent_type, ()) | ||
| 198 | - value_tokens = self._tokenize_cached(selection_context, normalized_value) | 308 | + terms = ctx.terms_by_intent.get(intent_type, ()) |
| 309 | + segmented = _with_segment_boundaries_for_matching(normalized_value) | ||
| 310 | + value_tokens = self._tokenize_cached(ctx, segmented) | ||
| 199 | matched = any( | 311 | matched = any( |
| 200 | self._matches_term_tokens( | 312 | self._matches_term_tokens( |
| 201 | term=term, | 313 | term=term, |
| 202 | value_tokens=value_tokens, | 314 | value_tokens=value_tokens, |
| 203 | - selection_context=selection_context, | 315 | + ctx=ctx, |
| 204 | normalized_value=normalized_value, | 316 | normalized_value=normalized_value, |
| 205 | ) | 317 | ) |
| 206 | - for term in attribute_terms | 318 | + for term in terms |
| 207 | if term | 319 | if term |
| 208 | ) | 320 | ) |
| 209 | - selection_context.text_match_cache[cache_key] = matched | 321 | + ctx.text_match_cache[cache_key] = matched |
| 210 | return matched | 322 | return matched |
| 211 | 323 | ||
| 212 | - @staticmethod | ||
| 213 | - def _tokenize_cached(selection_context: _SelectionContext, value: str) -> Tuple[str, ...]: | ||
| 214 | - normalized_value = normalize_query_text(value) | ||
| 215 | - if not normalized_value: | ||
| 216 | - return () | ||
| 217 | - cached = selection_context.tokenized_text_cache.get(normalized_value) | ||
| 218 | - if cached is not None: | ||
| 219 | - return cached | ||
| 220 | - tokens = tuple(normalize_query_text(token) for token in simple_tokenize_query(normalized_value) if token) | ||
| 221 | - selection_context.tokenized_text_cache[normalized_value] = tokens | ||
| 222 | - return tokens | ||
| 223 | - | ||
| 224 | def _matches_term_tokens( | 324 | def _matches_term_tokens( |
| 225 | self, | 325 | self, |
| 226 | *, | 326 | *, |
| 227 | term: str, | 327 | term: str, |
| 228 | value_tokens: Tuple[str, ...], | 328 | value_tokens: Tuple[str, ...], |
| 229 | - selection_context: _SelectionContext, | 329 | + ctx: _SelectionContext, |
| 230 | normalized_value: str, | 330 | normalized_value: str, |
| 231 | ) -> bool: | 331 | ) -> bool: |
| 232 | normalized_term = normalize_query_text(term) | 332 | normalized_term = normalize_query_text(term) |
| @@ -234,8 +334,13 @@ class StyleSkuSelector: | ||
| 234 | return False | 334 | return False |
| 235 | if normalized_term == normalized_value: | 335 | if normalized_term == normalized_value: |
| 236 | return True | 336 | return True |
| 237 | - | ||
| 238 | - term_tokens = self._tokenize_cached(selection_context, normalized_term) | 337 | + # Pure-CJK terms can't be split further by the whitespace/regex tokenizer |
| 338 | + # ("卡其色棉" is one token), so sliding-window token match would miss the prefix. | ||
| 339 | + # Fall back to normalized substring containment — safe because this branch | ||
| 340 | + # never triggers for Latin tokens where substring would cause "l" ⊂ "xl" issues. | ||
| 341 | + if _is_pure_han(normalized_term) and contains_han_text(normalized_value): | ||
| 342 | + return normalized_term in normalized_value | ||
| 343 | + term_tokens = self._tokenize_cached(ctx, normalized_term) | ||
| 239 | if not term_tokens or not value_tokens: | 344 | if not term_tokens or not value_tokens: |
| 240 | return normalized_term in normalized_value | 345 | return normalized_term in normalized_value |
| 241 | 346 | ||
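The hunk above implements the term-matching rule: exact match first, a substring fallback for pure-CJK terms (which the whitespace tokenizer cannot split), and otherwise a sliding token window so that, e.g., "l" never matches inside "xl". A standalone sketch of that rule, with illustrative helper names rather than the repo's actual `_is_pure_han` / `_tokenize_cached` internals:

```python
import re

_HAN_RE = re.compile(r"^[\u4e00-\u9fff]+$")

def is_pure_han(text: str) -> bool:
    """True if every character is a CJK unified ideograph."""
    return bool(_HAN_RE.match(text))

def term_matches(term: str, value: str) -> bool:
    """Token-boundary match; pure-CJK terms fall back to substring containment."""
    term, value = term.strip().lower(), value.strip().lower()
    if not term:
        return False
    if term == value:
        return True
    # CJK text has no whitespace boundaries, so "卡其" should hit "卡其色棉".
    if is_pure_han(term) and any("\u4e00" <= ch <= "\u9fff" for ch in value):
        return term in value
    term_tokens = term.split()
    value_tokens = value.split()
    n, m = len(term_tokens), len(value_tokens)
    if n == 0 or n > m:
        return False
    # Sliding window over whole tokens: avoids "l" matching inside "xl".
    return any(value_tokens[i:i + n] == term_tokens for i in range(m - n + 1))

print(term_matches("卡其", "卡其色棉"))               # True via CJK fallback
print(term_matches("l", "size xl"))                    # False: token boundaries hold
print(term_matches("slim fit", "men slim fit jeans"))  # True via token window
```

The CJK fallback is safe precisely because it is gated on `is_pure_han(term)`: Latin terms never reach the substring branch.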
| @@ -243,106 +348,383 @@ class StyleSkuSelector: | ||
| 243 | value_length = len(value_tokens) | 348 | value_length = len(value_tokens) |
| 244 | if term_length > value_length: | 349 | if term_length > value_length: |
| 245 | return False | 350 | return False |
| 246 | - | ||
| 247 | for start in range(value_length - term_length + 1): | 351 | for start in range(value_length - term_length + 1): |
| 248 | - if value_tokens[start:start + term_length] == term_tokens: | 352 | + if value_tokens[start : start + term_length] == term_tokens: |
| 249 | return True | 353 | return True |
| 250 | return False | 354 | return False |
| 251 | 355 | ||
| 252 | - def _find_first_text_match( | 356 | + # ------------------------------------------------------------------ |
| 357 | + # Dimension resolution (option slot + taxonomy values) | ||
| 358 | + # ------------------------------------------------------------------ | ||
| 359 | + def _resolve_dimensions( | ||
| 253 | self, | 360 | self, |
| 254 | - skus: List[Dict[str, Any]], | ||
| 255 | - resolved_dimensions: Dict[str, Optional[str]], | ||
| 256 | - selection_context: _SelectionContext, | ||
| 257 | - ) -> Optional[Tuple[str, str]]: | ||
| 258 | - for sku in skus: | ||
| 259 | - selection_parts: List[str] = [] | ||
| 260 | - seen_parts: set[str] = set() | ||
| 261 | - matched = True | ||
| 262 | - | ||
| 263 | - for intent_type, field_name in resolved_dimensions.items(): | ||
| 264 | - if not field_name: | ||
| 265 | - matched = False | ||
| 266 | - break | ||
| 267 | - | ||
| 268 | - raw_value = str(sku.get(field_name) or "").strip() | ||
| 269 | - normalized_value = self._normalize_cached(selection_context, raw_value) | ||
| 270 | - if not self._is_text_match( | ||
| 271 | - intent_type, | ||
| 272 | - selection_context, | ||
| 273 | - normalized_value=normalized_value, | ||
| 274 | - ): | ||
| 275 | - matched = False | 361 | + source: Dict[str, Any], |
| 362 | + style_profile: StyleIntentProfile, | ||
| 363 | + ) -> Dict[str, Optional[str]]: | ||
| 364 | + option_fields = ( | ||
| 365 | + ("option1_value", source.get("option1_name")), | ||
| 366 | + ("option2_value", source.get("option2_name")), | ||
| 367 | + ("option3_value", source.get("option3_name")), | ||
| 368 | + ) | ||
| 369 | + option_aliases = [ | ||
| 370 | + (field_name, normalize_query_text(name)) | ||
| 371 | + for field_name, name in option_fields | ||
| 372 | + ] | ||
| 373 | + resolved: Dict[str, Optional[str]] = {} | ||
| 374 | + for intent in style_profile.intents: | ||
| 375 | + if intent.intent_type in resolved: | ||
| 376 | + continue | ||
| 377 | + aliases = set( | ||
| 378 | + intent.dimension_aliases | ||
| 379 | + or self.registry.get_dimension_aliases(intent.intent_type) | ||
| 380 | + ) | ||
| 381 | + matched_field: Optional[str] = None | ||
| 382 | + for field_name, option_name in option_aliases: | ||
| 383 | + if option_name and option_name in aliases: | ||
| 384 | + matched_field = field_name | ||
| 276 | break | 385 | break |
| 386 | + resolved[intent.intent_type] = matched_field | ||
| 387 | + return resolved | ||
| 277 | 388 | ||
| 278 | - if raw_value and normalized_value not in seen_parts: | ||
| 279 | - seen_parts.add(normalized_value) | ||
| 280 | - selection_parts.append(raw_value) | 389 | + def _collect_taxonomy_values( |
| 390 | + self, | ||
| 391 | + source: Dict[str, Any], | ||
| 392 | + style_profile: StyleIntentProfile, | ||
| 393 | + ) -> Dict[str, Tuple[Tuple[str, str], ...]]: | ||
| 394 | + """Extract SPU-level enriched_taxonomy_attributes values per intent dimension. | ||
| 395 | + | ||
| 396 | + Returns a mapping ``intent_type -> ((normalized, raw), ...)`` so the | ||
| 397 | + selection layer can (a) match against ``normalized`` and (b) surface | ||
| 398 | + the human-readable ``raw`` form in ``selected_text``. | ||
| 399 | + """ | ||
| 400 | + attrs = source.get("enriched_taxonomy_attributes") | ||
| 401 | + if not isinstance(attrs, list) or not attrs: | ||
| 402 | + return {} | ||
| 403 | + aliases_by_intent = { | ||
| 404 | + intent.intent_type: set( | ||
| 405 | + intent.dimension_aliases | ||
| 406 | + or self.registry.get_dimension_aliases(intent.intent_type) | ||
| 407 | + ) | ||
| 408 | + for intent in style_profile.intents | ||
| 409 | + } | ||
| 410 | + values_by_intent: Dict[str, List[Tuple[str, str]]] = { | ||
| 411 | + t: [] for t in aliases_by_intent | ||
| 412 | + } | ||
| 413 | + for attr in attrs: | ||
| 414 | + if not isinstance(attr, dict): | ||
| 415 | + continue | ||
| 416 | + attr_name = normalize_query_text(attr.get("name")) | ||
| 417 | + if not attr_name: | ||
| 418 | + continue | ||
| 419 | + matching_intents = [ | ||
| 420 | + t for t, aliases in aliases_by_intent.items() if attr_name in aliases | ||
| 421 | + ] | ||
| 422 | + if not matching_intents: | ||
| 423 | + continue | ||
| 424 | + for raw_text in _iter_multilingual_texts(attr.get("value")): | ||
| 425 | + raw = str(raw_text).strip() | ||
| 426 | + if not raw: | ||
| 427 | + continue | ||
| 428 | + normalized = normalize_query_text(raw) | ||
| 429 | + if not normalized: | ||
| 430 | + continue | ||
| 431 | + for intent_type in matching_intents: | ||
| 432 | + bucket = values_by_intent[intent_type] | ||
| 433 | + if not any(existing_norm == normalized for existing_norm, _ in bucket): | ||
| 434 | + bucket.append((normalized, raw)) | ||
| 435 | + return {t: tuple(v) for t, v in values_by_intent.items() if v} | ||
| 436 | + | ||
| 437 | + # ------------------------------------------------------------------ | ||
| 438 | + # Image pick | ||
| 439 | + # ------------------------------------------------------------------ | ||
| 440 | + @staticmethod | ||
| 441 | + def _normalize_url(url: Any) -> str: | ||
| 442 | + raw = str(url or "").strip() | ||
| 443 | + if not raw: | ||
| 444 | + return "" | ||
| 445 | + # Accept protocol-relative URLs like "//cdn/..." or full URLs. | ||
| 446 | + if raw.startswith("//"): | ||
| 447 | + raw = "https:" + raw | ||
| 448 | + try: | ||
| 449 | + parts = urlsplit(raw) | ||
| 450 | + except ValueError: | ||
| 451 | + return raw.casefold() | ||
| 452 | + host = (parts.netloc or "").casefold() | ||
| 453 | + path = parts.path or "" | ||
| 454 | + return f"{host}{path}".casefold() | ||
| 455 | + | ||
| 456 | + def _pick_sku_by_image( | ||
| 457 | + self, | ||
| 458 | + hit: Dict[str, Any], | ||
| 459 | + source: Dict[str, Any], | ||
| 460 | + ) -> Optional[ImagePick]: | ||
| 461 | + inner_hits = hit.get("inner_hits") | ||
| 462 | + if not isinstance(inner_hits, dict): | ||
| 463 | + return None | ||
| 464 | + top_url: Optional[str] = None | ||
| 465 | + top_score: Optional[float] = None | ||
| 466 | + for key in _IMAGE_INNER_HITS_KEYS: | ||
| 467 | + payload = inner_hits.get(key) | ||
| 468 | + if not isinstance(payload, dict): | ||
| 469 | + continue | ||
| 470 | + hits_block = payload.get("hits") | ||
| 471 | + inner_list = hits_block.get("hits") if isinstance(hits_block, dict) else None | ||
| 472 | + if not isinstance(inner_list, list) or not inner_list: | ||
| 473 | + continue | ||
| 474 | + for entry in inner_list: | ||
| 475 | + if not isinstance(entry, dict): | ||
| 476 | + continue | ||
| 477 | + url = (entry.get("_source") or {}).get("url") | ||
| 478 | + if not url: | ||
| 479 | + continue | ||
| 480 | + try: | ||
| 481 | + score = float(entry.get("_score") or 0.0) | ||
| 482 | + except (TypeError, ValueError): | ||
| 483 | + score = 0.0 | ||
| 484 | + if top_score is None or score > top_score: | ||
| 485 | + top_url = str(url) | ||
| 486 | + top_score = score | ||
| 487 | + if top_url is not None: | ||
| 488 | + break # Prefer the first listed inner_hits source (exact > approx). | ||
| 489 | + if top_url is None: | ||
| 490 | + return None | ||
| 281 | 491 | ||
| 282 | - if matched: | ||
| 283 | - return str(sku.get("sku_id") or ""), " ".join(selection_parts).strip() | 492 | + skus = source.get("skus") |
| 493 | + if not isinstance(skus, list): | ||
| 494 | + return None | ||
| 495 | + target = self._normalize_url(top_url) | ||
| 496 | + for sku in skus: | ||
| 497 | + sku_url = self._normalize_url(sku.get("image_src") or sku.get("imageSrc")) | ||
| 498 | + if sku_url and sku_url == target: | ||
| 499 | + return ImagePick( | ||
| 500 | + sku_id=str(sku.get("sku_id") or ""), | ||
| 501 | + url=top_url, | ||
| 502 | + score=float(top_score or 0.0), | ||
| 503 | + ) | ||
| 284 | return None | 504 | return None |
| 285 | 505 | ||
| 286 | - def _select_for_source( | 506 | + # ------------------------------------------------------------------ |
| 507 | + # Unified per-hit selection | ||
| 508 | + # ------------------------------------------------------------------ | ||
| 509 | + def _select( | ||
| 287 | self, | 510 | self, |
| 288 | - source: Dict[str, Any], | ||
| 289 | *, | 511 | *, |
| 290 | - style_profile: StyleIntentProfile, | ||
| 291 | - selection_context: _SelectionContext, | 512 | + source: Dict[str, Any], |
| 513 | + style_profile: Optional[StyleIntentProfile], | ||
| 514 | + selection_context: Optional[_SelectionContext], | ||
| 515 | + image_pick: Optional[ImagePick], | ||
| 292 | ) -> Optional[SkuSelectionDecision]: | 516 | ) -> Optional[SkuSelectionDecision]: |
| 293 | skus = source.get("skus") | 517 | skus = source.get("skus") |
| 294 | if not isinstance(skus, list) or not skus: | 518 | if not isinstance(skus, list) or not skus: |
| 295 | return None | 519 | return None |
| 296 | 520 | ||
| 297 | - resolved_dimensions = self._resolve_dimensions(source, style_profile) | ||
| 298 | - if not resolved_dimensions or any(not field_name for field_name in resolved_dimensions.values()): | ||
| 299 | - return self._empty_decision(resolved_dimensions, matched_stage="unresolved") | 521 | + resolved_dimensions: Dict[str, Optional[str]] = {} |
| 522 | + text_matched: List[Tuple[Dict[str, Any], Dict[str, Tuple[str, str]]]] = [] | ||
| 523 | + | ||
| 524 | + if style_profile is not None and selection_context is not None: | ||
| 525 | + resolved_dimensions = self._resolve_dimensions(source, style_profile) | ||
| 526 | + taxonomy_values = self._collect_taxonomy_values(source, style_profile) | ||
| 527 | + # Only attempt text match when there is at least one value source | ||
| 528 | + # per intent (SKU option or SPU taxonomy). | ||
| 529 | + if all( | ||
| 530 | + resolved_dimensions.get(intent.intent_type) is not None | ||
| 531 | + or taxonomy_values.get(intent.intent_type) | ||
| 532 | + for intent in style_profile.intents | ||
| 533 | + ): | ||
| 534 | + text_matched = self._find_text_matched_skus( | ||
| 535 | + skus=skus, | ||
| 536 | + style_profile=style_profile, | ||
| 537 | + resolved_dimensions=resolved_dimensions, | ||
| 538 | + taxonomy_values=taxonomy_values, | ||
| 539 | + ctx=selection_context, | ||
| 540 | + ) | ||
| 541 | + | ||
| 542 | + selected_sku_id: Optional[str] = None | ||
| 543 | + selected_text = "" | ||
| 544 | + final_source = "none" | ||
| 545 | + matched_sources: Dict[str, str] = {} | ||
| 546 | + | ||
| 547 | + if text_matched: | ||
| 548 | + chosen_sku, per_intent = self._choose_among_text_matched( | ||
| 549 | + text_matched, image_pick | ||
| 550 | + ) | ||
| 551 | + selected_sku_id = str(chosen_sku.get("sku_id") or "") or None | ||
| 552 | + selected_text = self._text_from_matches(per_intent) | ||
| 553 | + matched_sources = { | ||
| 554 | + intent_type: src for intent_type, (src, _) in per_intent.items() | ||
| 555 | + } | ||
| 556 | + final_source = ( | ||
| 557 | + "taxonomy" if "taxonomy" in matched_sources.values() else "option" | ||
| 558 | + ) | ||
| 559 | + elif image_pick is not None: | ||
| 560 | + image_sku = self._find_sku_by_id(skus, image_pick.sku_id) | ||
| 561 | + if image_sku is not None: | ||
| 562 | + selected_sku_id = image_pick.sku_id or None | ||
| 563 | + selected_text = self._build_selected_text(image_sku, resolved_dimensions) | ||
| 564 | + final_source = "image" | ||
| 300 | 565 | ||
| 301 | - text_match = self._find_first_text_match(skus, resolved_dimensions, selection_context) | ||
| 302 | - if text_match is None: | ||
| 303 | - return self._empty_decision(resolved_dimensions, matched_stage="no_match") | ||
| 304 | - return self._build_decision( | ||
| 305 | - selected_sku_id=text_match[0], | ||
| 306 | - selected_text=text_match[1], | 566 | + return SkuSelectionDecision( |
| 567 | + selected_sku_id=selected_sku_id, | ||
| 568 | + rerank_suffix=selected_text, | ||
| 569 | + selected_text=selected_text, | ||
| 570 | + final_source=final_source, | ||
| 307 | resolved_dimensions=resolved_dimensions, | 571 | resolved_dimensions=resolved_dimensions, |
| 308 | - matched_stage="text", | 572 | + matched_sources=matched_sources, |
| 573 | + image_pick_sku_id=(image_pick.sku_id or None) if image_pick else None, | ||
| 574 | + image_pick_url=image_pick.url if image_pick else None, | ||
| 575 | + image_pick_score=image_pick.score if image_pick else None, | ||
| 309 | ) | 576 | ) |
| 310 | 577 | ||
| 311 | - @staticmethod | ||
| 312 | - def _build_decision( | ||
| 313 | - selected_sku_id: str, | ||
| 314 | - selected_text: str, | ||
| 315 | - resolved_dimensions: Dict[str, Optional[str]], | 578 | + def _find_text_matched_skus( |
| 579 | + self, | ||
| 316 | *, | 580 | *, |
| 317 | - matched_stage: str, | ||
| 318 | - similarity_score: Optional[float] = None, | ||
| 319 | - ) -> SkuSelectionDecision: | ||
| 320 | - return SkuSelectionDecision( | ||
| 321 | - selected_sku_id=selected_sku_id or None, | ||
| 322 | - rerank_suffix=str(selected_text or "").strip(), | ||
| 323 | - selected_text=str(selected_text or "").strip(), | ||
| 324 | - matched_stage=matched_stage, | ||
| 325 | - similarity_score=similarity_score, | ||
| 326 | - resolved_dimensions=dict(resolved_dimensions), | ||
| 327 | - ) | 581 | + skus: List[Dict[str, Any]], |
| 582 | + style_profile: StyleIntentProfile, | ||
| 583 | + resolved_dimensions: Dict[str, Optional[str]], | ||
| 584 | + taxonomy_values: Dict[str, Tuple[Tuple[str, str], ...]], | ||
| 585 | + ctx: _SelectionContext, | ||
| 586 | + ) -> List[Tuple[Dict[str, Any], Dict[str, Tuple[str, str]]]]: | ||
| 587 | + """Return every SKU that satisfies every active intent, with match meta. | ||
| 588 | + | ||
| 589 | + Authority rule per intent: | ||
| 590 | + - If the SKU has a non-empty value on the resolved option slot, that | ||
| 591 | + value ALONE decides the match (source = ``option``). Taxonomy cannot | ||
| 592 | + override a contradicting SKU-level value. | ||
| 593 | + - Only when the SKU has no own value on the dimension (slot unresolved | ||
| 594 | + or value empty) does the SPU-level taxonomy serve as the fallback | ||
| 595 | + value source (source = ``taxonomy``). | ||
| 596 | + | ||
| 597 | + For each matched SKU we also return a per-intent dict mapping | ||
| 598 | + ``intent_type -> (source, raw_matched_text)`` so the final decision can | ||
| 599 | + surface the genuinely matched string in ``selected_text`` / | ||
| 600 | + ``rerank_suffix`` rather than, e.g., a SKU's unrelated option value. | ||
| 601 | + """ | ||
| 602 | + matched: List[Tuple[Dict[str, Any], Dict[str, Tuple[str, str]]]] = [] | ||
| 603 | + for sku in skus: | ||
| 604 | + per_intent: Dict[str, Tuple[str, str]] = {} | ||
| 605 | + all_ok = True | ||
| 606 | + for intent in style_profile.intents: | ||
| 607 | + slot = resolved_dimensions.get(intent.intent_type) | ||
| 608 | + sku_raw = str(sku.get(slot) or "").strip() if slot else "" | ||
| 609 | + sku_norm = normalize_query_text(sku_raw) if sku_raw else "" | ||
| 610 | + | ||
| 611 | + if sku_norm: | ||
| 612 | + if self._is_text_match( | ||
| 613 | + intent.intent_type, ctx, normalized_value=sku_norm | ||
| 614 | + ): | ||
| 615 | + per_intent[intent.intent_type] = ("option", sku_raw) | ||
| 616 | + else: | ||
| 617 | + all_ok = False | ||
| 618 | + break | ||
| 619 | + else: | ||
| 620 | + matched_raw: Optional[str] = None | ||
| 621 | + for tax_norm, tax_raw in taxonomy_values.get( | ||
| 622 | + intent.intent_type, () | ||
| 623 | + ): | ||
| 624 | + if self._is_text_match( | ||
| 625 | + intent.intent_type, ctx, normalized_value=tax_norm | ||
| 626 | + ): | ||
| 627 | + matched_raw = tax_raw | ||
| 628 | + break | ||
| 629 | + if matched_raw is None: | ||
| 630 | + all_ok = False | ||
| 631 | + break | ||
| 632 | + per_intent[intent.intent_type] = ("taxonomy", matched_raw) | ||
| 633 | + if all_ok: | ||
| 634 | + matched.append((sku, per_intent)) | ||
| 635 | + return matched | ||
| 636 | + | ||
| 637 | + @staticmethod | ||
| 638 | + def _choose_among_text_matched( | ||
| 639 | + text_matched: List[Tuple[Dict[str, Any], Dict[str, Tuple[str, str]]]], | ||
| 640 | + image_pick: Optional[ImagePick], | ||
| 641 | + ) -> Tuple[Dict[str, Any], Dict[str, Tuple[str, str]]]: | ||
| 642 | + """Image-visual tie-break inside the text-matched set; else first match.""" | ||
| 643 | + if image_pick and image_pick.sku_id: | ||
| 644 | + for sku, per_intent in text_matched: | ||
| 645 | + if str(sku.get("sku_id") or "") == image_pick.sku_id: | ||
| 646 | + return sku, per_intent | ||
| 647 | + return text_matched[0] | ||
| 648 | + | ||
| 649 | + @staticmethod | ||
| 650 | + def _text_from_matches(per_intent: Dict[str, Tuple[str, str]]) -> str: | ||
| 651 | + """Join the genuinely matched raw strings in intent declaration order.""" | ||
| 652 | + parts: List[str] = [] | ||
| 653 | + seen: set[str] = set() | ||
| 654 | + for _, raw in per_intent.values(): | ||
| 655 | + if raw and raw not in seen: | ||
| 656 | + seen.add(raw) | ||
| 657 | + parts.append(raw) | ||
| 658 | + return " ".join(parts).strip() | ||
| 328 | 659 | ||
| 329 | @staticmethod | 660 | @staticmethod |
| 330 | - def _apply_decision_to_source(source: Dict[str, Any], decision: SkuSelectionDecision) -> None: | 661 | + def _find_sku_by_id( |
| 662 | + skus: List[Dict[str, Any]], sku_id: Optional[str] | ||
| 663 | + ) -> Optional[Dict[str, Any]]: | ||
| 664 | + if not sku_id: | ||
| 665 | + return None | ||
| 666 | + for sku in skus: | ||
| 667 | + if str(sku.get("sku_id") or "") == sku_id: | ||
| 668 | + return sku | ||
| 669 | + return None | ||
| 670 | + | ||
| 671 | + @staticmethod | ||
| 672 | + def _build_selected_text( | ||
| 673 | + sku: Dict[str, Any], | ||
| 674 | + resolved_dimensions: Dict[str, Optional[str]], | ||
| 675 | + ) -> str: | ||
| 676 | + """Text carried into rerank doc suffix: joined raw values on the resolved slots.""" | ||
| 677 | + parts: List[str] = [] | ||
| 678 | + seen: set[str] = set() | ||
| 679 | + for slot in resolved_dimensions.values(): | ||
| 680 | + if not slot: | ||
| 681 | + continue | ||
| 682 | + raw = str(sku.get(slot) or "").strip() | ||
| 683 | + if raw and raw not in seen: | ||
| 684 | + seen.add(raw) | ||
| 685 | + parts.append(raw) | ||
| 686 | + return " ".join(parts).strip() | ||
| 687 | + | ||
| 688 | + # ------------------------------------------------------------------ | ||
| 689 | + # Source mutation (applied after page fill) | ||
| 690 | + # ------------------------------------------------------------------ | ||
| 691 | + @staticmethod | ||
| 692 | + def _apply_decision_to_source( | ||
| 693 | + source: Dict[str, Any], decision: SkuSelectionDecision | ||
| 694 | + ) -> None: | ||
| 695 | + if not decision.selected_sku_id: | ||
| 696 | + return | ||
| 331 | skus = source.get("skus") | 697 | skus = source.get("skus") |
| 332 | - if not isinstance(skus, list) or not skus or not decision.selected_sku_id: | 698 | + if not isinstance(skus, list) or not skus: |
| 333 | return | 699 | return |
| 334 | - | ||
| 335 | - selected_index = None | 700 | + selected_index: Optional[int] = None |
| 336 | for index, sku in enumerate(skus): | 701 | for index, sku in enumerate(skus): |
| 337 | if str(sku.get("sku_id") or "") == decision.selected_sku_id: | 702 | if str(sku.get("sku_id") or "") == decision.selected_sku_id: |
| 338 | selected_index = index | 703 | selected_index = index |
| 339 | break | 704 | break |
| 340 | if selected_index is None: | 705 | if selected_index is None: |
| 341 | return | 706 | return |
| 342 | - | ||
| 343 | selected_sku = skus.pop(selected_index) | 707 | selected_sku = skus.pop(selected_index) |
| 344 | skus.insert(0, selected_sku) | 708 | skus.insert(0, selected_sku) |
| 345 | - | ||
| 346 | image_src = selected_sku.get("image_src") or selected_sku.get("imageSrc") | 709 | image_src = selected_sku.get("image_src") or selected_sku.get("imageSrc") |
| 347 | if image_src: | 710 | if image_src: |
| 348 | source["image_url"] = image_src | 711 | source["image_url"] = image_src |
| 712 | + | ||
| 713 | + | ||
| 714 | +def _iter_multilingual_texts(value: Any) -> List[str]: | ||
| 715 | + """Flatten a value that may be str, list, or multilingual dict {zh, en, ...}.""" | ||
| 716 | + if value is None: | ||
| 717 | + return [] | ||
| 718 | + if isinstance(value, str): | ||
| 719 | + return [value] if value.strip() else [] | ||
| 720 | + if isinstance(value, dict): | ||
| 721 | + out: List[str] = [] | ||
| 722 | + for v in value.values(): | ||
| 723 | + out.extend(_iter_multilingual_texts(v)) | ||
| 724 | + return out | ||
| 725 | + if isinstance(value, (list, tuple)): | ||
| 726 | + out = [] | ||
| 727 | + for v in value: | ||
| 728 | + out.extend(_iter_multilingual_texts(v)) | ||
| 729 | + return out | ||
| 730 | + return [] |
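The image tie-break in this file keys SKUs by a normalized URL so that inner-hit URLs and `image_src` values compare equal across scheme and host casing. A minimal re-creation of that `_normalize_url` logic (host + path, case-folded, accepting protocol-relative `//cdn/...` URLs):

```python
from urllib.parse import urlsplit

def normalize_url(url: object) -> str:
    """Case-folded host+path; accepts protocol-relative '//cdn/...' URLs."""
    raw = str(url or "").strip()
    if not raw:
        return ""
    if raw.startswith("//"):
        raw = "https:" + raw  # give urlsplit a scheme to parse against
    try:
        parts = urlsplit(raw)
    except ValueError:
        return raw.casefold()
    host = (parts.netloc or "").casefold()
    return f"{host}{parts.path or ''}".casefold()

# Protocol-relative and full URLs with different host casing now compare equal:
print(normalize_url("//cdn.example.com/a.jpg")
      == normalize_url("https://CDN.example.com/a.jpg"))  # True
```

Dropping the scheme (and query string) keeps the comparison robust to CDN rewrites while still distinguishing different images by path.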
tests/test_search_rerank_window.py
| @@ -231,19 +231,6 @@ def _build_searcher(config: SearchConfig, es_client: _FakeESClient) -> Searcher: | ||
| 231 | return searcher | 231 | return searcher |
| 232 | 232 | ||
| 233 | 233 | ||
| 234 | -class _FakeTextEncoder: | ||
| 235 | - def __init__(self, vectors: Dict[str, List[float]]): | ||
| 236 | - self.vectors = { | ||
| 237 | - key: np.array(value, dtype=np.float32) | ||
| 238 | - for key, value in vectors.items() | ||
| 239 | - } | ||
| 240 | - | ||
| 241 | - def encode(self, sentences, priority: int = 0, **kwargs): | ||
| 242 | - if isinstance(sentences, str): | ||
| 243 | - sentences = [sentences] | ||
| 244 | - return np.array([self.vectors[text] for text in sentences], dtype=object) | ||
| 245 | - | ||
| 246 | - | ||
| 247 | def test_config_loader_rerank_enabled_defaults_true(tmp_path: Path): | 234 | def test_config_loader_rerank_enabled_defaults_true(tmp_path: Path): |
| 248 | config_data = { | 235 | config_data = { |
| 249 | "es_index_name": "test_products", | 236 | "es_index_name": "test_products", |
| @@ -611,7 +598,14 @@ def test_searcher_rerank_prefetch_source_includes_sku_fields_when_style_intent_a | ||
| 611 | 598 | ||
| 612 | assert es_client.calls[0]["body"]["_source"] is False | 599 | assert es_client.calls[0]["body"]["_source"] is False |
| 613 | assert es_client.calls[1]["body"]["_source"] == { | 600 | assert es_client.calls[1]["body"]["_source"] == { |
| 614 | - "includes": ["option1_name", "option2_name", "option3_name", "skus", "title"] | 601 | + "includes": [ |
| 602 | + "enriched_taxonomy_attributes", | ||
| 603 | + "option1_name", | ||
| 604 | + "option2_name", | ||
| 605 | + "option3_name", | ||
| 606 | + "skus", | ||
| 607 | + "title", | ||
| 608 | + ] | ||
| 615 | } | 609 | } |
| 616 | 610 | ||
| 617 | 611 | ||
| @@ -944,78 +938,6 @@ def test_searcher_skips_sku_selection_when_option_name_does_not_match_dimension_ | ||
| 944 | assert result.results[0].image_url == "https://img/default.jpg" | 938 | assert result.results[0].image_url == "https://img/default.jpg" |
| 945 | 939 | ||
| 946 | 940 | ||
| 947 | -def test_searcher_promotes_sku_by_embedding_when_query_has_no_direct_option_match(monkeypatch): | ||
| 948 | - es_client = _FakeESClient(total_hits=1) | ||
| 949 | - searcher = _build_searcher(_build_search_config(rerank_enabled=False), es_client) | ||
| 950 | - context = create_request_context(reqid="sku-embed", uid="u-sku-embed") | ||
| 951 | - | ||
| 952 | - monkeypatch.setattr( | ||
| 953 | - "search.searcher.get_tenant_config_loader", | ||
| 954 | - lambda: SimpleNamespace(get_tenant_config=lambda tenant_id: {"index_languages": ["en"]}), | ||
| 955 | - ) | ||
| 956 | - | ||
| 957 | - encoder = _FakeTextEncoder( | ||
| 958 | - { | ||
| 959 | - "linen summer dress": [0.8, 0.2], | ||
| 960 | - "red": [1.0, 0.0], | ||
| 961 | - "blue": [0.0, 1.0], | ||
| 962 | - } | ||
| 963 | - ) | ||
| 964 | - | ||
| 965 | - class _EmbeddingQueryParser: | ||
| 966 | - text_encoder = encoder | ||
| 967 | - | ||
| 968 | - def parse( | ||
| 969 | - self, | ||
| 970 | - query: str, | ||
| 971 | - tenant_id: str, | ||
| 972 | - generate_vector: bool, | ||
| 973 | - context: Any, | ||
| 974 | - target_languages: Any = None, | ||
| 975 | - ): | ||
| 976 | - return _FakeParsedQuery( | ||
| 977 | - original_query=query, | ||
| 978 | - query_normalized=query, | ||
| 979 | - rewritten_query=query, | ||
| 980 | - translations={}, | ||
| 981 | - query_vector=np.array([0.0, 1.0], dtype=np.float32), | ||
| 982 | - style_intent_profile=_build_style_intent_profile( | ||
| 983 | - "color", "blue", "color", "colors", "颜色" | ||
| 984 | - ), | ||
| 985 | - ) | ||
| 986 | - | ||
| 987 | - searcher.query_parser = _EmbeddingQueryParser() | ||
| 988 | - | ||
| 989 | - def _full_source_with_skus(doc_id: str) -> Dict[str, Any]: | ||
| 990 | - return { | ||
| 991 | - "spu_id": doc_id, | ||
| 992 | - "title": {"en": f"product-{doc_id}"}, | ||
| 993 | - "brief": {"en": f"brief-{doc_id}"}, | ||
| 994 | - "vendor": {"en": f"vendor-{doc_id}"}, | ||
| 995 | - "option1_name": "Color", | ||
| 996 | - "image_url": "https://img/default.jpg", | ||
| 997 | - "skus": [ | ||
| 998 | - {"sku_id": "sku-red", "option1_value": "Red", "image_src": "https://img/red.jpg"}, | ||
| 999 | - {"sku_id": "sku-blue", "option1_value": "Blue", "image_src": "https://img/blue.jpg"}, | ||
| 1000 | - ], | ||
| 1001 | - } | ||
| 1002 | - | ||
| 1003 | - monkeypatch.setattr(_FakeESClient, "_full_source", staticmethod(_full_source_with_skus)) | ||
| 1004 | - | ||
| 1005 | - result = searcher.search( | ||
| 1006 | - query="linen summer dress", | ||
| 1007 | - tenant_id="162", | ||
| 1008 | - from_=0, | ||
| 1009 | - size=1, | ||
| 1010 | - context=context, | ||
| 1011 | - enable_rerank=False, | ||
| 1012 | - ) | ||
| 1013 | - | ||
| 1014 | - assert len(result.results) == 1 | ||
| 1015 | - assert result.results[0].skus[0].sku_id == "sku-blue" | ||
| 1016 | - assert result.results[0].image_url == "https://img/blue.jpg" | ||
| 1017 | - | ||
| 1018 | - | ||
| 1019 | def test_searcher_debug_info_uses_initial_es_max_score_for_normalization(monkeypatch): | 941 | def test_searcher_debug_info_uses_initial_es_max_score_for_normalization(monkeypatch): |
| 1020 | es_client = _FakeESClient(total_hits=3) | 942 | es_client = _FakeESClient(total_hits=3) |
| 1021 | cfg = _build_search_config(rerank_enabled=False) | 943 | cfg = _build_search_config(rerank_enabled=False) |
tests/test_sku_intent_selector.py
| 1 | from types import SimpleNamespace | 1 | from types import SimpleNamespace |
| 2 | 2 | ||
| 3 | +import pytest | ||
| 4 | + | ||
| 3 | from config import QueryConfig | 5 | from config import QueryConfig |
| 4 | from query.style_intent import DetectedStyleIntent, StyleIntentProfile, StyleIntentRegistry | 6 | from query.style_intent import DetectedStyleIntent, StyleIntentProfile, StyleIntentRegistry |
| 5 | from search.sku_intent_selector import StyleSkuSelector | 7 | from search.sku_intent_selector import StyleSkuSelector |
| @@ -57,7 +59,9 @@ def test_style_sku_selector_matches_first_sku_by_attribute_terms(): | @@ -57,7 +59,9 @@ def test_style_sku_selector_matches_first_sku_by_attribute_terms(): | ||
| 57 | 59 | ||
| 58 | assert decision.selected_sku_id == "2" | 60 | assert decision.selected_sku_id == "2" |
| 59 | assert decision.selected_text == "Navy Blue X-Large" | 61 | assert decision.selected_text == "Navy Blue X-Large" |
| 60 | - assert decision.matched_stage == "text" | 62 | + assert decision.final_source == "option" |
| 63 | + assert decision.matched_sources == {"color": "option", "size": "option"} | ||
| 64 | + assert decision.matched_stage == "option" # back-compat alias | ||
| 61 | 65 | ||
| 62 | selector.apply_precomputed_decisions(hits, decisions) | 66 | selector.apply_precomputed_decisions(hits, decisions) |
| 63 | 67 | ||
| @@ -103,7 +107,7 @@ def test_style_sku_selector_returns_no_match_without_attribute_contains(): | @@ -103,7 +107,7 @@ def test_style_sku_selector_returns_no_match_without_attribute_contains(): | ||
| 103 | decisions = selector.prepare_hits(hits, parsed_query) | 107 | decisions = selector.prepare_hits(hits, parsed_query) |
| 104 | 108 | ||
| 105 | assert decisions["spu-1"].selected_sku_id is None | 109 | assert decisions["spu-1"].selected_sku_id is None |
| 106 | - assert decisions["spu-1"].matched_stage == "no_match" | 110 | + assert decisions["spu-1"].final_source == "none" |
| 107 | 111 | ||
| 108 | 112 | ||
| 109 | def test_is_text_match_uses_token_boundaries_for_sizes(): | 113 | def test_is_text_match_uses_token_boundaries_for_sizes(): |
| @@ -195,3 +199,341 @@ def test_is_text_match_handles_punctuation_and_descriptive_attribute_values(): | @@ -195,3 +199,341 @@ def test_is_text_match_handles_punctuation_and_descriptive_attribute_values(): | ||
| 195 | assert selector._is_text_match("style", selection_context, normalized_value="off-white/lined") | 199 | assert selector._is_text_match("style", selection_context, normalized_value="off-white/lined") |
| 196 | assert selector._is_text_match("accessory", selection_context, normalized_value="army green + headscarf") | 200 | assert selector._is_text_match("accessory", selection_context, normalized_value="army green + headscarf") |
| 197 | assert selector._is_text_match("size", selection_context, normalized_value="2xl recommended 65-70kg") | 201 | assert selector._is_text_match("size", selection_context, normalized_value="2xl recommended 65-70kg") |
| 202 | + | ||
| 203 | + | ||
| 204 | +def _khaki_intent() -> DetectedStyleIntent: | ||
| 205 | + """Mirrors what StyleIntentDetector now emits (all_terms union of zh/en/attribute).""" | ||
| 206 | + return DetectedStyleIntent( | ||
| 207 | + intent_type="color", | ||
| 208 | + canonical_value="beige", | ||
| 209 | + matched_term="卡其色", | ||
| 210 | + matched_query_text="卡其色", | ||
| 211 | + attribute_terms=("beige", "khaki"), | ||
| 212 | + dimension_aliases=("color", "颜色"), | ||
| 213 | + all_terms=("米色", "卡其色", "beige", "khaki"), | ||
| 214 | + ) | ||
| 215 | + | ||
| 216 | + | ||
| 217 | +def _color_registry() -> StyleIntentRegistry: | ||
| 218 | + return StyleIntentRegistry.from_query_config( | ||
| 219 | + QueryConfig( | ||
| 220 | + style_intent_terms={ | ||
| 221 | + "color": [ | ||
| 222 | + { | ||
| 223 | + "en_terms": ["beige", "khaki"], | ||
| 224 | + "zh_terms": ["米色", "卡其色"], | ||
| 225 | + "attribute_terms": ["beige", "khaki"], | ||
| 226 | + } | ||
| 227 | + ], | ||
| 228 | + }, | ||
| 229 | + style_intent_dimension_aliases={"color": ["color", "颜色"]}, | ||
| 230 | + ) | ||
| 231 | + ) | ||
| 232 | + | ||
| 233 | + | ||
| 234 | +def test_zh_color_intent_matches_noisy_option_value(): | ||
| 235 | + """Query 卡其色裙子 (khaki dress): a SKU whose option1_value starts with 卡其色 but carries a V-neck suffix must still match.""" | ||
| 236 | + selector = StyleSkuSelector(_color_registry()) | ||
| 237 | + parsed_query = SimpleNamespace( | ||
| 238 | + style_intent_profile=StyleIntentProfile(intents=(_khaki_intent(),)) | ||
| 239 | + ) | ||
| 240 | + hits = [ | ||
| 241 | + { | ||
| 242 | + "_id": "spu-1", | ||
| 243 | + "_source": { | ||
| 244 | + "option1_name": "颜色", | ||
| 245 | + "skus": [ | ||
| 246 | + {"sku_id": "1", "option1_value": "黑色长裙"}, | ||
| 247 | + {"sku_id": "2", "option1_value": "卡其色v领收腰长裙【常规款】"}, | ||
| 248 | + ], | ||
| 249 | + }, | ||
| 250 | + } | ||
| 251 | + ] | ||
| 252 | + decisions = selector.prepare_hits(hits, parsed_query) | ||
| 253 | + assert decisions["spu-1"].selected_sku_id == "2" | ||
| 254 | + assert decisions["spu-1"].final_source == "option" | ||
| 255 | + | ||
| 256 | + | ||
| 257 | +@pytest.mark.parametrize( | ||
| 258 | + "option_value", | ||
| 259 | + [ | ||
| 260 | + "卡其色(无内衬)",  # full-width parentheses | ||
| 261 | + "卡其色(无内衬)",  # half-width parentheses | ||
| 262 | + "卡其色【常规款】", | ||
| 263 | + "卡其色/常规款", | ||
| 264 | + "卡其色·无内衬", | ||
| 265 | + "卡其色 - 常规", | ||
| 266 | + "卡其色,常规", | ||
| 267 | + "卡其色|常规", | ||
| 268 | + "卡其色—加厚", | ||
| 269 | + ], | ||
| 270 | +) | ||
| 271 | +def test_zh_color_intent_matches_various_brackets_and_separators(option_value: str): | ||
| 272 | + selector = StyleSkuSelector(_color_registry()) | ||
| 273 | + parsed_query = SimpleNamespace( | ||
| 274 | + style_intent_profile=StyleIntentProfile(intents=(_khaki_intent(),)) | ||
| 275 | + ) | ||
| 276 | + hits = [ | ||
| 277 | + { | ||
| 278 | + "_id": "spu-1", | ||
| 279 | + "_source": { | ||
| 280 | + "option1_name": "颜色", | ||
| 281 | + "skus": [ | ||
| 282 | + {"sku_id": "441670", "option1_value": "白色(无内衬)"}, | ||
| 283 | + {"sku_id": "441679", "option1_value": option_value}, | ||
| 284 | + ], | ||
| 285 | + }, | ||
| 286 | + } | ||
| 287 | + ] | ||
| 288 | + assert selector.prepare_hits(hits, parsed_query)["spu-1"].selected_sku_id == "441679" | ||
| 289 | + | ||
| 290 | + | ||
| 291 | +def test_zh_color_intent_matches_noisy_option_value_with_fullwidth_parens(): | ||
| 292 | + """卡其色(无内衬) is a field reproduction of the earlier taxonomy-override bug; the option branch must now match.""" | ||
| 293 | + selector = StyleSkuSelector(_color_registry()) | ||
| 294 | + parsed_query = SimpleNamespace( | ||
| 295 | + style_intent_profile=StyleIntentProfile(intents=(_khaki_intent(),)) | ||
| 296 | + ) | ||
| 297 | + hits = [ | ||
| 298 | + { | ||
| 299 | + "_id": "spu-1", | ||
| 300 | + "_source": { | ||
| 301 | + "option1_name": "颜色", | ||
| 302 | + # Even if SPU-level taxonomy exists, the white SKU must NOT leak in. | ||
| 303 | + "enriched_taxonomy_attributes": [ | ||
| 304 | + {"name": "Color", "value": {"zh": "卡其色"}} | ||
| 305 | + ], | ||
| 306 | + "skus": [ | ||
| 307 | + {"sku_id": "441670", "option1_value": "白色(无内衬)"}, | ||
| 308 | + {"sku_id": "441679", "option1_value": "卡其色(无内衬)"}, | ||
| 309 | + ], | ||
| 310 | + }, | ||
| 311 | + } | ||
| 312 | + ] | ||
| 313 | + decisions = selector.prepare_hits(hits, parsed_query) | ||
| 314 | + d = decisions["spu-1"] | ||
| 315 | + assert d.selected_sku_id == "441679" | ||
| 316 | + assert d.selected_text == "卡其色(无内衬)" | ||
| 317 | + assert d.final_source == "option" | ||
| 318 | + assert d.matched_sources == {"color": "option"} | ||
| 319 | + | ||
| 320 | + | ||
| 321 | +def test_taxonomy_attribute_extends_text_matching_source(): | ||
| 322 | + """Even when optionN cannot distinguish SKUs, a Color in enriched_taxonomy_attributes lets every SKU of the | ||
| 323 | + SPU pass text matching; an image pick (if any) then decides the concrete SKU, otherwise the first one wins.""" | ||
| 324 | + selector = StyleSkuSelector(_color_registry()) | ||
| 325 | + parsed_query = SimpleNamespace( | ||
| 326 | + style_intent_profile=StyleIntentProfile(intents=(_khaki_intent(),)) | ||
| 327 | + ) | ||
| 328 | + hits = [ | ||
| 329 | + { | ||
| 330 | + "_id": "spu-1", | ||
| 331 | + "_source": { | ||
| 332 | + "option1_name": "Style", # unrelated dimension → slot unresolved | ||
| 333 | + "enriched_taxonomy_attributes": [ | ||
| 334 | + {"name": "Color", "value": {"zh": "卡其色", "en": "khaki"}} | ||
| 335 | + ], | ||
| 336 | + "skus": [ | ||
| 337 | + {"sku_id": "a", "option1_value": "A"}, | ||
| 338 | + {"sku_id": "b", "option1_value": "B"}, | ||
| 339 | + ], | ||
| 340 | + }, | ||
| 341 | + } | ||
| 342 | + ] | ||
| 343 | + decisions = selector.prepare_hits(hits, parsed_query) | ||
| 344 | + # Taxonomy matches → both SKUs text-matched; no image pick → first one wins. | ||
| 345 | + d = decisions["spu-1"] | ||
| 346 | + assert d.selected_sku_id == "a" | ||
| 347 | + assert d.final_source == "taxonomy" | ||
| 348 | + # selected_text reflects the real matched taxonomy value, not SKU's unrelated option. | ||
| 349 | + assert d.selected_text == "卡其色" | ||
| 350 | + assert d.matched_sources == {"color": "taxonomy"} | ||
| 351 | + | ||
| 352 | + | ||
| 353 | +def test_taxonomy_does_not_override_contradicting_sku_option_value(): | ||
| 354 | + """SPU-level taxonomy says 卡其色 (khaki), but the SKU's own option1_value is 白色(无内衬) (white, unlined); | ||
| 355 | + that SKU must not count as a text hit, so the SPU-level signal cannot promote a wrong-color SKU.""" | ||
| 356 | + selector = StyleSkuSelector(_color_registry()) | ||
| 357 | + parsed_query = SimpleNamespace( | ||
| 358 | + style_intent_profile=StyleIntentProfile(intents=(_khaki_intent(),)) | ||
| 359 | + ) | ||
| 360 | + hits = [ | ||
| 361 | + { | ||
| 362 | + "_id": "spu-1", | ||
| 363 | + "_source": { | ||
| 364 | + "option1_name": "颜色", | ||
| 365 | + "enriched_taxonomy_attributes": [ | ||
| 366 | + {"name": "Color", "value": {"zh": "卡其色", "en": "khaki"}} | ||
| 367 | + ], | ||
| 368 | + "skus": [ | ||
| 369 | + {"sku_id": "white", "option1_value": "白色(无内衬)"}, | ||
| 370 | + {"sku_id": "khaki", "option1_value": "卡其色棉"}, | ||
| 371 | + ], | ||
| 372 | + }, | ||
| 373 | + } | ||
| 374 | + ] | ||
| 375 | + decisions = selector.prepare_hits(hits, parsed_query) | ||
| 376 | + # Only the khaki SKU's own value matches; taxonomy must not promote the white SKU. | ||
| 377 | + assert decisions["spu-1"].selected_sku_id == "khaki" | ||
| 378 | + assert decisions["spu-1"].final_source == "option" | ||
| 379 | + | ||
| 380 | + | ||
| 381 | +def test_taxonomy_fills_in_only_when_sku_self_value_is_empty(): | ||
| 382 | + """Mixed case: SKU 1 has no option1_value, so taxonomy takes over; SKU 2 carries its own 白色 (white), so it does not match.""" | ||
| 383 | + selector = StyleSkuSelector(_color_registry()) | ||
| 384 | + parsed_query = SimpleNamespace( | ||
| 385 | + style_intent_profile=StyleIntentProfile(intents=(_khaki_intent(),)) | ||
| 386 | + ) | ||
| 387 | + hits = [ | ||
| 388 | + { | ||
| 389 | + "_id": "spu-1", | ||
| 390 | + "_source": { | ||
| 391 | + "option1_name": "颜色", | ||
| 392 | + "enriched_taxonomy_attributes": [ | ||
| 393 | + {"name": "Color", "value": {"zh": "卡其色"}} | ||
| 394 | + ], | ||
| 395 | + "skus": [ | ||
| 396 | + {"sku_id": "no-value", "option1_value": ""}, | ||
| 397 | + {"sku_id": "white", "option1_value": "白色"}, | ||
| 398 | + ], | ||
| 399 | + }, | ||
| 400 | + } | ||
| 401 | + ] | ||
| 402 | + decisions = selector.prepare_hits(hits, parsed_query) | ||
| 403 | + d = decisions["spu-1"] | ||
| 404 | + assert d.selected_sku_id == "no-value" | ||
| 405 | + assert d.final_source == "taxonomy" | ||
| 406 | + assert d.selected_text == "卡其色" | ||
| 407 | + | ||
| 408 | + | ||
| 409 | +def test_image_pick_serves_as_visual_tiebreak_within_text_matched(): | ||
| 410 | + selector = StyleSkuSelector(_color_registry()) | ||
| 411 | + parsed_query = SimpleNamespace( | ||
| 412 | + style_intent_profile=StyleIntentProfile(intents=(_khaki_intent(),)) | ||
| 413 | + ) | ||
| 414 | + hits = [ | ||
| 415 | + { | ||
| 416 | + "_id": "spu-1", | ||
| 417 | + "_source": { | ||
| 418 | + "option1_name": "颜色", | ||
| 419 | + "skus": [ | ||
| 420 | + { | ||
| 421 | + "sku_id": "khaki-cotton", | ||
| 422 | + "option1_value": "卡其色棉", | ||
| 423 | + "image_src": "https://cdn/x/khaki-cotton.jpg", | ||
| 424 | + }, | ||
| 425 | + { | ||
| 426 | + "sku_id": "khaki-linen", | ||
| 427 | + "option1_value": "卡其色麻", | ||
| 428 | + "image_src": "https://cdn/x/khaki-linen.jpg", | ||
| 429 | + }, | ||
| 430 | + ], | ||
| 431 | + }, | ||
| 432 | + "inner_hits": { | ||
| 433 | + "exact_image_knn_query_hits": { | ||
| 434 | + "hits": { | ||
| 435 | + "hits": [ | ||
| 436 | + { | ||
| 437 | + "_score": 0.87, | ||
| 438 | + "_source": {"url": "https://cdn/x/khaki-linen.jpg"}, | ||
| 439 | + } | ||
| 440 | + ] | ||
| 441 | + } | ||
| 442 | + } | ||
| 443 | + }, | ||
| 444 | + } | ||
| 445 | + ] | ||
| 446 | + decisions = selector.prepare_hits(hits, parsed_query) | ||
| 447 | + decision = decisions["spu-1"] | ||
| 448 | + assert decision.selected_sku_id == "khaki-linen" | ||
| 449 | + assert decision.final_source == "option" | ||
| 450 | + assert decision.image_pick_sku_id == "khaki-linen" | ||
| 451 | + assert decision.image_pick_score == 0.87 | ||
| 452 | + | ||
| 453 | + | ||
| 454 | +def test_image_only_selection_when_no_style_intent(): | ||
| 455 | + """No style intent: pick the nearest-neighbor SKU by image_embedding alone and promote that SKU directly.""" | ||
| 456 | + selector = StyleSkuSelector(_color_registry()) | ||
| 457 | + parsed_query = SimpleNamespace(style_intent_profile=None) | ||
| 458 | + hits = [ | ||
| 459 | + { | ||
| 460 | + "_id": "spu-1", | ||
| 461 | + "_source": { | ||
| 462 | + "option1_name": "Color", | ||
| 463 | + "image_url": "https://cdn/x/default.jpg", | ||
| 464 | + "skus": [ | ||
| 465 | + { | ||
| 466 | + "sku_id": "red", | ||
| 467 | + "option1_value": "Red", | ||
| 468 | + "image_src": "https://cdn/x/red.jpg", | ||
| 469 | + }, | ||
| 470 | + { | ||
| 471 | + "sku_id": "blue", | ||
| 472 | + "option1_value": "Blue", | ||
| 473 | + "image_src": "https://cdn/x/blue.jpg", | ||
| 474 | + }, | ||
| 475 | + ], | ||
| 476 | + }, | ||
| 477 | + "inner_hits": { | ||
| 478 | + "image_knn_query_hits": { | ||
| 479 | + "hits": { | ||
| 480 | + "hits": [ | ||
| 481 | + {"_score": 0.74, "_source": {"url": "https://cdn/x/blue.jpg"}} | ||
| 482 | + ] | ||
| 483 | + } | ||
| 484 | + } | ||
| 485 | + }, | ||
| 486 | + } | ||
| 487 | + ] | ||
| 488 | + decisions = selector.prepare_hits(hits, parsed_query) | ||
| 489 | + decision = decisions["spu-1"] | ||
| 490 | + assert decision.selected_sku_id == "blue" | ||
| 491 | + assert decision.final_source == "image" | ||
| 492 | + | ||
| 493 | + selector.apply_precomputed_decisions(hits, decisions) | ||
| 494 | + source = hits[0]["_source"] | ||
| 495 | + assert source["skus"][0]["sku_id"] == "blue" | ||
| 496 | + assert source["image_url"] == "https://cdn/x/blue.jpg" | ||
| 497 | + | ||
| 498 | + | ||
| 499 | +def test_image_pick_ignored_when_text_matches_but_visual_url_not_in_text_set(): | ||
| 500 | + """Text hits come first: if the image pick lands on a SKU outside the text-matched set, it does not take over.""" | ||
| 501 | + selector = StyleSkuSelector(_color_registry()) | ||
| 502 | + parsed_query = SimpleNamespace( | ||
| 503 | + style_intent_profile=StyleIntentProfile(intents=(_khaki_intent(),)) | ||
| 504 | + ) | ||
| 505 | + hits = [ | ||
| 506 | + { | ||
| 507 | + "_id": "spu-1", | ||
| 508 | + "_source": { | ||
| 509 | + "option1_name": "颜色", | ||
| 510 | + "skus": [ | ||
| 511 | + { | ||
| 512 | + "sku_id": "khaki", | ||
| 513 | + "option1_value": "卡其色", | ||
| 514 | + "image_src": "https://cdn/x/khaki.jpg", | ||
| 515 | + }, | ||
| 516 | + { | ||
| 517 | + "sku_id": "black", | ||
| 518 | + "option1_value": "黑色", | ||
| 519 | + "image_src": "https://cdn/x/black.jpg", | ||
| 520 | + }, | ||
| 521 | + ], | ||
| 522 | + }, | ||
| 523 | + "inner_hits": { | ||
| 524 | + "exact_image_knn_query_hits": { | ||
| 525 | + "hits": { | ||
| 526 | + "hits": [ | ||
| 527 | + {"_score": 0.9, "_source": {"url": "https://cdn/x/black.jpg"}} | ||
| 528 | + ] | ||
| 529 | + } | ||
| 530 | + } | ||
| 531 | + }, | ||
| 532 | + } | ||
| 533 | + ] | ||
| 534 | + decisions = selector.prepare_hits(hits, parsed_query) | ||
| 535 | + decision = decisions["spu-1"] | ||
| 536 | + # Hard text-first: khaki stays, though image pointed at black. | ||
| 537 | + assert decision.selected_sku_id == "khaki" | ||
| 538 | + assert decision.final_source == "option" | ||
| 539 | + assert decision.image_pick_sku_id == "black" |
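
The parametrized bracket/separator tests above all reduce to one normalization step before matching. The helper below is a hedged sketch of that idea only; the real logic lives in `StyleSkuSelector._is_text_match` and `_with_segment_boundaries_for_matching`, which are not shown in this diff, and the exact separator set and tokenizer are assumptions here: replace full/half-width brackets and common e-commerce separators with spaces, split into tokens, then accept either an exact token hit or a substring hit inside a pure-CJK token (the 卡其色棉 fallback).

```python
import re

# Hedged sketch, not the repo implementation. Separator set is an assumption
# covering the cases exercised by the parametrized tests: full/half-width
# parens, 【】, slashes, 、, ,, middle dots, hyphen/en/em dashes, |, +, ~.
_SEPARATORS = re.compile(r"[()()\[\]【】{}/\\|,,、·・\-–—~~++\s]+")
_PURE_CJK = re.compile(r"^[\u4e00-\u9fff]+$")


def matches_intent_term(option_value: str, term: str) -> bool:
    """True if `term` survives segmentation of `option_value` as a token,
    or as a substring of an unsegmented pure-CJK token."""
    tokens = [t for t in _SEPARATORS.split(option_value.lower()) if t]
    for token in tokens:
        if token == term:
            return True  # exact token hit, e.g. "khaki" in "khaki / regular"
        if _PURE_CJK.match(token) and term in token:
            return True  # CJK substring fallback, e.g. 卡其色 inside 卡其色棉
    return False
```

Under these assumptions, 卡其色(无内衬), 卡其色【常规款】, and 卡其色棉 all match the intent term 卡其色, while 白色(无内衬) does not, mirroring the assertions in the tests above.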