Commit a32754685e174673aa2918611e14a7eea7427887

Authored by tangwang
1 parent 984f14f9

Removed the local `/indexer/enrich-content` implementation from this repository, along with the implicit dependencies on it in the main indexer pipeline.

Core code-side changes:
- Deleted `indexer/product_enrich.py`, `indexer/product_enrich_prompts.py`, and the related unit tests.
- Removed the `/indexer/enrich-content` route in [api/routes/indexer.py](/data/saas-search/api/routes/indexer.py:55); the path now returns `404` from this repository's indexer service, and the corresponding contract test has been changed to verify the removal: [tests/ci/test_service_api_contracts.py](/data/saas-search/tests/ci/test_service_api_contracts.py:345).
- Dropped the local LLM enrichment logic that auto-filled `qanchors` / `enriched_*` during doc construction in [api/routes/indexer.py](/data/saas-search/api/routes/indexer.py:183), [indexer/document_transformer.py](/data/saas-search/indexer/document_transformer.py:109), [indexer/incremental_service.py](/data/saas-search/indexer/incremental_service.py:587), and [indexer/spu_transformer.py](/data/saas-search/indexer/spu_transformer.py:223). `build-docs` / `reindex` / `index` are now responsible for basic document construction only.
- Cleaned out the legacy `product_enrich` and anchor-cache configuration surface in [config/schema.py](/data/saas-search/config/schema.py:316), [config/loader.py](/data/saas-search/config/loader.py:824), [config/env_config.py](/data/saas-search/config/env_config.py:37), and [config/config.yaml](/data/saas-search/config/config.yaml:32).
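Since `build-docs` now returns base documents only, an upstream indexer that still needs `qanchors` / `enriched_*` has to fetch them from the external content-understanding service and merge them in itself. A minimal sketch of that merge step (the `merge_enrichment` helper is illustrative, and the external service's result shape is assumed to mirror the removed route's `results` items, keyed by `spu_id`):

```python
from typing import Any, Dict, List

# Fields that used to be filled inside build-docs and are now owned by the
# external content-understanding service.
ENRICH_FIELDS = ("qanchors", "enriched_attributes", "enriched_tags",
                 "enriched_taxonomy_attributes")


def merge_enrichment(docs: List[Dict[str, Any]],
                     enrich_results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Overlay external enrichment results onto base docs, matched by spu_id."""
    by_spu = {r["spu_id"]: r for r in enrich_results}
    for doc in docs:
        result = by_spu.get(doc.get("spu_id"))
        if not result:
            continue  # doc stays base-only; enrichment can be backfilled later
        for field in ENRICH_FIELDS:
            if field in result:
                doc[field] = result[field]
    return docs


# Example: one base doc from build-docs plus one enrichment result.
docs = [{"spu_id": "223167", "title": "纯棉短袖T恤"}]
results = [{"spu_id": "223167", "qanchors": {"zh": "短袖,T恤"},
            "enriched_tags": ["cotton"]}]
merged = merge_enrichment(docs, results)
print(merged[0]["qanchors"])  # → {'zh': '短袖,T恤'}
```

Docs without a matching enrichment result are left untouched, which matches the intended split: base construction here, field generation in the external project, merging in the upstream writer.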

Key documentation has been updated in the same pass, chiefly to state explicitly that the capability has migrated out and this repository no longer generates these fields:
- [README.md](/data/saas-search/README.md:113)
- [docs/搜索API对接指南-00-总览与快速开始.md](</data/saas-search/docs/搜索API对接指南-00-总览与快速开始.md:108>)
- [docs/搜索API对接指南-05-索引接口(Indexer).md](</data/saas-search/docs/搜索API对接指南-05-索引接口(Indexer).md:647>)
- [docs/搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md](</data/saas-search/docs/搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md:441>)
- [docs/工作总结-微服务性能优化与架构.md](</data/saas-search/docs/工作总结-微服务性能优化与架构.md:96>)
- [docs/缓存与Redis使用说明.md](</data/saas-search/docs/缓存与Redis使用说明.md:186>)
- [indexer/README.md](/data/saas-search/indexer/README.md:508)
- [indexer/ANCHORS_AND_SEMANTIC_ATTRIBUTES.md](/data/saas-search/indexer/ANCHORS_AND_SEMANTIC_ATTRIBUTES.md:1)

Verification was done in two steps:
- `python3 -m compileall ...` passed
- `source activate.sh && python -m pytest tests/ci/test_service_api_contracts.py -q` passed (`31 passed`)

Documents that I believe still carry stale information but were left untouched for now are mainly historical records, not part of the current integration contract:
- [docs/issues/issue.md](/data/saas-search/docs/issues/issue.md:295)
- [docs/issues/issue.txt](/data/saas-search/docs/issues/issue.txt:468)
- [docs/issues/issue-2026-03-29-索引修改-done-0330.md](</data/saas-search/docs/issues/issue-2026-03-29-索引修改-done-0330.md:23>)
- [docs/issues/issue-2026-04-04-增加多模态标注-TODO.md](</data/saas-search/docs/issues/issue-2026-04-04-增加多模态标注-TODO.md:1>)

Separately, the workspace already contained a modified `.env` and an untracked `AGENTS.md`; I did not touch either.
.env
@@ -4,7 +4,7 @@
 ES_HOST=http://localhost:9200
 ES_USERNAME=saas
 ES_PASSWORD=4hOaLaf41y2VuI8y
-ES_AUTH="${ES_USERNAME}:${ES_PASSWORD}"
+ES_AUTH="saas:4hOaLaf41y2VuI8y"
 
 # Redis Configuration (Optional) - AI 生产 10.200.16.14:6479
 REDIS_HOST=10.200.16.14
AGENTS.md 0 → 100644
@@ -0,0 +1,17 @@
+# FacetAwareMatching 协作记忆
+
+## 开发原则
+
+默认遵循以下错误处理原则:
+
+- 对于代码缺陷、逻辑疏漏、配置或资源缺失、违反统一约定等由自身原因导致的错误,应尽早暴露、快速失败,不做回退或容错处理,以保持代码精简、清晰、统一。
+- 对于线上超时、第三方接口异常等不可预见的外部错误,应提供必要的兜底、回退、重试或其他容错措施,以保证系统稳定性和业务连续性。
+- 进行功能迭代或重构时,默认直接面向最终方案和最优设计实现,不主动为历史实现、旧数据、过渡状态或遗留调用方式做兼容;优先推动代码回到统一约定和一致模型,避免长期并存的双轨逻辑、分支特判和临时过渡层。
+
+## 落地要求
+
+- 不要用静默吞错、默认值掩盖、隐式降级等方式隐藏内部问题。
+- 发现内部前置条件不满足时,应优先抛错、失败并暴露上下文。
+- 设计容错逻辑时,应明确区分“内部错误”和“外部错误”,避免把内部问题包装成可忽略事件。
+- 新设计一旦确定,应优先整体替换旧约定,而不是通过兼容旧行为来维持表面稳定。
+- 除非有明确、必要的外部兼容性约束,否则不要为内部历史包袱保留额外分支。
README.md
@@ -110,7 +110,7 @@ source activate.sh
 | `搜索API对接指南-01-搜索接口.md` | `POST /search/` 请求与响应 |
 | `搜索API对接指南-02-搜索建议与即时搜索.md` | 建议 / 即时搜索 |
 | `搜索API对接指南-03-获取文档.md` | `GET /search/{doc_id}` |
-| `搜索API对接指南-05-索引接口(Indexer).md` | 索引与 `build-docs` / `enrich-content` 等 |
+| `搜索API对接指南-05-索引接口(Indexer).md` | 索引与 `build-docs` 等(`enrich-content` 已迁出) |
 | `搜索API对接指南-06-管理接口(Admin).md` | `/admin/*` |
 | `搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md` | 6005/6006/6007/6008 等直连说明 |
 | `搜索API对接指南-08-数据模型与字段速查.md` | 字段与数据模型 |
api/routes/indexer.py
@@ -7,7 +7,7 @@
 import asyncio
 import re
 from fastapi import APIRouter, HTTPException
-from typing import Any, Dict, List, Literal, Optional
+from typing import Any, Dict, List, Optional
 from pydantic import BaseModel, Field
 import logging
 from sqlalchemy import text
@@ -19,11 +19,6 @@ logger = logging.getLogger(__name__)
 
 router = APIRouter(prefix="/indexer", tags=["indexer"])
 
-SUPPORTED_CATEGORY_TAXONOMY_PROFILES = (
-    "apparel, 3c, bags, pet_supplies, electronics, outdoor, "
-    "home_appliances, home_living, wigs, beauty, accessories, toys, shoes, sports, others"
-)
-
 
 class ReindexRequest(BaseModel):
     """全量重建索引请求"""
@@ -64,6 +59,7 @@ class BuildDocsRequest(BaseModel):
     该接口是 Java 等外部索引程序正式使用的“doc 生成接口”:
     - 上游负责:全量 / 增量调度 + 从 MySQL 查询出各表数据
     - 本模块负责:根据配置和算法,将原始行数据转换为与 mappings/search_products.json 一致的 ES 文档
+    - 注意:已迁出的 `/indexer/enrich-content` 内容理解能力不再由本接口内置生成
     """
     tenant_id: str = Field(..., description="租户 ID,用于加载租户配置、语言策略等")
     items: List[BuildDocItem] = Field(..., description="需要构建 doc 的 SPU 列表(含其 SKUs 和 Options)")
@@ -82,55 +78,6 @@ class BuildDocsFromDbRequest(BaseModel):
     spu_ids: List[str] = Field(..., description="需要构建 doc 的 SPU ID 列表")
 
 
-class EnrichContentItem(BaseModel):
-    """单条待生成内容理解字段的商品。"""
-    spu_id: str = Field(..., description="SPU ID")
-    title: str = Field(..., description="商品标题,用于 LLM 分析生成 qanchors / enriched_tags 等")
-    image_url: Optional[str] = Field(None, description="商品主图 URL(预留给多模态/内容理解扩展)")
-    brief: Optional[str] = Field(None, description="商品简介/短描述")
-    description: Optional[str] = Field(None, description="商品详情/长描述")
-
-
-class EnrichContentRequest(BaseModel):
-    """
-    内容理解字段生成请求:根据商品标题批量生成通用增强字段与品类 taxonomy 字段。
-    供外部 indexer 在自行组织 doc 时调用,与翻译、向量化等微服务并列。
-    """
-    tenant_id: str = Field(..., description="租户 ID,用于请求路由与结果归属,不参与缓存键")
-    items: List[EnrichContentItem] = Field(..., description="待分析的 SPU 列表(spu_id + title,可附带 brief/description/image_url)")
-    enrichment_scopes: Optional[List[Literal["generic", "category_taxonomy"]]] = Field(
-        default=None,
-        description=(
-            "要执行的增强范围。"
-            "`generic` 返回 qanchors/enriched_tags/enriched_attributes;"
-            "`category_taxonomy` 返回 enriched_taxonomy_attributes。"
-            "默认两者都执行。"
-        ),
-    )
-    category_taxonomy_profile: str = Field(
-        "apparel",
-        description=(
-            "品类 taxonomy profile。默认 `apparel`。"
-            f"当前支持:{SUPPORTED_CATEGORY_TAXONOMY_PROFILES}。"
-            "其中除 `apparel` 外,其余 profile 的 taxonomy 输出仅返回 `en`。"
-        ),
-    )
-    analysis_kinds: Optional[List[Literal["content", "taxonomy"]]] = Field(
-        default=None,
-        description="Deprecated alias of enrichment_scopes. `content` -> `generic`, `taxonomy` -> `category_taxonomy`.",
-    )
-
-    def resolved_enrichment_scopes(self) -> List[str]:
-        if self.enrichment_scopes:
-            return list(self.enrichment_scopes)
-        if self.analysis_kinds:
-            mapped = []
-            for item in self.analysis_kinds:
-                mapped.append("generic" if item == "content" else "category_taxonomy")
-            return mapped
-        return ["generic", "category_taxonomy"]
-
-
 @router.post("/reindex")
 async def reindex_all(request: ReindexRequest):
     """
@@ -239,8 +186,9 @@ async def build_docs(request: BuildDocsRequest):
 
     使用场景:
    - 上游(例如 Java 索引程序)已经从 MySQL 查询出了 SPU / SKU / Option 等原始行数据
-    - 希望复用本项目的全部“索引富化”能力(多语言、翻译、向量、规格聚合等)
+    - 希望复用本项目当前保留的“索引构建”能力(多语言、翻译、向量、规格聚合等)
     - 只需要拿到与 `mappings/search_products.json` 一致的 doc 列表,由上游自行写入 ES
+    - 如需 `qanchors` / `enriched_attributes` / `enriched_taxonomy_attributes`,请由外部内容理解服务生成后再自行合并
     """
     try:
         if not request.items:
@@ -260,7 +208,6 @@ async def build_docs(request: BuildDocsRequest):
         import pandas as pd
 
         docs: List[Dict[str, Any]] = []
-        doc_spu_rows: List[pd.Series] = []
         failed: List[Dict[str, Any]] = []
 
         for item in request.items:
@@ -276,7 +223,6 @@ async def build_docs(request: BuildDocsRequest):
                 spu_row=spu_row,
                 skus=skus_df,
                 options=options_df,
-                fill_llm_attributes=False,
             )
 
             if doc is None:
@@ -316,7 +262,6 @@ async def build_docs(request: BuildDocsRequest):
                 doc["title_embedding"] = emb0.tolist()
 
             docs.append(doc)
-            doc_spu_rows.append(spu_row)
         except Exception as e:
             failed.append(
                 {
@@ -325,13 +270,6 @@ async def build_docs(request: BuildDocsRequest):
                 }
             )
 
-        # 批量填充 LLM 字段(尽量攒批,每次最多 20 条;失败仅 warning,不影响 build-docs 主功能)
-        try:
-            if docs and doc_spu_rows:
-                transformer.fill_llm_attributes_batch(docs, doc_spu_rows)
-        except Exception as e:
-            logger.warning("Batch LLM fill failed in build-docs (tenant_id=%s): %s", request.tenant_id, e)
-
         return {
             "tenant_id": request.tenant_id,
             "docs": docs,
@@ -476,101 +414,6 @@ async def build_docs_from_db(request: BuildDocsFromDbRequest):
         raise HTTPException(status_code=500, detail=f"Internal server error: {str(e)}")
 
 
-def _run_enrich_content(
-    tenant_id: str,
-    items: List[Dict[str, str]],
-    enrichment_scopes: Optional[List[str]] = None,
-    category_taxonomy_profile: str = "apparel",
-) -> List[Dict[str, Any]]:
-    """
-    同步执行内容理解,返回与 ES mapping 对齐的字段结构。
-    语言策略由 product_enrich 内部统一决定,路由层不参与。
-    """
-    from indexer.product_enrich import build_index_content_fields
-
-    results = build_index_content_fields(
-        items=items,
-        tenant_id=tenant_id,
-        enrichment_scopes=enrichment_scopes,
-        category_taxonomy_profile=category_taxonomy_profile,
-    )
-    return [
-        {
-            "spu_id": item["id"],
-            "qanchors": item["qanchors"],
-            "enriched_attributes": item["enriched_attributes"],
-            "enriched_tags": item["enriched_tags"],
-            "enriched_taxonomy_attributes": item["enriched_taxonomy_attributes"],
-            **({"error": item["error"]} if item.get("error") else {}),
-        }
-        for item in results
-    ]
-
-
-@router.post("/enrich-content")
-async def enrich_content(request: EnrichContentRequest):
-    """
-    内容理解字段生成接口:根据商品标题批量生成通用增强字段与品类 taxonomy 字段。
-
-    使用场景:
-    - 外部 indexer 采用「微服务组合」方式自己组织 doc 时,可调用本接口获取 LLM 生成的
-      锚文本与语义属性,再与翻译、向量化结果合并写入 ES。
-    - 与 /indexer/build-docs 解耦,避免 build-docs 因 LLM 耗时过长而阻塞;调用方可
-      先拿不含 qanchors/enriched_tags/taxonomy attributes 的 doc,再异步或离线补齐本接口结果后更新 ES。
-
-    实现逻辑与 indexer.product_enrich.build_index_content_fields 一致,支持多语言与 Redis 缓存。
-    """
-    try:
-        if not request.items:
-            raise HTTPException(status_code=400, detail="items cannot be empty")
-        if len(request.items) > 50:
-            raise HTTPException(
-                status_code=400,
-                detail="Maximum 50 items per request for enrich-content (LLM batch limit)",
-            )
-
-        items_payload = [
-            {
-                "spu_id": it.spu_id,
-                "title": it.title or "",
-                "brief": it.brief or "",
-                "description": it.description or "",
-                "image_url": it.image_url or "",
-            }
-            for it in request.items
-        ]
-        loop = asyncio.get_event_loop()
-        enrichment_scopes = request.resolved_enrichment_scopes()
-        result = await loop.run_in_executor(
-            None,
-            lambda: _run_enrich_content(
-                tenant_id=request.tenant_id,
-                items=items_payload,
-                enrichment_scopes=enrichment_scopes,
-                category_taxonomy_profile=request.category_taxonomy_profile,
-            ),
-        )
-        return {
-            "tenant_id": request.tenant_id,
-            "enrichment_scopes": enrichment_scopes,
-            "category_taxonomy_profile": request.category_taxonomy_profile,
-            "results": result,
-            "total": len(result),
-        }
-    except HTTPException:
-        raise
-    except RuntimeError as e:
-        if "DASHSCOPE_API_KEY" in str(e) or "cannot call LLM" in str(e).lower():
-            raise HTTPException(
-                status_code=503,
-                detail="Content understanding service unavailable: DASHSCOPE_API_KEY not set",
-            )
-        raise HTTPException(status_code=500, detail=str(e))
-    except Exception as e:
-        logger.error(f"Error in enrich-content for tenant_id={request.tenant_id}: {e}", exc_info=True)
-        raise HTTPException(status_code=500, detail=f"Internal server error: {str(e)}")
-
-
 @router.post("/documents")
 async def get_documents(request: GetDocumentsRequest):
     """
config/config.yaml
@@ -38,8 +38,6 @@ infrastructure:
     retry_on_timeout: false
     cache_expire_days: 720
     embedding_cache_prefix: embedding
-    anchor_cache_prefix: product_anchors
-    anchor_cache_expire_days: 30
   database:
     host: null
     port: 3306
@@ -60,10 +58,6 @@ indexes: []
 assets:
   query_rewrite_dictionary_path: config/dictionaries/query_rewrite.dict
 
-# Product content understanding (LLM enrich-content) configuration
-product_enrich:
-  max_workers: 40
-
 # 离线 / Web 相关性评估(scripts/evaluation、eval-web)
 # CLI 未显式传参时使用此处默认值;search_base_url 未配置时自动为 http://127.0.0.1:{runtime.api_port}
 search_evaluation:
config/env_config.py
@@ -46,8 +46,6 @@ def _redis_dict() -> Dict[str, Any]:
         "retry_on_timeout": cfg.retry_on_timeout,
         "cache_expire_days": cfg.cache_expire_days,
         "embedding_cache_prefix": cfg.embedding_cache_prefix,
-        "anchor_cache_prefix": cfg.anchor_cache_prefix,
-        "anchor_cache_expire_days": cfg.anchor_cache_expire_days,
     }
 
 
config/loader.py
@@ -38,7 +38,6 @@ from config.schema import (
     IndexConfig,
     InfrastructureConfig,
     QueryConfig,
-    ProductEnrichConfig,
     RedisSettings,
     RerankConfig,
     RerankFusionConfig,
@@ -260,10 +259,6 @@ class AppConfigLoader:
         runtime_config = self._build_runtime_config()
         infrastructure_config = self._build_infrastructure_config(runtime_config.environment)
 
-        product_enrich_raw = raw.get("product_enrich") if isinstance(raw.get("product_enrich"), dict) else {}
-        product_enrich_config = ProductEnrichConfig(
-            max_workers=int(product_enrich_raw.get("max_workers", 40)),
-        )
         search_evaluation_config = self._build_search_evaluation_config(raw, runtime_config)
 
         metadata = ConfigMetadata(
@@ -275,7 +270,6 @@ class AppConfigLoader:
         app_config = AppConfig(
             runtime=runtime_config,
             infrastructure=infrastructure_config,
-            product_enrich=product_enrich_config,
             search=search_config,
             services=services_config,
             tenants=tenants_config,
@@ -288,7 +282,6 @@ class AppConfigLoader:
         return AppConfig(
             runtime=app_config.runtime,
             infrastructure=app_config.infrastructure,
-            product_enrich=app_config.product_enrich,
             search=app_config.search,
             services=app_config.services,
             tenants=app_config.tenants,
@@ -838,8 +831,6 @@ class AppConfigLoader:
                 retry_on_timeout=os.getenv("REDIS_RETRY_ON_TIMEOUT", "false").strip().lower() == "true",
                 cache_expire_days=int(os.getenv("REDIS_CACHE_EXPIRE_DAYS", 360 * 2)),
                 embedding_cache_prefix=os.getenv("REDIS_EMBEDDING_CACHE_PREFIX", "embedding"),
-                anchor_cache_prefix=os.getenv("REDIS_ANCHOR_CACHE_PREFIX", "product_anchors"),
-                anchor_cache_expire_days=int(os.getenv("REDIS_ANCHOR_CACHE_EXPIRE_DAYS", 30)),
             ),
             database=DatabaseSettings(
                 host=os.getenv("DB_HOST"),
config/schema.py
@@ -323,8 +323,6 @@ class RedisSettings:
     retry_on_timeout: bool = False
     cache_expire_days: int = 720
     embedding_cache_prefix: str = "embedding"
-    anchor_cache_prefix: str = "product_anchors"
-    anchor_cache_expire_days: int = 30
 
 
 @dataclass(frozen=True)
@@ -351,13 +349,6 @@ class InfrastructureConfig:
 
 
 @dataclass(frozen=True)
-class ProductEnrichConfig:
-    """Configuration for LLM-based product content understanding (enrich-content)."""
-
-    max_workers: int = 40
-
-
-@dataclass(frozen=True)
 class RuntimeConfig:
     environment: str = "prod"
     index_namespace: str = ""
@@ -430,7 +421,6 @@ class AppConfig:
 
     runtime: RuntimeConfig
     infrastructure: InfrastructureConfig
-    product_enrich: ProductEnrichConfig
     search: SearchConfig
     services: ServicesConfig
     tenants: TenantCatalogConfig
docs/工作总结-微服务性能优化与架构.md
@@ -93,19 +93,15 @@ instruction: "Given a shopping query, rank product titles by relevance"
 
 ---
 
-### 5. 内容理解字段(支撑 Suggest)
+### 5. 内容理解字段(已迁出)
 
-**能力**:支持根据商品标题批量生成 **qanchors**(锚文本)、**enriched_attributes**、**tags**,供索引与 suggest 使用。
+`qanchors`、`enriched_attributes`、`enriched_tags`、`enriched_taxonomy_attributes` 这些字段模型仍保留在索引结构里,`suggestion/builder.py` 等消费侧也仍可继续使用 ES 中已有的 `qanchors`。但字段生成服务与其本地实现已经迁移到独立项目,本仓库不再提供 `/indexer/enrich-content`,也不再在 indexer 构建链路内自动补齐这些字段。
 
-**具体内容**:
-- **接口**:`POST /indexer/enrich-content`(FacetAwareMatching 服务端口 **6001**)。请求体为 `items` 数组,每项含 `spu_id`、`title`(必填)及可选多语言标题等;单次请求最多 **50 条**,建议批量调用。响应 `results` 与 `items` 一一对应,每项含 `spu_id`、`qanchors`(按语言键,如 `qanchors.zh`、`qanchors.en`,逗号分隔短语)、`enriched_attributes`、`tags`。
-- **索引侧**:微服务组合方式下,调用方先拿不含 qanchors/tags 的 doc,再调用本接口补齐后写入 ES 的 `qanchors.{lang}` 等字段;索引 transformer(`indexer/document_transformer.py`、`indexer/product_enrich.py`)内也可在构建 doc 时调用内容理解逻辑,写入 `qanchors.{lang}`。
-- **Suggest 侧**:`suggestion/builder.py` 从 ES 商品索引读取 `_source: ["id", "spu_id", "title", "qanchors"]`,对 `qanchors.{lang}` 用 `_split_qanchors` 拆成词条,以 `source="qanchor"` 加入候选,排序时 `qanchor` 权重大于纯 title(`add_product("qanchor", ...)`);suggest 配置中 `sources: ["query_log", "qanchor"]` 表示候选来源包含 qanchor。
-- **实现与依赖**:内容理解内部使用大模型(需 `DASHSCOPE_API_KEY`),支持多语言与 Redis 缓存(如 `product_anchors`);逻辑与 `indexer/product_enrich` 一致。
-
-**状态**:内容理解字段已接入索引与 suggest 链路;依赖内容理解(qanchors/tags)的**全量数据尚未全部完成一轮**,后续需持续跑满并校验效果。
+当前边界:
 
-详见:`indexer/ANCHORS_AND_SEMANTIC_ATTRIBUTES.md`、`docs/搜索API对接指南-05-索引接口(Indexer).md`(`enrich-content` 等)、`api/routes/indexer.py`(enrich-content 路由)。
+- 本仓库负责基础 doc 构建、多语言字段、向量、规格聚合等索引能力。
+- 独立内容理解服务负责生成 `qanchors` / `enriched_*`。
+- 上游索引程序负责把两侧结果合并后写入 ES。
 
 ---
@@ -145,7 +141,7 @@ instruction: "Given a shopping query, rank product titles by relevance"
 - **增量示例**:`./scripts/build_suggestions.sh 162 --mode incremental --overlap-minutes 30`(按 watermark 增量更新);脚本内部调用 `main.py build-suggestions --tenant-id <id> ...`。
 - 构建逻辑在 `suggestion/builder.py` 的 `SuggestionIndexBuilder`:从 ES 商品索引(含 `title`、`qanchors`)与查询日志等拉取数据,写入 versioned 建议索引并切换 alias。
 - **尚未完成的“增量机制”**:指**自动/事件驱动的增量**(如商品变更或日志写入时自动刷新建议索引);当前 incremental 模式为“按 watermark 再跑一次构建”,仍为脚本主动触发,非持续增量流水线。
-- **依赖**:suggest 候选依赖商品侧 **内容理解字段**(qanchors/tags);`sources: ["query_log", "qanchor"]` 表示候选来自查询日志与 qanchor;当前内容理解未全量跑完一轮,suggest 数据会随全量重建逐步完善。
+- **依赖**:suggest 候选依赖商品侧 **内容理解字段**(qanchors/tags);`sources: ["query_log", "qanchor"]` 表示候选来自查询日志与 qanchor。字段生成职责已迁移到独立内容理解服务。
 
 详见:`suggestion/builder.py`、`suggestion/ARCHITECTURE_V2.md`、`main.py`(build-suggestions 子命令)。
 
@@ -241,7 +237,7 @@ cd /data/saas-search
 | **Embedding** | TEI 替代 SentenceTransformers/vLLM 作为文本向量后端,兼顾性能与工程化(Docker、配置化、T4 调优);图片向量由 clip-as-service 承担。 |
 | **Reranker** | vLLM + Qwen3-Reranker-0.6B,针对 T4 做 float16、prefix caching、CUDA 图、按长度分批及 batch/长度参数搜索;高并发场景可选用 DashScope 云后端。 |
 | **翻译** | 因 qwen-mt 限速(RPM≈60),迁移至可配置的 qwen-flash 等方案,支撑在线索引与 query;需金伟侧对索引做流量控制。 |
-| **内容理解** | 提供 qanchors/tags 等字段生成接口,支撑 suggest 与检索增强;全量一轮尚未完全跑满。 |
+| **内容理解** | 字段模型仍可被检索与 suggest 消费,但生成服务已迁移到独立项目;本仓库不再内置该实现。 |
 | **架构** | Provider 动态选择翻译;service_ctl 统一监控与拉起;suggest 目前全量脚本触发,增量待做。 |
 | **性能基线** | 向量化扩展性良好;reranker 为整链瓶颈(386 docs 约 0.6 rps);search 约 8 rps;suggest 约 200+ rps。 |
 
docs/搜索API对接指南-00-总览与快速开始.md
@@ -90,7 +90,6 @@ curl -X POST "http://43.166.252.75:6002/search/" \
 | 查询文档 | POST | `/indexer/documents` | 查询SPU文档数据(不写入ES) |
 | 构建ES文档(正式对接) | POST | `/indexer/build-docs` | 基于上游提供的 MySQL 行数据构建 ES doc,不写入 ES,供 Java 等调用后自行写入 |
 | 构建ES文档(测试用) | POST | `/indexer/build-docs-from-db` | 仅在测试/调试时使用,根据 `tenant_id + spu_ids` 内部查库并构建 ES doc |
-| 内容理解字段生成 | POST | `/indexer/enrich-content` | 根据商品标题批量生成 qanchors、enriched_attributes、tags,供微服务组合方式使用(独立服务端口 6001) |
 | 索引健康检查 | GET | `/indexer/health` | 检查索引服务状态 |
 | 健康检查 | GET | `/admin/health` | 服务健康检查 |
 | 获取配置 | GET | `/admin/config` | 获取租户配置 |
@@ -104,6 +103,8 @@ curl -X POST "http://43.166.252.75:6002/search/" \
 | 向量服务(图片) | 6008 | `POST /embed/image` | 图片向量化 |
 | 翻译服务 | 6006 | `POST /translate` | 文本翻译(支持 qwen-mt / llm / deepl / 本地模型) |
 | 重排服务 | 6007 | `POST /rerank` | 检索结果重排 |
-| 内容理解(独立服务) | 6001 | `POST /indexer/enrich-content` | 根据商品标题生成 qanchors、tags 等,供 indexer 微服务组合方式使用 |
+---
+
+> 注:`/indexer/enrich-content` 已迁移到独立项目,不再由本仓库的 Indexer 服务提供;本仓库保留 `build-docs` / `build-docs-from-db` 等索引构建接口。
 
 ---
docs/搜索API对接指南-05-索引接口(Indexer).md
@@ -13,7 +13,6 @@
 | 查询文档 | POST | `/indexer/documents` | 按 SPU ID 列表查询 ES 文档,不写入 ES |
 | 构建 ES 文档(正式) | POST | `/indexer/build-docs` | 由上游提供 MySQL 行数据,返回 ES-ready 文档,不写 ES |
 | 构建 ES 文档(测试) | POST | `/indexer/build-docs-from-db` | 由本服务查库并构建文档,仅测试/调试用 |
-| 内容理解字段生成 | POST | `/indexer/enrich-content` | 根据商品标题批量生成 qanchors、enriched_attributes、tags(供微服务组合方式使用;独立服务端口 6001) |
 | 索引健康检查 | GET | `/indexer/health` | 检查索引服务与数据库连接状态 |
 
 #### 5.0 支撑外部 indexer 的三种方式
@@ -22,8 +21,8 @@
 
 | 方式 | 说明 | 适用场景 |
 |------|------|----------|
-| **1)doc 填充接口** | 调用 `POST /indexer/build-docs` 或 `POST /indexer/build-docs-from-db`,由本服务基于 MySQL 行数据构建完整 ES 文档(含多语言、向量、规格等),**不写入 ES**,由调用方自行写入。 | 希望一站式拿到 ES-ready doc,由己方控制写 ES 的时机与索引名。 |
-| **2)微服务组合** | 单独调用**翻译**、**向量化**、**内容理解字段生成**等接口,由 indexer 程序自己组装 doc 并写入 ES。翻译与向量化为独立微服务(见第 7 节);内容理解为 FacetAwareMatching 独立服务接口 `POST /indexer/enrich-content`(端口 6001)。 | 需要灵活编排、或希望将 LLM/向量等耗时步骤与主链路解耦(如异步补齐 qanchors/tags)。 |
+| **1)doc 填充接口** | 调用 `POST /indexer/build-docs` 或 `POST /indexer/build-docs-from-db`,由本服务基于 MySQL 行数据构建 ES 文档(含多语言、向量、规格等),**不写入 ES**,由调用方自行写入。 | 希望一站式拿到 ES-ready doc,由己方控制写 ES 的时机与索引名。 |
+| **2)微服务组合** | 单独调用**翻译**、**向量化**、**外部内容理解服务**等能力,由 indexer 程序自己组装 doc 并写入 ES。翻译与向量化见第 7 节;内容理解字段生成已迁移到独立项目,不再由本仓库维护。 | 需要灵活编排、或希望将 LLM/向量等耗时步骤与主链路解耦(如异步补齐 qanchors/tags)。 |
 | **3)本服务直接写 ES** | 调用全量索引 `POST /indexer/reindex`、增量索引 `POST /indexer/index`(指定 SPU ID 列表),由本服务从 MySQL 拉数并直接写入 ES。 | 自建运维、联调或不需要由 Java 写 ES 的场景。 |
 
 - **方式 1** 与 **方式 2** 下,ES 的写入方均为外部 indexer(或 Java),职责清晰。
@@ -645,174 +644,20 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \
 
 返回结构与 `/indexer/build-docs` 相同,可直接用于对比 ES 实际文档或调试字段映射问题。
 
-### 5.8 内容理解字段生成接口
-
-- **端点**: `POST /indexer/enrich-content`
-- **服务**: FacetAwareMatching 独立服务(默认端口 **6001**;由 `/data/FacetAwareMatching/scripts/service_ctl.sh` 管理)
-- **描述**: 根据商品内容信息批量生成 **qanchors**(锚文本)、**enriched_attributes**(通用语义属性)、**enriched_tags**(细分标签)、**enriched_taxonomy_attributes**(taxonomy 结构化属性),供外部 indexer 在「微服务组合」方式下自行拼装 doc 时使用。请求以 `items[]` 传入商品内容字段(必填/可选见下表)。接口只暴露商品内容输入,语言选择、分析维度与最终字段结构统一由 FacetAwareMatching 的 `product_enrich` 内部决定;当前返回结果与 `search_products` mapping 保持一致。单次请求在线程池中执行,避免阻塞其他接口。
-
-当前支持的 `category_taxonomy_profile`:
-- `apparel`
-- `3c`
-- `bags`
-- `pet_supplies`
-- `electronics`
-- `outdoor`
-- `home_appliances`
-- `home_living`
-- `wigs`
-- `beauty`
-- `accessories`
-- `toys`
-- `shoes`
-- `sports`
-- `others`
-
-说明:
-- 所有 profile 的 `enriched_taxonomy_attributes.value` 都统一返回 `zh` + `en`。
-- 外部调用 `/indexer/enrich-content` 时,以请求中的 `category_taxonomy_profile` 为准。
-- 若 indexer 内部仍接入内容理解能力,taxonomy profile 请在调用侧显式传入(建议仍以租户行业配置为准)。
+### 5.8 内容理解字段生成能力(已迁出)
 
-#### 请求参数
-
-```json
-{
-  "tenant_id": "170",
-  "enrichment_scopes": ["generic", "category_taxonomy"],
-  "category_taxonomy_profile": "apparel",
-  "items": [
-    {
-      "spu_id": "223167",
-      "title": "纯棉短袖T恤 夏季男装",
-      "brief": "夏季透气纯棉短袖,舒适亲肤",
-      "description": "100%棉,圆领版型,适合日常通勤与休闲穿搭。",
-      "image_url": "https://example.com/images/223167.jpg"
-    },
-    {
-      "spu_id": "223168",
-      "title": "12PCS Dolls with Bottles",
-      "image_url": "https://example.com/images/223168.jpg"
-    }
-  ]
-}
-```
+`/indexer/enrich-content` 已迁移到独立项目,本仓库当前的 Indexer 服务(默认端口 `6004`)**不再暴露该接口**,也**不再在** `/indexer/build-docs`、`/indexer/build-docs-from-db`、`/indexer/reindex`、`/indexer/index` 的构建链路里内置生成 `qanchors`、`enriched_attributes`、`enriched_tags`、`enriched_taxonomy_attributes`。
 
-| 参数 | 类型 | 必填 | 默认值 | 说明 |
-|------|------|------|--------|------|
-| `tenant_id` | string | Y | - | 租户 ID。目前仅用于记录日志,不产生实际作用 |
-| `enrichment_scopes` | array[string] | N | `["generic", "category_taxonomy"]` | 选择要执行的增强范围。`generic` 生成 `qanchors`/`enriched_tags`/`enriched_attributes`,`category_taxonomy` 生成 `enriched_taxonomy_attributes` |
-| `category_taxonomy_profile` | string | N | `apparel` | 品类 taxonomy profile。支持:`apparel`、`3c`、`bags`、`pet_supplies`、`electronics`、`outdoor`、`home_appliances`、`home_living`、`wigs`、`beauty`、`accessories`、`toys`、`shoes`、`sports`、`others` |
-| `items` | array | Y | - | 待分析列表;**单次最多 50 条** |
-
-`items[]` 字段说明:
-
-| 字段 | 类型 | 必填 | 说明 |
-|------|------|------|------|
-| `spu_id` | string | Y | SPU ID,用于回填结果;目前仅用于记录日志,不产生实际作用 |
-| `title` | string | Y | 商品标题 |
713 -| `image_url` | string | N | 商品主图 URL;当前仅透传,暂未参与 prompt 与缓存键,后续可用于图像/多模态内容理解 |  
714 -| `brief` | string | N | 商品简介/短描述;当前会参与 prompt 与缓存键 |  
715 -| `description` | string | N | 商品详情/长描述;当前会参与 prompt 与缓存键 |  
716 -  
717 -缓存说明:  
718 -  
719 -- 内容缓存按 **增强范围 + taxonomy profile** 拆分;`generic` 与 `category_taxonomy:apparel` 等使用不同缓存命名空间,互不污染、可独立演进。  
720 -- 缓存键由 `analysis_kind + target_lang + prompt/schema 版本指纹 + prompt 输入文本 hash` 构成;对 category taxonomy 来说,profile 会进入 schema 标识与版本指纹。  
721 -- 当前真正参与 prompt 输入的字段是:`title`、`brief`、`description`;这些字段任一变化,都会落到新的缓存 key。  
722 -- `prompt/schema 版本指纹` 会综合 system prompt、shared instruction、localized table headers、result fields、user instruction template 等信息生成;因此只要提示词或输出契约变化,旧缓存会自然失效。  
723 -- `tenant_id`、`spu_id` 只用于请求归属与结果回填,不参与缓存键。  
724 -- 因此,输入内容与 prompt 契约都不变时可跨请求直接命中缓存;任一一侧变化,都会自然落到新的缓存 key。 651 +当前建议的对接方式:
725 652
726 -语言说明: 653 +1. 调用本仓库的 `POST /indexer/build-docs` 或 `POST /indexer/build-docs-from-db` 生成基础 ES 文档。
  654 +2. 调用独立内容理解服务生成 `qanchors` / `enriched_*` 字段。
  655 +3. 由上游索引程序自行合并字段后写入 ES。
727 656
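上述三步里,第 3 步的“上游自行合并字段”可以用下面这个最小示意表达(假设性代码:`ENRICH_FIELDS` 与入参结构均为演示用,实际字段以 `build-docs` 返回结构与独立内容理解服务的契约为准):

```python
from typing import Any, Dict

# 假设性示意:外部内容理解服务产出的字段集合(与 search_products mapping 中保留的字段同名)
ENRICH_FIELDS = (
    "qanchors",
    "enriched_attributes",
    "enriched_tags",
    "enriched_taxonomy_attributes",
)


def merge_enriched(base_doc: Dict[str, Any], enriched: Dict[str, Any]) -> Dict[str, Any]:
    """把外部内容理解服务的输出浅合并进 build-docs 产出的基础 doc;空值不覆盖。"""
    doc = dict(base_doc)  # 不原地修改基础 doc
    for field in ENRICH_FIELDS:
        value = enriched.get(field)
        if value:
            doc[field] = value
    return doc
```

合并后的 doc 再由上游索引程序批量写入 ES;写入时机与索引名仍由调用方自行控制。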
728 -- 接口不接受语言控制参数。  
729 -- 返回哪些语言、返回哪些语义维度,统一由 `indexer.product_enrich` 内部逻辑决定。  
730 -- 当前为了与 `search_products` mapping 对齐,通用增强字段与 taxonomy 字段都统一只返回核心索引语言 `zh`、`en`。 657 +补充说明:
731 658
732 -批量请求建议:  
733 -- **全量**:强烈建议 尽可能 **20 个 SPU/doc** 攒成一个批次后再请求一次。  
734 -- **增量**:可按时效要求设置时间窗口(例如 **5 分钟**),在窗口内尽可能攒到 **20 个**;达到 20 或窗口到期就发送一次请求。  
735 -- 允许超过20,服务内部会拆分成小批次逐个处理。也允许小于20,但是将造成费用和耗时的成本上升,特别是每次请求一个doc的情况。  
736 -  
737 -#### 响应格式  
738 -  
739 -```json  
740 -{  
741 - "tenant_id": "170",  
742 - "enrichment_scopes": ["generic", "category_taxonomy"],  
743 - "category_taxonomy_profile": "apparel",  
744 - "total": 2,  
745 - "results": [  
746 - {  
747 - "spu_id": "223167",  
748 - "qanchors": {  
749 - "zh": ["短袖T恤", "纯棉", "男装", "夏季"],  
750 - "en": ["cotton t-shirt", "short sleeve", "men", "summer"]  
751 - },  
752 - "enriched_tags": {  
753 - "zh": ["纯棉", "短袖", "男装"],  
754 - "en": ["cotton", "short sleeve", "men"]  
755 - },  
756 - "enriched_attributes": [  
757 - { "name": "enriched_tags", "value": { "zh": "纯棉" } },  
758 - { "name": "usage_scene", "value": { "zh": "日常" } },  
759 - { "name": "enriched_tags", "value": { "en": "cotton" } }  
760 - ],  
761 - "enriched_taxonomy_attributes": [  
762 - { "name": "Product Type", "value": { "zh": ["T恤"], "en": ["t-shirt"] } },  
763 - { "name": "Target Gender", "value": { "zh": ["男"], "en": ["men"] } },  
764 - { "name": "Season", "value": { "zh": ["夏季"], "en": ["summer"] } }  
765 - ]  
766 - },  
767 - {  
768 - "spu_id": "223168",  
769 - "qanchors": {  
770 - "en": ["dolls", "toys", "12pcs"]  
771 - },  
772 - "enriched_tags": {  
773 - "en": ["dolls", "toys"]  
774 - },  
775 - "enriched_attributes": [],  
776 - "enriched_taxonomy_attributes": []  
777 - }  
778 - ]  
779 -}  
780 -```  
781 -  
782 -| 字段 | 类型 | 说明 |  
783 -|------|------|------|  
784 -| `enrichment_scopes` | array | 实际执行的增强范围列表 |  
785 -| `category_taxonomy_profile` | string | 实际使用的品类 taxonomy profile |  
786 -| `results` | array | 与请求 `items` 一一对应,每项含 `spu_id`、`qanchors`、`enriched_attributes`、`enriched_tags`、`enriched_taxonomy_attributes` |  
787 -| `results[].qanchors` | object | 与 ES `qanchors` 字段同结构,按语言键返回短语数组 |  
788 -| `results[].enriched_tags` | object | 与 ES `enriched_tags` 字段同结构,按语言键返回标签数组 |  
789 -| `results[].enriched_attributes` | array | 与 ES `enriched_attributes` nested 字段同结构,每项为 `{ "name", "value": { "zh"?: "...", "en"?: "..." } }` |  
790 -| `results[].enriched_taxonomy_attributes` | array | 与 ES `enriched_taxonomy_attributes` nested 字段同结构。每项通常为 `{ "name", "value": { "zh"?: [...], "en"?: [...] } }` |  
791 -| `results[].error` | string | 若该条处理失败(如 LLM 异常),会在此字段返回错误信息 |  
792 -  
793 -**错误响应**:  
794 -- `400`: `items` 为空或超过 50 条  
795 -- `503`: 未配置 `DASHSCOPE_API_KEY`,内容理解服务不可用  
796 -  
797 -#### 请求示例  
798 -  
799 -```bash  
800 -curl -X POST "http://localhost:6001/indexer/enrich-content" \  
801 - -H "Content-Type: application/json" \  
802 - -d '{  
803 - "tenant_id": "163",  
804 - "enrichment_scopes": ["generic", "category_taxonomy"],  
805 - "category_taxonomy_profile": "apparel",  
806 - "items": [  
807 - {  
808 - "spu_id": "223167",  
809 - "title": "纯棉短袖T恤 夏季男装夏季男装",  
810 - "brief": "夏季透气纯棉短袖,舒适亲肤",  
811 - "description": "100%棉,圆领版型,适合日常通勤与休闲穿搭。",  
812 - "image_url": "https://example.com/images/223167.jpg"  
813 - }  
814 - ]  
815 - }'  
816 -``` 659 +- `search_products` mapping 仍保留上述字段,便于独立内容理解服务继续产出并写入。
  660 +- `suggestion` 等消费侧仍可读取 ES 中已有的 `qanchors` 字段;迁移的是“生成实现”,不是字段模型本身。
  661 +- 本文档不再维护独立内容理解服务的请求/响应细节,请以对应独立项目的文档为准。
817 662
818 --- 663 ---
docs/搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md
1 # 搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation) 1 # 搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation)
2 2
3 -本篇覆盖向量服务(Embedding)、重排服务(Reranker)、翻译服务(Translation)以及 Indexer 服务内的内容理解字段生成(原文第 7 章) 3 +本篇覆盖向量服务(Embedding)、重排服务(Reranker)与翻译服务(Translation)。原先收录的 `/indexer/enrich-content` 内容理解接口已迁移到独立项目,不再由本仓库维护。
4 4
5 ## 7. 微服务接口(向量、重排、翻译) 5 ## 7. 微服务接口(向量、重排、翻译)
6 6
@@ -438,14 +438,8 @@ curl "http://localhost:6006/health" @@ -438,14 +438,8 @@ curl "http://localhost:6006/health"
438 } 438 }
439 ``` 439 ```
440 440
441 -### 7.4 内容理解字段生成(Indexer 服务内 441 +### 7.4 内容理解字段生成(已迁出
442 442
443 -内容理解字段生成接口部署在 **Indexer 服务**(默认端口 6004)内,与「翻译、向量化」等独立端口微服务并列,供采用**微服务组合**方式的 indexer 调用。  
444 -  
445 -- **Base URL**: Indexer 服务地址,如 `http://localhost:6004`  
446 -- **路径**: `POST /indexer/enrich-content`  
447 -- **说明**: 根据商品标题批量生成 `qanchors`、`enriched_attributes`、`enriched_tags`、`enriched_taxonomy_attributes`,用于拼装 ES 文档。支持通过 `enrichment_scopes` 选择执行 `generic` / `category_taxonomy`,并通过 `category_taxonomy_profile` 选择对应大类的 taxonomy prompt/profile;默认执行 `generic + category_taxonomy(apparel)`。当前支持的 taxonomy profile 包括 `apparel`、`3c`、`bags`、`pet_supplies`、`electronics`、`outdoor`、`home_appliances`、`home_living`、`wigs`、`beauty`、`accessories`、`toys`、`shoes`、`sports`、`others`。所有 profile 的 taxonomy 输出都统一返回 `zh` + `en`,`category_taxonomy_profile` 只决定字段集合。内部使用大模型(需配置 `DASHSCOPE_API_KEY`),支持多语言与 Redis 缓存;单次最多 50 条,建议批量调用以提升效率。  
448 -  
449 -请求/响应格式、示例及错误码见 [-05-索引接口(Indexer)](./搜索API对接指南-05-索引接口(Indexer).md#58-内容理解字段生成接口)。 443 +`/indexer/enrich-content` 已迁移到独立项目,不再属于本仓库的微服务接口集合。当前仓库中的 Indexer 服务(`6004`)不再提供该接口;如需 `qanchors` / `enriched_*` 字段,请接入对应独立服务,并与本仓库的 `build-docs` 输出在上游侧自行合并。
450 444
451 --- 445 ---
docs/缓存与Redis使用说明.md
@@ -4,7 +4,6 @@ @@ -4,7 +4,6 @@
4 4
5 - **文本向量缓存**(embedding 缓存) 5 - **文本向量缓存**(embedding 缓存)
6 - **翻译结果缓存**(Qwen-MT 等机器翻译) 6 - **翻译结果缓存**(Qwen-MT 等机器翻译)
7 -- **商品内容理解缓存**(锚文本 / 语义属性 / 标签)  
8 7
9 底层连接配置统一来自 `config/env_config.py` 的 `REDIS_CONFIG`: 8 底层连接配置统一来自 `config/env_config.py` 的 `REDIS_CONFIG`:
10 9
@@ -21,8 +20,6 @@ @@ -21,8 +20,6 @@
21 |------------|----------|----------------|----------|------| 20 |------------|----------|----------------|----------|------|
22 | 向量缓存(text/image embedding) | 文本:`{EMBEDDING_CACHE_PREFIX}:embed:norm{0|1}:{text}`;图片:`{EMBEDDING_CACHE_PREFIX}:image:embed:norm{0|1}:{url_or_path}` | **BF16 bytes**(每维 2 字节大端存储),读取后恢复为 `np.float32` | TTL=`REDIS_CONFIG["cache_expire_days"]` 天;访问时滑动过期 | 见 `embeddings/text_encoder.py`、`embeddings/image_encoder.py`、`embeddings/server.py`;前缀由 `REDIS_CONFIG["embedding_cache_prefix"]` 控制 | 21 | 向量缓存(text/image embedding) | 文本:`{EMBEDDING_CACHE_PREFIX}:embed:norm{0|1}:{text}`;图片:`{EMBEDDING_CACHE_PREFIX}:image:embed:norm{0|1}:{url_or_path}` | **BF16 bytes**(每维 2 字节大端存储),读取后恢复为 `np.float32` | TTL=`REDIS_CONFIG["cache_expire_days"]` 天;访问时滑动过期 | 见 `embeddings/text_encoder.py`、`embeddings/image_encoder.py`、`embeddings/server.py`;前缀由 `REDIS_CONFIG["embedding_cache_prefix"]` 控制 |
23 | 翻译结果缓存(translator service) | `trans:{model}:{target_lang}:{source_text[:4]}{sha256(source_text)}` | 机翻后的单条字符串 | TTL=`services.translation.cache.ttl_seconds` 秒;可配置滑动过期 | 见 `translation/service.py` + `config/config.yaml` | 22 | 翻译结果缓存(translator service) | `trans:{model}:{target_lang}:{source_text[:4]}{sha256(source_text)}` | 机翻后的单条字符串 | TTL=`services.translation.cache.ttl_seconds` 秒;可配置滑动过期 | 见 `translation/service.py` + `config/config.yaml` |
24 -| 商品内容理解缓存(anchors / 语义属性 / tags) | `{ANCHOR_CACHE_PREFIX}:{tenant_or_global}:{target_lang}:{md5(title)}` | `json.dumps(dict)`,包含 id/title/category/tags/anchor_text 等 | TTL=`ANCHOR_CACHE_EXPIRE_DAYS` 天 | 见 `indexer/product_enrich.py` |  
25 -  
26 下面按模块详细说明。 23 下面按模块详细说明。
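表中向量缓存一行的「BF16 bytes(每维 2 字节大端存储),读取后恢复为 `np.float32`」可以用下面的示意代码理解(假设性实现,仅用于说明存储格式;实际读写逻辑以 `embeddings/` 下的代码为准):

```python
import numpy as np


def to_bf16_bytes(vec: np.ndarray) -> bytes:
    """float32 向量 -> BF16 bytes:取每个大端 float32 的高 2 字节。"""
    be = np.asarray(vec, dtype=">f4").tobytes()
    return b"".join(be[i:i + 2] for i in range(0, len(be), 4))


def from_bf16_bytes(raw: bytes) -> np.ndarray:
    """BF16 bytes -> np.float32:低 16 位补零后按大端 float32 解释。"""
    padded = b"".join(raw[i:i + 2] + b"\x00\x00" for i in range(0, len(raw), 2))
    return np.frombuffer(padded, dtype=">f4").astype(np.float32)
```

BF16 只保留约 3 位十进制有效数字,换来每维 2 字节(相对 float32 减半)的存储开销,这与表中“读取后恢复为 `np.float32`”的描述相对应。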
27 24
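翻译缓存 key 的拼接规则可以按表中模板直接还原成一个小工具函数(示意实现,真实逻辑以 `translation/service.py` 为准):

```python
import hashlib


def translation_cache_key(model: str, target_lang: str, source_text: str) -> str:
    """按表中模板 trans:{model}:{target_lang}:{source_text[:4]}{sha256(source_text)} 拼出缓存 key。"""
    digest = hashlib.sha256(source_text.encode("utf-8")).hexdigest()
    return f"trans:{model}:{target_lang}:{source_text[:4]}{digest}"
```

key 末段保留原文前 4 个字符,便于人工排查时快速识别内容;唯一性则由后面的完整 sha256 摘要保证。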
28 --- 25 ---
@@ -186,69 +183,9 @@ services: @@ -186,69 +183,9 @@ services:
186 183
187 --- 184 ---
188 185
189 -## 4. 商品内容理解缓存(indexer/product_enrich.py)  
190 -  
191 -- **代码位置**:`indexer/product_enrich.py`  
192 -- **用途**:在生成商品锚文本(qanchors)、语义属性、标签等内容理解结果时复用缓存,避免对同一标题重复调用大模型。  
193 -  
194 -### 4.1 Key 设计  
195 -  
196 -- 配置项:  
197 - - `ANCHOR_CACHE_PREFIX = REDIS_CONFIG.get("anchor_cache_prefix", "product_anchors")`  
198 - - `ANCHOR_CACHE_EXPIRE_DAYS = int(REDIS_CONFIG.get("anchor_cache_expire_days", 30))`  
199 -- Key 构造函数:`_make_analysis_cache_key(product, target_lang, analysis_kind)`  
200 -- 模板:  
201 -  
202 -```text  
203 -{ANCHOR_CACHE_PREFIX}:{analysis_kind}:{prompt_contract_hash}:{target_lang}:{prompt_input_prefix}{md5(prompt_input)}  
204 -```  
205 -  
206 -- 字段说明:  
207 - - `ANCHOR_CACHE_PREFIX`:默认 `"product_anchors"`,可通过 `.env` 中的 `REDIS_ANCHOR_CACHE_PREFIX`(若存在)间接配置到 `REDIS_CONFIG`;  
208 - - `analysis_kind`:分析族,目前至少包括 `content` 与 `taxonomy`,两者缓存隔离;  
209 - - `prompt_contract_hash`:基于 system prompt、shared instruction、localized headers、result fields、user instruction template、schema cache version 等生成的短 hash;只要提示词或输出契约变化,缓存会自动失效;  
210 - - `target_lang`:内容理解输出语言,例如 `zh`;  
211 - - `prompt_input_prefix + md5(prompt_input)`:对真正送入 prompt 的商品文本做前缀 + MD5;当前 prompt 输入来自 `title`、`brief`、`description` 的规范化拼接结果。  
212 -  
213 -设计原则:  
214 -  
215 -- 只让**实际影响 LLM 输出**的输入参与 key;  
216 -- 不让 `tenant_id`、`spu_id` 这类“结果归属信息”污染缓存;  
217 -- prompt 或 schema 变更时,不依赖人工清理 Redis,也能自然切换到新 key。  
218 -  
219 -### 4.2 Value 与类型  
220 -  
221 -- 类型:`json.dumps(dict, ensure_ascii=False)`。  
222 -- 典型结构(简化):  
223 -  
224 -```json  
225 -{  
226 - "id": "123",  
227 - "lang": "zh",  
228 - "title_input": "原始标题",  
229 - "title": "归一化后的商品标题",  
230 - "category_path": "...",  
231 - "tags": "...",  
232 - "target_audience": "...",  
233 - "usage_scene": "...",  
234 - "anchor_text": "..., ..."  
235 -}  
236 -```  
237 -  
238 -- 读取时通过 `json.loads(raw)` 还原为 `Dict[str, Any]`。  
239 -- `content` 与 `taxonomy` 的 value 结构会随各自 schema 不同而不同,但都会先通过统一的 normalize 逻辑再写缓存。  
240 -  
241 -### 4.3 过期策略  
242 -  
243 -- TTL:`ttl = ANCHOR_CACHE_EXPIRE_DAYS * 24 * 3600` 秒(默认 30 天);  
244 -- 写入:`redis.setex(key, ttl, json.dumps(result, ensure_ascii=False))`;  
245 -- 读取:仅做 `redis.get(key)`,**不做滑动过期**。  
246 -  
247 -### 4.4 调用流程中的位置 186 +## 4. 商品内容理解缓存(已迁出)
248 187
249 -- 单条调用(索引阶段常见)时,`analyze_products()` 会先尝试命中缓存:  
250 - - 若命中,直接返回缓存结果;  
251 - - 若 miss,调用 LLM,解析结果后再写入缓存。 188 +本仓库原先存在一套用于 `qanchors` / `enriched_*` 生成的 Redis 缓存实现,但对应内容理解服务已经迁移到独立项目,当前仓库代码中不再读写这类缓存,也不再把它作为运行时能力的一部分维护。
252 189
253 --- 190 ---
254 191
@@ -258,24 +195,24 @@ services: @@ -258,24 +195,24 @@ services:
258 195
259 ### 5.1 redis_cache_health_check.py(缓存健康巡检) 196 ### 5.1 redis_cache_health_check.py(缓存健康巡检)
260 197
261 -**功能**:按**业务缓存类型**(embedding / translation / anchors)做健康巡检,不扫全库。 198 +**功能**:按**业务缓存类型**(embedding / translation)做健康巡检,不扫全库。
262 199
263 - 对每类缓存:SCAN 匹配对应 key 前缀,统计**匹配 key 数量**(受 `--max-scan` 上限约束); 200 - 对每类缓存:SCAN 匹配对应 key 前缀,统计**匹配 key 数量**(受 `--max-scan` 上限约束);
264 - **TTL 分布**:对采样 key 统计 `no-expire-or-expired` / `0-1h` / `1h-1d` / `1d-30d` / `>30d`; 201 - **TTL 分布**:对采样 key 统计 `no-expire-or-expired` / `0-1h` / `1h-1d` / `1d-30d` / `>30d`;
265 - **近期活跃 key**:从采样中选出 `OBJECT IDLETIME <= 600s` 的 key,用于判断是否有新写入; 202 - **近期活跃 key**:从采样中选出 `OBJECT IDLETIME <= 600s` 的 key,用于判断是否有新写入;
266 -- **样本 key 与 value 预览**:对 embedding 显示 ndarray 信息,对 translation 显示译文片段,对 anchors 显示 JSON 摘要 203 +- **样本 key 与 value 预览**:对 embedding 显示 ndarray 信息,对 translation 显示译文片段
267 204
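上面的 TTL 分布可以用一个小函数描述(示意实现:桶边界按桶名推断,实际阈值以脚本源码为准):

```python
def ttl_bucket(ttl_seconds: int) -> str:
    """把 `TTL key` 的返回值映射到巡检输出的分布桶。

    redis 约定:-1 表示 key 无过期时间,-2 表示 key 不存在(已过期被删),
    两者在这里都归入 no-expire-or-expired 桶。
    """
    if ttl_seconds < 0:
        return "no-expire-or-expired"
    if ttl_seconds <= 3600:
        return "0-1h"
    if ttl_seconds <= 86400:
        return "1h-1d"
    if ttl_seconds <= 30 * 86400:
        return "1d-30d"
    return ">30d"
```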
268 -**适用场景**:日常查看类缓存是否在增长、TTL 是否合理、是否有近期写入;与「缓存总览表」中的 key 设计一一对应。 205 +**适用场景**:日常查看类缓存是否在增长、TTL 是否合理、是否有近期写入;与「缓存总览表」中的 key 设计一一对应。
269 206
270 **用法示例**: 207 **用法示例**:
271 208
272 ```bash 209 ```bash
273 -# 默认:检查 embedding / translation / anchors 三类 210 +# 默认:检查 embedding / translation
274 python scripts/redis/redis_cache_health_check.py 211 python scripts/redis/redis_cache_health_check.py
275 212
276 -# 只检查某一类或两类 213 +# 只检查某一类
277 python scripts/redis/redis_cache_health_check.py --type embedding 214 python scripts/redis/redis_cache_health_check.py --type embedding
278 -python scripts/redis/redis_cache_health_check.py --type translation anchors 215 +python scripts/redis/redis_cache_health_check.py --type translation
279 216
280 # 按自定义 pattern 检查(不按业务类型) 217 # 按自定义 pattern 检查(不按业务类型)
281 python scripts/redis/redis_cache_health_check.py --pattern "mycache:*" 218 python scripts/redis/redis_cache_health_check.py --pattern "mycache:*"
@@ -288,7 +225,7 @@ python scripts/redis/redis_cache_health_check.py --sample-size 100 --max-scan 50 @@ -288,7 +225,7 @@ python scripts/redis/redis_cache_health_check.py --sample-size 100 --max-scan 50
288 225
289 | 参数 | 说明 | 默认 | 226 | 参数 | 说明 | 默认 |
290 |------|------|------| 227 |------|------|------|
291 -| `--type` | 缓存类型:`embedding` / `translation` / `anchors`,可多选 | 三类都检查 | 228 +| `--type` | 缓存类型:`embedding` / `translation`,可多选 | 两类都检查 |
292 | `--pattern` | 自定义 key pattern(如 `mycache:*`),指定后忽略 `--type` | - | 229 | `--pattern` | 自定义 key pattern(如 `mycache:*`),指定后忽略 `--type` | - |
293 | `--db` | Redis 数据库编号 | 0 | 230 | `--db` | Redis 数据库编号 | 0 |
294 | `--sample-size` | 每类采样的 key 数量 | 50 | 231 | `--sample-size` | 每类采样的 key 数量 | 50 |
@@ -319,7 +256,7 @@ python scripts/redis/redis_cache_prefix_stats.py --all-db @@ -319,7 +256,7 @@ python scripts/redis/redis_cache_prefix_stats.py --all-db
319 python scripts/redis/redis_cache_prefix_stats.py --db 1 256 python scripts/redis/redis_cache_prefix_stats.py --db 1
320 257
321 # 只统计指定前缀(可多个) 258 # 只统计指定前缀(可多个)
322 -python scripts/redis/redis_cache_prefix_stats.py --prefix trans embedding product_anchors 259 +python scripts/redis/redis_cache_prefix_stats.py --prefix trans embedding
323 260
324 # 全 DB + 指定前缀 261 # 全 DB + 指定前缀
325 python scripts/redis/redis_cache_prefix_stats.py --all-db --prefix trans embedding 262 python scripts/redis/redis_cache_prefix_stats.py --all-db --prefix trans embedding
@@ -369,7 +306,7 @@ python scripts/redis/redis_memory_heavy_keys.py --top 100 @@ -369,7 +306,7 @@ python scripts/redis/redis_memory_heavy_keys.py --top 100
369 306
370 | 需求 | 推荐脚本 | 307 | 需求 | 推荐脚本 |
371 |------|----------| 308 |------|----------|
372 -| 看三类业务缓存(embedding/translation/anchors)的数量、TTL、近期写入、样本 value | `redis_cache_health_check.py` | 309 +| 看两类业务缓存(embedding/translation)的数量、TTL、近期写入、样本 value | `redis_cache_health_check.py` |
373 | 看全库或某前缀的 key 条数与内存占比 | `redis_cache_prefix_stats.py` | 310 | 看全库或某前缀的 key 条数与内存占比 | `redis_cache_prefix_stats.py` |
374 | 找占用内存最多的大 key、分析内存差异 | `redis_memory_heavy_keys.py` | 311 | 找占用内存最多的大 key、分析内存差异 | `redis_memory_heavy_keys.py` |
375 312
indexer/ANCHORS_AND_SEMANTIC_ATTRIBUTES.md
1 -## qanchors 与 enriched_attributes 设计与索引逻辑说明 1 +# qanchors 与 enriched_* 字段说明
2 2
3 -本文档详细说明: 3 +本文档原先记录本仓库内的内容理解实现细节。自 2026-04 起,这部分生成能力已经迁移到独立项目,本仓库不再维护 `/indexer/enrich-content` 路由,也不再在 indexer 构建链路内自动补齐这些字段。
4 4
5 -- **锚文本字段 `qanchors.{lang}` 的作用与来源**  
6 -- **语义属性字段 `enriched_attributes` 的结构、用途与写入流程**  
7 -- **多语言支持策略(zh / en / de / ru / fr)**  
8 -- **索引阶段与 LLM 调用的集成方式** 5 +当前状态:
9 6
10 -本设计已默认开启,无需额外开关;在上游 LLM 不可用时会自动降级为“无锚点/语义属性”,不影响主索引流程。  
11 -  
12 ----  
13 -  
14 -### 1. 字段设计概览  
15 -  
16 -#### 1.1 `qanchors.{lang}`:面向查询的锚文本  
17 -  
18 -- **Mapping 位置**:`mappings/search_products.json` 中的 `qanchors` 对象。  
19 -- **结构**(与 `title.{lang}` 一致):  
20 -  
21 -```140:182:/home/tw/saas-search/mappings/search_products.json  
22 -"qanchors": {  
23 - "type": "object",  
24 - "properties": {  
25 - "zh": { "type": "text", "analyzer": "index_ik", "search_analyzer": "query_ik" },  
26 - "en": { "type": "text", "analyzer": "english" },  
27 - "de": { "type": "text", "analyzer": "german" },  
28 - "ru": { "type": "text", "analyzer": "russian" },  
29 - "fr": { "type": "text", "analyzer": "french" },  
30 - ...  
31 - }  
32 -}  
33 -```  
34 -  
35 -- **语义**:  
36 - 用于承载“更接近用户自然搜索行为”的词/短语(query-style anchors),包括:  
37 - - 品类 + 细分类别表达;  
38 - - 使用场景(通勤、约会、度假、office outfit 等);  
39 - - 适用人群(年轻女性、plus size、teen boys 等);  
40 - - 材质 / 关键属性 / 功能特点等。  
41 -  
42 -- **使用场景**:  
43 - - 主搜索:作为额外的全文字段参与 BM25 召回与打分(可在 `search/query_config.py` 中给一定权重);  
44 - - Suggestion:`suggestion/builder.py` 会从 `qanchors.{lang}` 中拆分词条作为候选(`source="qanchor"`,权重大于 `title`)。  
45 -  
46 -#### 1.2 `enriched_attributes`:面向过滤/分面的通用语义属性  
47 -  
48 -- **Mapping 位置**:`mappings/search_products.json`,追加的 nested 字段。  
49 -- **结构**:  
50 -  
51 -```1392:1410:/home/tw/saas-search/mappings/search_products.json  
52 -"enriched_attributes": {  
53 - "type": "nested",  
54 - "properties": {  
55 - "lang": { "type": "keyword" }, // 语言:zh / en / de / ru / fr  
56 - "name": { "type": "keyword" }, // 维度名:usage_scene / target_audience / material / ...  
57 - "value": { "type": "keyword" } // 维度值:通勤 / office / Baumwolle ...  
58 - }  
59 -}  
60 -```  
61 -  
62 -- **语义**:  
63 - - 将 LLM 输出的各维度信息统一规约到 `name/value/lang` 三元组;  
64 - - 维度名稳定、值内容可变,便于后续扩展新的语义维度而不需要修改 mapping。  
65 -  
66 -- **当前支持的维度名**(在 `document_transformer.py` 中固定列表):  
67 - - `tags`:细分标签/风格标签;  
68 - - `target_audience`:适用人群;  
69 - - `usage_scene`:使用场景;  
70 - - `season`:适用季节;  
71 - - `key_attributes`:关键属性;  
72 - - `material`:材质说明;  
73 - - `features`:功能特点。  
74 -  
75 -- **使用场景**:  
76 - - 按语义维度过滤:  
77 - - 例:只要“适用人群=年轻女性”的商品;  
78 - - 例:`usage_scene` 包含 “office” 或 “通勤”。  
79 - - 按语义维度分面 / 展示筛选项:  
80 - - 例:展示当前结果中所有 `usage_scene` 的分布,供前端勾选;  
81 - - 例:展示所有 `material` 值 + 命中文档数。  
82 -  
83 ----  
84 -  
85 -### 2. LLM 分析服务:`indexer/product_annotator.py`  
86 -  
87 -#### 2.1 入口函数:`analyze_products`  
88 -  
89 -- **文件**:`indexer/product_annotator.py`  
90 -- **函数签名**:  
91 -  
92 -```365:392:/home/tw/saas-search/indexer/product_annotator.py  
93 -def analyze_products(  
94 - products: List[Dict[str, str]],  
95 - target_lang: str = "zh",  
96 - batch_size: Optional[int] = None,  
97 -) -> List[Dict[str, Any]]:  
98 - """  
99 - 库调用入口:根据输入+语言,返回锚文本及各维度信息。  
100 -  
101 - Args:  
102 - products: [{"id": "...", "title": "..."}]  
103 - target_lang: 输出语言,需在 SUPPORTED_LANGS 内  
104 - batch_size: 批大小,默认使用全局 BATCH_SIZE  
105 - """  
106 - ...  
107 -```  
108 -  
109 -- **支持的输出语言**(在同文件中定义):  
110 -  
111 -```54:62:/home/tw/saas-search/indexer/product_annotator.py  
112 -LANG_LABELS: Dict[str, str] = {  
113 - "zh": "中文",  
114 - "en": "英文",  
115 - "de": "德文",  
116 - "ru": "俄文",  
117 - "fr": "法文",  
118 -}  
119 -SUPPORTED_LANGS = set(LANG_LABELS.keys())  
120 -```  
121 -  
122 -- **返回结构**(每个商品一条记录):  
123 -  
124 -```python  
125 -{  
126 - "id": "<SPU_ID>",  
127 - "lang": "<zh|en|de|ru|fr>",  
128 - "title_input": "<原始输入标题>",  
129 - "title": "<目标语言的标题>",  
130 - "category_path": "<LLM 生成的品类路径>",  
131 - "tags": "<逗号分隔的细分标签>",  
132 - "target_audience": "<逗号分隔的适用人群>",  
133 - "usage_scene": "<逗号分隔的使用场景>",  
134 - "season": "<逗号分隔的适用季节>",  
135 - "key_attributes": "<逗号分隔的关键属性>",  
136 - "material": "<逗号分隔的材质说明>",  
137 - "features": "<逗号分隔的功能特点>",  
138 - "anchor_text": "<逗号分隔的锚文本短语>",  
139 - # 若发生错误,还会附带:  
140 - # "error": "<异常信息>"  
141 -}  
142 -```  
143 -  
144 -> 注意:表格中的多值字段(标签/场景/人群/材质等)约定为**使用逗号分隔**,后续索引端会统一按正则 `[,;|/\\n\\t]+` 再拆分为短语。  
145 -  
146 -#### 2.2 Prompt 设计与语言控制  
147 -  
148 -- Prompt 中会明确要求“**所有输出内容使用目标语言**”,并给出中英文示例:  
149 -  
150 -```65:81:/home/tw/saas-search/indexer/product_annotator.py  
151 -def create_prompt(products: List[Dict[str, str]], target_lang: str = "zh") -> str:  
152 - """创建LLM提示词(根据目标语言输出)"""  
153 - lang_label = LANG_LABELS.get(target_lang, "对应语言")  
154 - prompt = f"""请对输入的每条商品标题,分析并提取以下信息,所有输出内容请使用{lang_label}:  
155 -  
156 -1. 商品标题:将输入商品名称翻译为{lang_label}  
157 -2. 品类路径:从大类到细分品类,用">"分隔(例如:服装>女装>裤子>工装裤)  
158 -3. 细分标签:商品的风格、特点、功能等(例如:碎花,收腰,法式)  
159 -4. 适用人群:性别/年龄段等(例如:年轻女性)  
160 -5. 使用场景  
161 -6. 适用季节  
162 -7. 关键属性  
163 -8. 材质说明  
164 -9. 功能特点  
165 -10. 商品卖点:分析和提取一句话核心卖点,用于推荐理由  
166 -11. 锚文本:生成一组能够代表该商品、并可能被用户用于搜索的词语或短语。这些词语应覆盖用户需求的各个维度,如品类、细分标签、功能特性、需求场景等等。  
167 -"""  
168 -```  
169 -  
170 -- 返回格式固定为 Markdown 表格,首行头为:  
171 -  
172 -```89:91:/home/tw/saas-search/indexer/product_annotator.py  
173 -| 序号 | 商品标题 | 品类路径 | 细分标签 | 适用人群 | 使用场景 | 适用季节 | 关键属性 | 材质说明 | 功能特点 | 商品卖点 | 锚文本 |  
174 -|----|----|----|----|----|----|----|----|----|----|----|----|  
175 -```  
176 -  
177 -`parse_markdown_table` 会按表格列顺序解析成字段。  
178 -  
179 ----  
180 -  
181 -### 3. 索引阶段集成:`SPUDocumentTransformer._fill_llm_attributes`  
182 -  
183 -#### 3.1 调用时机  
184 -  
185 -在 `SPUDocumentTransformer.transform_spu_to_doc(...)` 的末尾,在所有基础字段(多语言文本、类目、SKU/规格、价格、库存等)填充完成后,会调用:  
186 -  
187 -```96:101:/home/tw/saas-search/indexer/document_transformer.py  
188 - # 文本字段处理(翻译等)  
189 - self._fill_text_fields(doc, spu_row, primary_lang)  
190 -  
191 - # 标题向量化  
192 - if self.enable_title_embedding and self.encoder:  
193 - self._fill_title_embedding(doc)  
194 - ...  
195 - # 时间字段  
196 - ...  
197 -  
198 - # 基于 LLM 的锚文本与语义属性(默认开启,失败时仅记录日志)  
199 - self._fill_llm_attributes(doc, spu_row)  
200 -```  
201 -  
202 -也就是说,**每个 SPU 文档默认会尝试补充 qanchors 与 enriched_attributes**。  
203 -  
204 -#### 3.2 语言选择策略  
205 -  
206 -在 `_fill_llm_attributes` 内部:  
207 -  
208 -```148:164:/home/tw/saas-search/indexer/document_transformer.py  
209 - try:  
210 - index_langs = self.tenant_config.get("index_languages") or ["en", "zh"]  
211 - except Exception:  
212 - index_langs = ["en", "zh"]  
213 -  
214 - # 只在支持的语言集合内调用  
215 - llm_langs = [lang for lang in index_langs if lang in SUPPORTED_LANGS]  
216 - if not llm_langs:  
217 - return  
218 -```  
219 -  
220 -- `tenant_config.index_languages` 决定该租户希望在索引中支持哪些语言;  
221 -- 实际调用 LLM 的语言集合 = `index_languages ∩ SUPPORTED_LANGS`;  
222 -- 当前 SUPPORTED_LANGS:`{"zh", "en", "de", "ru", "fr"}`。  
223 -  
224 -这保证了:  
225 -  
226 -- 如果租户只索引 `zh`,就只跑中文;  
227 -- 如果租户同时索引 `en` + `de`,就为这两种语言各跑一次 LLM;  
228 -- 如果 `index_languages` 里包含暂不支持的语言(例如 `es`),会被自动忽略。  
229 -  
230 -#### 3.3 调用 LLM 并写入字段  
231 -  
232 -核心逻辑(简化描述):  
233 -  
234 -```164:210:/home/tw/saas-search/indexer/document_transformer.py  
235 - spu_id = str(spu_row.get("id") or "").strip()  
236 - title = str(spu_row.get("title") or "").strip()  
237 - if not spu_id or not title:  
238 - return  
239 -  
240 - semantic_list = doc.get("enriched_attributes") or []  
241 - qanchors_obj = doc.get("qanchors") or {}  
242 -  
243 - dim_keys = [  
244 - "tags",  
245 - "target_audience",  
246 - "usage_scene",  
247 - "season",  
248 - "key_attributes",  
249 - "material",  
250 - "features",  
251 - ]  
252 -  
253 - for lang in llm_langs:  
254 - try:  
255 - rows = analyze_products(  
256 - products=[{"id": spu_id, "title": title}],  
257 - target_lang=lang,  
258 - batch_size=1,  
259 - )  
260 - except Exception as e:  
261 - logger.warning("LLM attribute fill failed for SPU %s, lang=%s: %s", spu_id, lang, e)  
262 - continue  
263 -  
264 - if not rows:  
265 - continue  
266 - row = rows[0] or {}  
267 -  
268 - # qanchors.{lang}  
269 - anchor_text = str(row.get("anchor_text") or "").strip()  
270 - if anchor_text:  
271 - qanchors_obj[lang] = anchor_text  
272 -  
273 - # 语义属性  
274 - for name in dim_keys:  
275 - raw = row.get(name)  
276 - if not raw:  
277 - continue  
278 - parts = re.split(r"[,;|/\n\t]+", str(raw))  
279 - for part in parts:  
280 - value = part.strip()  
281 - if not value:  
282 - continue  
283 - semantic_list.append(  
284 - {  
285 - "lang": lang,  
286 - "name": name,  
287 - "value": value,  
288 - }  
289 - )  
290 -  
291 - if qanchors_obj:  
292 - doc["qanchors"] = qanchors_obj  
293 - if semantic_list:  
294 - doc["enriched_attributes"] = semantic_list  
295 -```  
296 -  
297 -要点:  
298 -  
299 -- 每种语言**单独调用一次** `analyze_products`,传入同一 SPU 的原始标题;  
300 -- 将返回的 `anchor_text` 直接写入 `qanchors.{lang}`,其内部仍是逗号分隔短语,后续 suggestion builder 会再拆分;  
301 -- 对各维度字段(tags/usage_scene/...)用统一正则进行“松散拆词”,过滤空串后,以 `(lang,name,value)` 三元组追加到 nested 数组;  
302 -- 如果某个维度在该语言下为空,则跳过,不写入任何条目。  
303 -  
304 -#### 3.4 容错 & 降级策略  
305 -  
306 -- 如果:  
307 - - 没有 `title`;  
308 - - 或者 `tenant_config.index_languages` 与 `SUPPORTED_LANGS` 没有交集;  
309 - - 或 `DASHSCOPE_API_KEY` 未配置 / LLM 请求报错;  
310 -- 则 `_fill_llm_attributes` 会在日志中输出 `warning`,**不会抛异常**,索引流程继续,只是该 SPU 在这一轮不会得到 `qanchors` / `enriched_attributes`。  
311 -  
312 -这保证了整个索引服务在 LLM 不可用时表现为一个普通的“传统索引”,而不会中断。  
313 -  
314 ----  
315 -  
316 -### 4. 查询与 Suggestion 中的使用建议  
317 -  
318 -#### 4.1 主搜索(Search API)  
319 -  
320 -在 `search/query_config.py` 或构建 ES 查询时,可以:  
321 -  
322 -- 将 `qanchors.{lang}` 作为额外的 `should` 字段参与匹配,并给一个略高的权重,例如:  
323 -  
324 -```json  
325 -{  
326 - "multi_match": {  
327 - "query": "<user_query>",  
328 - "fields": [  
329 - "title.zh^3.0",  
330 - "brief.zh^1.5",  
331 - "description.zh^1.0",  
332 - "vendor.zh^1.5",  
333 - "category_path.zh^1.5",  
334 - "category_name_text.zh^1.5",  
335 - "tags^1.0",  
336 - "qanchors.zh^2.0" // 建议新增  
337 - ]  
338 - }  
339 -}  
340 -```  
341 -  
342 -- 当用户做维度过滤时(例如“只看通勤场景 + 夏季 + 棉质”),可以在 filter 中增加 nested 查询:  
343 -  
344 -```json  
345 -{  
346 - "nested": {  
347 - "path": "enriched_attributes",  
348 - "query": {  
349 - "bool": {  
350 - "must": [  
351 - { "term": { "enriched_attributes.lang": "zh" } },  
352 - { "term": { "enriched_attributes.name": "usage_scene" } },  
353 - { "term": { "enriched_attributes.value": "通勤" } }  
354 - ]  
355 - }  
356 - }  
357 - }  
358 -}  
359 -```  
360 -  
361 -多个维度可以通过多个 nested 子句组合(AND/OR 逻辑与 `specifications` 的设计类似)。  
362 -  
363 -#### 4.2 Suggestion(联想词)  
364 -  
365 -现有 `suggestion/builder.py` 已经支持从 `qanchors.{lang}` 中提取候选:  
366 -  
367 -```249:287:/home/tw/saas-search/suggestion/builder.py  
368 - # Step 1: product title/qanchors  
369 - hits = self._scan_products(tenant_id, batch_size=batch_size)  
370 - ...  
371 - title_obj = src.get("title") or {}  
372 - qanchor_obj = src.get("qanchors") or {}  
373 - ...  
374 - for lang in index_languages:  
375 - ...  
376 - q_raw = None  
377 - if isinstance(qanchor_obj, dict):  
378 - q_raw = qanchor_obj.get(lang)  
379 - for q_text in self._split_qanchors(q_raw):  
380 - text_norm = self._normalize_text(q_text)  
381 - if self._looks_noise(text_norm):  
382 - continue  
383 - key = (lang, text_norm)  
384 - c = key_to_candidate.get(key)  
385 - if c is None:  
386 - c = SuggestionCandidate(text=q_text, text_norm=text_norm, lang=lang)  
387 - key_to_candidate[key] = c  
388 - c.add_product("qanchor", spu_id=spu_id, score=product_score + 0.6)  
389 -```  
390 -  
391 -- `_split_qanchors` 使用与索引端一致的分隔符集合,确保:  
392 - - 无论 LLM 用逗号、分号还是换行分隔,只要符合约定,都能被拆成单独候选词;  
393 -- `add_product("qanchor", ...)` 会:  
394 - - 将来源标记为 `qanchor`;  
395 - - 在排序打分时,`qanchor` 命中会比纯 `title` 更有权重。  
396 -  
397 ----  
398 -  
399 -### 5. 总结与扩展方向  
400 -  
401 -1. **功能定位**:  
402 - - `qanchors.{lang}`:更好地贴近用户真实查询词,用于召回与 suggestion;  
403 - - `enriched_attributes`:以结构化形式承载 LLM 抽取的语义维度,用于 filter / facet。  
404 -2. **多语言对齐**:  
405 - - 完全复用租户级 `index_languages` 配置;  
406 - - 对每种语言单独生成锚文本与语义属性,不互相混用。  
407 -3. **默认开启 / 自动降级**:  
408 - - 索引流程始终可用;  
409 - - 当 LLM/配置异常时,只是“缺少增强特征”,不影响基础搜索能力。  
410 -4. **未来扩展**:  
411 - - 可以在 `dim_keys` 中新增维度名(如 `style`, `benefit` 等),只要在 prompt 与解析逻辑中增加对应列即可;  
412 - - 可以为 `enriched_attributes` 增加额外字段(如 `confidence`、`source`),用于更精细的控制(当前 mapping 为简单版)。  
413 -  
414 -如需在查询层面增加基于 `enriched_attributes` 的统一 DSL(类似 `specifications` 的过滤/分面规则),推荐在 `docs/搜索API对接指南-01-搜索接口.md` 或 `docs/搜索API对接指南-08-数据模型与字段速查.md` 中新增一节,并在 `search/es_query_builder.py` 里封装构造逻辑,避免前端直接拼 nested 查询。 7 +- `search_products` mapping 仍保留 `qanchors`、`enriched_attributes`、`enriched_tags`、`enriched_taxonomy_attributes` 字段,便于外部服务继续产出并写入。
  8 +- `suggestion/builder.py` 等消费侧仍会读取 ES 中已有的 `qanchors`。
  9 +- `/indexer/build-docs`、`/indexer/build-docs-from-db`、`/indexer/reindex`、`/indexer/index` 只负责基础文档构建,不再调用本地 LLM 富化。
415 10
  11 +如需这些字段,请在独立内容理解服务中生成,并由上游索引程序自行合并到最终 ES 文档。
@@ -67,7 +67,7 @@ @@ -67,7 +67,7 @@
67 67
68 - ES 文档结构 `ProductIndexDocument` 的字段细节(title/brief/description/vendor/category_xxx/tags/specifications/skus/embedding 等)。 68 - ES 文档结构 `ProductIndexDocument` 的字段细节(title/brief/description/vendor/category_xxx/tags/specifications/skus/embedding 等)。
69 - 翻译、向量等具体算法逻辑。 69 - 翻译、向量等具体算法逻辑。
70 -- qanchors/keywords 等新特征的计算。 70 +- `qanchors` 等外部内容理解字段的生成。
71 71
72 **新职责边界**: 72 **新职责边界**:
73 Java 只负责“**选出要索引的 SPU + 从 MySQL 拉取原始数据 + 调用 Python 服务**(或交给 Python 做完整索引)”。 73 Java 只负责“**选出要索引的 SPU + 从 MySQL 拉取原始数据 + 调用 Python 服务**(或交给 Python 做完整索引)”。
@@ -81,7 +81,7 @@ Java 只负责“**选出要索引的 SPU + 从 MySQL 拉取原始数据 + 调用 @@ -81,7 +81,7 @@ Java 只负责“**选出要索引的 SPU + 从 MySQL 拉取原始数据 + 调用
81 - 输入:**MySQL 基础数据**(`shoplazza_product_spu/sku/option/category/image` 等)。 81 - 输入:**MySQL 基础数据**(`shoplazza_product_spu/sku/option/category/image` 等)。
82 - 输出:**符合 `mappings/search_products.json` 的 doc 列表**,包括: 82 - 输出:**符合 `mappings/search_products.json` 的 doc 列表**,包括:
83 - 多语言文本字段:`title.*`, `brief.*`, `description.*`, `vendor.*`, `category_path.*`, `category_name_text.*`; 83 - 多语言文本字段:`title.*`, `brief.*`, `description.*`, `vendor.*`, `category_path.*`, `category_name_text.*`;
84 - - 算法特征:`title_embedding`, `image_embedding`, `qanchors.*`, `keywords.*`(未来扩展); 84 + - 算法特征:`title_embedding`, `image_embedding`;
85 - 结构化字段:`tags`, `specifications`, `skus`, `min_price`, `max_price`, `compare_at_price`, `total_inventory`, `sales` 等。 85 - 结构化字段:`tags`, `specifications`, `skus`, `min_price`, `max_price`, `compare_at_price`, `total_inventory`, `sales` 等。
86 - 附加: 86 - 附加:
87 - 翻译调用 & **Redis 缓存**(继承 Java 的 key 组织和 TTL 策略); 87 - 翻译调用 & **Redis 缓存**(继承 Java 的 key 组织和 TTL 策略);
@@ -370,14 +370,7 @@ if spu.tags: @@ -370,14 +370,7 @@ if spu.tags:
370 370
371 ### 7.2 qanchors / keywords extension 371 ### 7.2 qanchors / keywords extension
372 372
373 -- The `qanchors` field structure already exists in Java today, but is never populated;
374 -- Design suggestion:
375 - - On the Python side, based on:
376 - - title / brief / description / tags / category, etc., do **query anchor** extraction;
377 - - write `qanchors.{lang}` with a multilingual structure similar to `title/keywords`;
378 - - optional translation strategies:
379 - - call translation after generating the anchors;
380 - - or compose from translations of the original text. 373 +This capability has been migrated to a standalone content-understanding service. This repository keeps the field model and consumer-side capabilities, but no longer generates `qanchors` / `enriched_*` inside the indexer.
381 374
382 --- 375 ---
383 376
@@ -436,8 +429,6 @@ if spu.tags: @@ -436,8 +429,6 @@ if spu.tags:
436 "spu_id": "1", 429 "spu_id": "1",
437 "tenant_id": "123", 430 "tenant_id": "123",
438 "title": { "en": "...", "zh": "...", ... }, 431 "title": { "en": "...", "zh": "...", ... },
439 - "qanchors": { ... },  
440 - "keywords": { ... },  
441 "brief": { ... }, 432 "brief": { ... },
442 "description": { ... }, 433 "description": { ... },
443 "vendor": { ... }, 434 "vendor": { ... },
@@ -496,7 +487,7 @@ if spu.tags: @@ -496,7 +487,7 @@ if spu.tags:
496 - **Keep the existing Java scheduling & data-sync capabilities**, without breaking existing full/incremental jobs and MQ peak-shaving; 487 - **Keep the existing Java scheduling & data-sync capabilities**, without breaking existing full/incremental jobs and MQ peak-shaving;
497 - **Consolidate the ES document structure, multilingual logic, translation, embeddings, and other algorithm capabilities into the Python index-enrichment module**, achieving a "single owner"; 488 - **Consolidate the ES document structure, multilingual logic, translation, embeddings, and other algorithm capabilities into the Python index-enrichment module**, achieving a "single owner";
498 - **Fully inherit Java's existing translation cache strategy** (Redis key & TTL & dimensions), guaranteeing consistent behavior and performance; 489 - **Fully inherit Java's existing translation cache strategy** (Redis key & TTL & dimensions), guaranteeing consistent behavior and performance;
499 -- **Reserve a clear path for future field extensions (qanchors, more tags/features)**: only new logic and mappings on the Python side are needed, without pulling Java back in. 490 +- **Reserve a clear path for future field extensions (including onboarding external content-understanding fields)**: the field model can stay, while generation responsibility evolves independently.
500 491
501 --- 492 ---
502 493
@@ -514,6 +505,7 @@ if spu.tags: @@ -514,6 +505,7 @@ if spu.tags:
514 - **Build documents (production use)**: `POST /indexer/build-docs` 505 - **Build documents (production use)**: `POST /indexer/build-docs`
515 - Input: `tenant_id + items[ { spu, skus, options } ]` 506 - Input: `tenant_id + items[ { spu, skus, options } ]`
516 - Output: a `docs` array; each element is a complete ES doc, with no DB reads and no ES writes. 507 - Output: a `docs` array; each element is a complete ES doc, with no DB reads and no ES writes.
  508 + - Note: `qanchors` / `enriched_*` are no longer generated built in; if you need these fields, have the standalone content-understanding service generate them, then merge them in yourself.
517 509
518 - **Build documents (for testing; queries the DB internally)**: `POST /indexer/build-docs-from-db` 510 - **Build documents (for testing; queries the DB internally)**: `POST /indexer/build-docs-from-db`
519 - Input: `{"tenant_id": "...", "spu_ids": ["..."]}` 511 - Input: `{"tenant_id": "...", "spu_ids": ["..."]}`
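The build-docs docs above say that `qanchors` / `enriched_*` must now be generated by the external content-understanding service and merged by the upstream indexer. A minimal sketch of that merge step, assuming the field names from the ES mapping; the `merge_enrichment` helper itself is illustrative and not part of this repository's API:

```python
from typing import Any, Dict

# Enrichment fields owned by the external content-understanding service
# (names taken from the ES mapping; the helper below is hypothetical).
ENRICHMENT_FIELDS = (
    "qanchors",
    "enriched_tags",
    "enriched_attributes",
    "enriched_taxonomy_attributes",
)

def merge_enrichment(doc: Dict[str, Any], enrichment: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a build-docs doc with external enrichment fields merged in."""
    merged = dict(doc)  # leave the original build-docs output untouched
    for field in ENRICHMENT_FIELDS:
        value = enrichment.get(field)
        if value:  # skip empty dicts/lists so blank fields are never written
            merged[field] = value
    return merged

doc = {"spu_id": "1", "tenant_id": "123", "title": {"en": "Blue denim jacket"}}
enrichment = {"qanchors": {"en": ["denim jacket", "blue jacket"]}, "enriched_tags": {}}
merged = merge_enrichment(doc, enrichment)
```

The merged dict is what the upstream program would bulk-write to ES; the base doc from `/indexer/build-docs` stays unchanged.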
indexer/document_transformer.py
@@ -12,7 +12,6 @@ import pandas as pd @@ -12,7 +12,6 @@ import pandas as pd
12 import numpy as np 12 import numpy as np
13 import logging 13 import logging
14 from typing import Dict, Any, Optional, List 14 from typing import Dict, Any, Optional, List
15 -from indexer.product_enrich import build_index_content_fields  
16 15
17 logger = logging.getLogger(__name__) 16 logger = logging.getLogger(__name__)
18 17
@@ -113,7 +112,6 @@ class SPUDocumentTransformer: @@ -113,7 +112,6 @@ class SPUDocumentTransformer:
113 spu_row: pd.Series, 112 spu_row: pd.Series,
114 skus: pd.DataFrame, 113 skus: pd.DataFrame,
115 options: pd.DataFrame, 114 options: pd.DataFrame,
116 - fill_llm_attributes: bool = True,  
117 ) -> Optional[Dict[str, Any]]: 115 ) -> Optional[Dict[str, Any]]:
118 """ 116 """
119 将单个SPU行和其SKUs转换为ES文档。 117 将单个SPU行和其SKUs转换为ES文档。
@@ -228,85 +226,8 @@ class SPUDocumentTransformer: @@ -228,85 +226,8 @@ class SPUDocumentTransformer:
228 else: 226 else:
229 doc['update_time'] = str(update_time) 227 doc['update_time'] = str(update_time)
230 228
231 - # LLM-based anchor text and semantic attributes (enabled by default; failures are only logged)
232 - # Note: batch scenarios (build-docs / bulk / incremental) should accumulate batches at the outer layer first,
233 - # then call fill_llm_attributes_batch() to avoid calling the LLM item by item.
234 - if fill_llm_attributes:  
235 - self._fill_llm_attributes(doc, spu_row)  
236 -  
237 return doc 229 return doc
238 230
239 - def fill_llm_attributes_batch(self, docs: List[Dict[str, Any]], spu_rows: List[pd.Series]) -> None:  
240 - """  
241 - Call the LLM in batch to fill, for a batch of docs:
242 - - qanchors.{lang}  
243 - - enriched_tags.{lang}  
244 - - enriched_attributes[].value.{lang}  
245 - - enriched_taxonomy_attributes[].value.{lang}  
246 -  
247 - Design goals:
248 - - accumulate LLM calls into batches as much as possible;
249 - - at most 20 items per LLM call (analyze_products enforces the cap internally and splits batches automatically).
250 - """  
251 - if not docs or not spu_rows or len(docs) != len(spu_rows):  
252 - return  
253 -  
254 - id_to_idx: Dict[str, int] = {}  
255 - items: List[Dict[str, str]] = []  
256 - for i, row in enumerate(spu_rows):  
257 - raw_id = row.get("id")  
258 - spu_id = "" if raw_id is None else str(raw_id).strip()  
259 - title = str(row.get("title") or "").strip()  
260 - if not spu_id or not title:  
261 - continue  
262 - id_to_idx[spu_id] = i  
263 - items.append(  
264 - {  
265 - "id": spu_id,  
266 - "title": title,  
267 - "brief": str(row.get("brief") or "").strip(),  
268 - "description": str(row.get("description") or "").strip(),  
269 - "image_url": str(row.get("image_src") or "").strip(),  
270 - }  
271 - )  
272 - if not items:  
273 - return  
274 -  
275 - tenant_id = str(docs[0].get("tenant_id") or "").strip() or None  
276 - try:  
277 - # TODO: read this tenant's real industry from the database and use it to replace the current default apparel profile.
278 - results = build_index_content_fields(  
279 - items=items,  
280 - tenant_id=tenant_id,  
281 - category_taxonomy_profile="apparel",  
282 - )  
283 - except Exception as e:  
284 - logger.warning("LLM batch attribute fill failed: %s", e)  
285 - return  
286 -  
287 - for result in results:  
288 - spu_id = str(result.get("id") or "").strip()  
289 - if not spu_id:  
290 - continue  
291 - idx = id_to_idx.get(spu_id)  
292 - if idx is None:  
293 - continue  
294 - self._apply_content_enrichment(docs[idx], result)  
295 -  
296 - def _apply_content_enrichment(self, doc: Dict[str, Any], enrichment: Dict[str, Any]) -> None:  
297 - """将 product_enrich 产出的 ES-ready 内容字段写入 doc。"""  
298 - try:  
299 - if enrichment.get("qanchors"):  
300 - doc["qanchors"] = enrichment["qanchors"]  
301 - if enrichment.get("enriched_tags"):  
302 - doc["enriched_tags"] = enrichment["enriched_tags"]  
303 - if enrichment.get("enriched_attributes"):  
304 - doc["enriched_attributes"] = enrichment["enriched_attributes"]  
305 - if enrichment.get("enriched_taxonomy_attributes"):  
306 - doc["enriched_taxonomy_attributes"] = enrichment["enriched_taxonomy_attributes"]  
307 - except Exception as e:  
308 - logger.warning("Failed to apply enrichment to doc (spu_id=%s): %s", doc.get("spu_id"), e)  
309 -  
310 def _fill_text_fields( 231 def _fill_text_fields(
311 self, 232 self,
312 doc: Dict[str, Any], 233 doc: Dict[str, Any],
@@ -660,41 +581,6 @@ class SPUDocumentTransformer: @@ -660,41 +581,6 @@ class SPUDocumentTransformer:
660 else: 581 else:
661 doc['option3_values'] = [] 582 doc['option3_values'] = []
662 583
663 - def _fill_llm_attributes(self, doc: Dict[str, Any], spu_row: pd.Series) -> None:  
664 - """  
665 - Call the high-level content-understanding entry point of indexer.product_enrich to fill, for the current SPU:
666 - - qanchors.{lang}  
667 - - enriched_tags.{lang}  
668 - - enriched_attributes[].value.{lang}  
669 - """  
670 - spu_id = str(spu_row.get("id") or "").strip()  
671 - title = str(spu_row.get("title") or "").strip()  
672 - if not spu_id or not title:  
673 - return  
674 -  
675 - tenant_id = doc.get("tenant_id")  
676 - try:  
677 - # TODO: read this tenant's real industry from the database and use it to replace the current default apparel profile.
678 - results = build_index_content_fields(  
679 - items=[  
680 - {  
681 - "id": spu_id,  
682 - "title": title,  
683 - "brief": str(spu_row.get("brief") or "").strip(),  
684 - "description": str(spu_row.get("description") or "").strip(),  
685 - "image_url": str(spu_row.get("image_src") or "").strip(),  
686 - }  
687 - ],  
688 - tenant_id=str(tenant_id),  
689 - category_taxonomy_profile="apparel",  
690 - )  
691 - except Exception as e:  
692 - logger.warning("LLM attribute fill failed for SPU %s: %s", spu_id, e)  
693 - return  
694 -  
695 - if results:  
696 - self._apply_content_enrichment(doc, results[0])  
697 -  
698 def _transform_sku_row(self, sku_row: pd.Series, option_name_map: Dict[int, str] = None) -> Optional[Dict[str, Any]]: 584 def _transform_sku_row(self, sku_row: pd.Series, option_name_map: Dict[int, str] = None) -> Optional[Dict[str, Any]]:
699 """ 585 """
700 将SKU行转换为SKU对象。 586 将SKU行转换为SKU对象。
indexer/incremental_service.py
@@ -584,7 +584,6 @@ class IncrementalIndexerService: @@ -584,7 +584,6 @@ class IncrementalIndexerService:
584 transformer, encoder, enable_embedding = self._get_transformer_bundle(tenant_id) 584 transformer, encoder, enable_embedding = self._get_transformer_bundle(tenant_id)
585 585
586 # Process active SPUs in input order 586 # Process active SPUs in input order
587 - doc_spu_rows: List[pd.Series] = []  
588 for spu_id in spu_ids: 587 for spu_id in spu_ids:
589 try: 588 try:
590 spu_id_int = int(spu_id) 589 spu_id_int = int(spu_id)
@@ -603,7 +602,6 @@ class IncrementalIndexerService: @@ -603,7 +602,6 @@ class IncrementalIndexerService:
603 spu_row=spu_row, 602 spu_row=spu_row,
604 skus=skus_for_spu, 603 skus=skus_for_spu,
605 options=opts_for_spu, 604 options=opts_for_spu,
606 - fill_llm_attributes=False,  
607 ) 605 )
608 if doc is None: 606 if doc is None:
609 error_msg = "SPU transform returned None" 607 error_msg = "SPU transform returned None"
@@ -612,14 +610,6 @@ class IncrementalIndexerService: @@ -612,14 +610,6 @@ class IncrementalIndexerService:
612 continue 610 continue
613 611
614 documents.append((spu_id, doc)) 612 documents.append((spu_id, doc))
615 - doc_spu_rows.append(spu_row)  
616 -  
617 - # Batch-fill LLM fields (accumulate batches where possible, at most 20 per call; failures only warn and do not affect the main flow)
618 - try:  
619 - if documents and doc_spu_rows:  
620 - transformer.fill_llm_attributes_batch([d for _, d in documents], doc_spu_rows)  
621 - except Exception as e:  
622 - logger.warning("[IncrementalIndexing] Batch LLM fill failed: %s", e)  
623 613
624 # Generate embeddings in batch (translation logic unchanged; embeddings go through the cache) 614 # Generate embeddings in batch (translation logic unchanged; embeddings go through the cache)
625 if enable_embedding and encoder and documents: 615 if enable_embedding and encoder and documents:
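After this change the incremental path simply transforms each SPU, skips those whose transform returns None while recording the error, and keeps `(spu_id, doc)` pairs aligned for the embedding step. The collect-and-continue pattern can be sketched as (the `transform_spu` stub is hypothetical, standing in for `transform_spu_to_doc`):

```python
from typing import Any, Dict, List, Optional, Tuple

def transform_spu(spu_id: str) -> Optional[Dict[str, Any]]:
    """Hypothetical stand-in for transformer.transform_spu_to_doc."""
    if spu_id == "bad":
        return None
    return {"spu_id": spu_id}

def build_documents(spu_ids: List[str]) -> Tuple[List[Tuple[str, Dict[str, Any]]], List[str]]:
    """Keep (spu_id, doc) pairs for successful transforms; collect failures."""
    documents: List[Tuple[str, Dict[str, Any]]] = []
    errors: List[str] = []
    for spu_id in spu_ids:
        doc = transform_spu(spu_id)
        if doc is None:
            # Record the failure and keep going, as the incremental loop does.
            errors.append(f"{spu_id}: SPU transform returned None")
            continue
        documents.append((spu_id, doc))
    return documents, errors

documents, errors = build_documents(["1", "bad", "2"])
```

Because failed SPUs never enter `documents`, the later batch-embedding step sees only well-formed docs.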
indexer/product_enrich.py deleted
@@ -1,1421 +0,0 @@ @@ -1,1421 +0,0 @@
1 -#!/usr/bin/env python3  
2 -"""  
3 -Product content understanding and attribute enrichment module (product_enrich)
4 -
5 -Provides LLM-based analysis of product anchor text / semantic attributes / tags,
6 -for in-memory use by the indexer and the API (no longer responsible for CSV I/O).
7 -"""  
8 -  
9 -import os  
10 -import json  
11 -import logging  
12 -import re  
13 -import time  
14 -import hashlib  
15 -import uuid  
16 -import threading  
17 -from dataclasses import dataclass, field  
18 -from collections import OrderedDict  
19 -from datetime import datetime  
20 -from concurrent.futures import ThreadPoolExecutor  
21 -from typing import List, Dict, Tuple, Any, Optional, FrozenSet  
22 -  
23 -import redis  
24 -import requests  
25 -from pathlib import Path  
26 -  
27 -from config.loader import get_app_config  
28 -from config.tenant_config_loader import SOURCE_LANG_CODE_MAP  
29 -from indexer.product_enrich_prompts import (  
30 - SYSTEM_MESSAGE,  
31 - USER_INSTRUCTION_TEMPLATE,  
32 - LANGUAGE_MARKDOWN_TABLE_HEADERS,  
33 - SHARED_ANALYSIS_INSTRUCTION,  
34 - CATEGORY_TAXONOMY_PROFILES,  
35 -)  
36 -  
37 -# Configuration
38 -BATCH_SIZE = 20  
39 -# Max concurrent workers for enrich-content LLM batches (thread pool; only uncached batches run concurrently)
40 -_APP_CONFIG = get_app_config()  
41 -CONTENT_UNDERSTANDING_MAX_WORKERS = int(_APP_CONFIG.product_enrich.max_workers)  
42 -# China North 2 (Beijing): https://dashscope.aliyuncs.com/compatible-mode/v1
43 -# Singapore: https://dashscope-intl.aliyuncs.com/compatible-mode/v1
44 -# US (Virginia): https://dashscope-us.aliyuncs.com/compatible-mode/v1
45 -API_BASE_URL = "https://dashscope-us.aliyuncs.com/compatible-mode/v1"  
46 -MODEL_NAME = "qwen-flash"  
47 -API_KEY = os.environ.get("DASHSCOPE_API_KEY")  
48 -MAX_RETRIES = 3  
49 -RETRY_DELAY = 5 # seconds
50 -REQUEST_TIMEOUT = 180 # seconds
51 -LOGGED_SHARED_CONTEXT_CACHE_SIZE = 256  
52 -PROMPT_INPUT_MIN_ZH_CHARS = 20  
53 -PROMPT_INPUT_MAX_ZH_CHARS = 100  
54 -PROMPT_INPUT_MIN_WORDS = 16  
55 -PROMPT_INPUT_MAX_WORDS = 80  
56 -  
57 -# Log paths
58 -OUTPUT_DIR = Path("output_logs")  
59 -LOG_DIR = OUTPUT_DIR / "logs"  
60 -  
61 -# Set up standalone logging (does not affect the global indexer.log)
62 -LOG_DIR.mkdir(parents=True, exist_ok=True)  
63 -timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")  
64 -log_file = LOG_DIR / f"product_enrich_{timestamp}.log"  
65 -verbose_log_file = LOG_DIR / "product_enrich_verbose.log"  
66 -_logged_shared_context_keys: "OrderedDict[str, None]" = OrderedDict()  
67 -_logged_shared_context_lock = threading.Lock()  
68 -  
69 -_content_understanding_executor: Optional[ThreadPoolExecutor] = None  
70 -_content_understanding_executor_lock = threading.Lock()  
71 -  
72 -  
73 -def _get_content_understanding_executor() -> ThreadPoolExecutor:  
74 - """  
75 - Use a module-level singleton thread pool, so that repeated requests within the same process do not each create a pool and let concurrency run away.
76 - """  
77 - global _content_understanding_executor  
78 - with _content_understanding_executor_lock:  
79 - if _content_understanding_executor is None:  
80 - _content_understanding_executor = ThreadPoolExecutor(  
81 - max_workers=CONTENT_UNDERSTANDING_MAX_WORKERS,  
82 - thread_name_prefix="product-enrich-llm",  
83 - )  
84 - return _content_understanding_executor  
85 -  
86 -# Main logger: execution flow, batch info, etc.
87 -logger = logging.getLogger("product_enrich")  
88 -logger.setLevel(logging.INFO)  
89 -  
90 -if not logger.handlers:  
91 - formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")  
92 -  
93 - file_handler = logging.FileHandler(log_file, encoding="utf-8")  
94 - file_handler.setFormatter(formatter)  
95 -  
96 - stream_handler = logging.StreamHandler()  
97 - stream_handler.setFormatter(formatter)  
98 -  
99 - logger.addHandler(file_handler)  
100 - logger.addHandler(stream_handler)  
101 -  
102 - # Avoid propagating to the root logger, preventing writes to logs/indexer.log and other files
103 - logger.propagate = False  
104 -  
105 -# Verbose logger: dedicated to recording LLM requests and responses
106 -verbose_logger = logging.getLogger("product_enrich_verbose")  
107 -verbose_logger.setLevel(logging.INFO)  
108 -  
109 -if not verbose_logger.handlers:  
110 - verbose_formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")  
111 - verbose_file_handler = logging.FileHandler(verbose_log_file, encoding="utf-8")  
112 - verbose_file_handler.setFormatter(verbose_formatter)  
113 - verbose_logger.addHandler(verbose_file_handler)  
114 - verbose_logger.propagate = False  
115 -  
116 -logger.info("Verbose LLM logs are written to: %s", verbose_log_file)  
117 -  
118 -  
119 -# Redis cache (for anchors / semantic attributes)
120 -_REDIS_CONFIG = _APP_CONFIG.infrastructure.redis  
121 -ANCHOR_CACHE_PREFIX = _REDIS_CONFIG.anchor_cache_prefix  
122 -ANCHOR_CACHE_EXPIRE_DAYS = int(_REDIS_CONFIG.anchor_cache_expire_days)  
123 -_anchor_redis: Optional[redis.Redis] = None  
124 -  
125 -try:  
126 - _anchor_redis = redis.Redis(  
127 - host=_REDIS_CONFIG.host,  
128 - port=_REDIS_CONFIG.port,  
129 - password=_REDIS_CONFIG.password,  
130 - decode_responses=True,  
131 - socket_timeout=_REDIS_CONFIG.socket_timeout,  
132 - socket_connect_timeout=_REDIS_CONFIG.socket_connect_timeout,  
133 - retry_on_timeout=_REDIS_CONFIG.retry_on_timeout,  
134 - health_check_interval=10,  
135 - )  
136 - _anchor_redis.ping()  
137 - logger.info("Redis cache initialized for product anchors and semantic attributes")  
138 -except Exception as e:  
139 - logger.warning(f"Failed to initialize Redis for anchors cache: {e}")  
140 - _anchor_redis = None  
141 -  
142 -_missing_prompt_langs = sorted(set(SOURCE_LANG_CODE_MAP) - set(LANGUAGE_MARKDOWN_TABLE_HEADERS))  
143 -if _missing_prompt_langs:  
144 - raise RuntimeError(  
145 - f"Missing product_enrich prompt config for languages: {_missing_prompt_langs}"  
146 - )  
147 -  
148 -  
149 -# Multi-value field separators
150 -_MULTI_VALUE_FIELD_SPLIT_RE = re.compile(r"[,、,;|/\n\t]+")  
151 -# Placeholders treated as "no content" in table cells
152 -_MARKDOWN_EMPTY_CELL_LITERALS: Tuple[str, ...] = ("-", "–", "—", "none", "null", "n/a", "无")
153 -_MARKDOWN_EMPTY_CELL_TOKENS_CF: FrozenSet[str] = frozenset(  
154 - lit.casefold() for lit in _MARKDOWN_EMPTY_CELL_LITERALS  
155 -)  
156 -  
157 -def _normalize_markdown_table_cell(raw: Optional[str]) -> str:  
158 - """strip;将占位符统一视为空字符串。"""  
159 - s = str(raw or "").strip()  
160 - if not s:  
161 - return ""  
162 - if s.casefold() in _MARKDOWN_EMPTY_CELL_TOKENS_CF:  
163 - return ""  
164 - return s  
165 -_CORE_INDEX_LANGUAGES = ("zh", "en")  
166 -_DEFAULT_ENRICHMENT_SCOPES = ("generic", "category_taxonomy")  
167 -_DEFAULT_CATEGORY_TAXONOMY_PROFILE = "apparel"  
168 -_CONTENT_ANALYSIS_ATTRIBUTE_FIELD_MAP = (  
169 - ("tags", "enriched_tags"),  
170 - ("target_audience", "target_audience"),  
171 - ("usage_scene", "usage_scene"),  
172 - ("season", "season"),  
173 - ("key_attributes", "key_attributes"),  
174 - ("material", "material"),  
175 - ("features", "features"),  
176 -)  
177 -_CONTENT_ANALYSIS_RESULT_FIELDS = (  
178 - "title",  
179 - "category_path",  
180 - "tags",  
181 - "target_audience",  
182 - "usage_scene",  
183 - "season",  
184 - "key_attributes",  
185 - "material",  
186 - "features",  
187 - "anchor_text",  
188 -)  
189 -_CONTENT_ANALYSIS_MEANINGFUL_FIELDS = (  
190 - "tags",  
191 - "target_audience",  
192 - "usage_scene",  
193 - "season",  
194 - "key_attributes",  
195 - "material",  
196 - "features",  
197 - "anchor_text",  
198 -)  
199 -_CONTENT_ANALYSIS_FIELD_ALIASES = {  
200 - "tags": ("tags", "enriched_tags"),  
201 -}  
202 -_CONTENT_ANALYSIS_QUALITY_FIELDS = ("title", "category_path", "anchor_text")  
203 -  
204 -  
205 -@dataclass(frozen=True)  
206 -class AnalysisSchema:  
207 - name: str  
208 - shared_instruction: str  
209 - markdown_table_headers: Dict[str, List[str]]  
210 - result_fields: Tuple[str, ...]  
211 - meaningful_fields: Tuple[str, ...]  
212 - cache_version: str = "v1"  
213 - field_aliases: Dict[str, Tuple[str, ...]] = field(default_factory=dict)  
214 - quality_fields: Tuple[str, ...] = ()  
215 -  
216 - def get_headers(self, target_lang: str) -> Optional[List[str]]:  
217 - return self.markdown_table_headers.get(target_lang)  
218 -  
219 -  
220 -_ANALYSIS_SCHEMAS: Dict[str, AnalysisSchema] = {  
221 - "content": AnalysisSchema(  
222 - name="content",  
223 - shared_instruction=SHARED_ANALYSIS_INSTRUCTION,  
224 - markdown_table_headers=LANGUAGE_MARKDOWN_TABLE_HEADERS,  
225 - result_fields=_CONTENT_ANALYSIS_RESULT_FIELDS,  
226 - meaningful_fields=_CONTENT_ANALYSIS_MEANINGFUL_FIELDS,  
227 - cache_version="v2",  
228 - field_aliases=_CONTENT_ANALYSIS_FIELD_ALIASES,  
229 - quality_fields=_CONTENT_ANALYSIS_QUALITY_FIELDS,  
230 - ),  
231 -}  
232 -  
233 -def _build_taxonomy_profile_schema(profile: str, config: Dict[str, Any]) -> AnalysisSchema:  
234 - return AnalysisSchema(  
235 - name=f"taxonomy:{profile}",  
236 - shared_instruction=config["shared_instruction"],  
237 - markdown_table_headers=config["markdown_table_headers"],  
238 - result_fields=tuple(field["key"] for field in config["fields"]),  
239 - meaningful_fields=tuple(field["key"] for field in config["fields"]),  
240 - cache_version="v1",  
241 - )  
242 -  
243 -  
244 -_CATEGORY_TAXONOMY_PROFILE_SCHEMAS: Dict[str, AnalysisSchema] = {  
245 - profile: _build_taxonomy_profile_schema(profile, config)  
246 - for profile, config in CATEGORY_TAXONOMY_PROFILES.items()  
247 -}  
248 -  
249 -_CATEGORY_TAXONOMY_PROFILE_ATTRIBUTE_FIELD_MAPS: Dict[str, Tuple[Tuple[str, str], ...]] = {  
250 - profile: tuple((field["key"], field["label"]) for field in config["fields"])  
251 - for profile, config in CATEGORY_TAXONOMY_PROFILES.items()  
252 -}  
253 -  
254 -  
255 -def get_supported_category_taxonomy_profiles() -> Tuple[str, ...]:  
256 - return tuple(_CATEGORY_TAXONOMY_PROFILE_SCHEMAS.keys())  
257 -  
258 -  
259 -def _normalize_category_taxonomy_profile(category_taxonomy_profile: Optional[str] = None) -> str:  
260 - profile = str(category_taxonomy_profile or _DEFAULT_CATEGORY_TAXONOMY_PROFILE).strip()  
261 - if profile not in _CATEGORY_TAXONOMY_PROFILE_SCHEMAS:  
262 - supported = ", ".join(get_supported_category_taxonomy_profiles())  
263 - raise ValueError(  
264 - f"Unsupported category_taxonomy_profile: {profile}. Supported profiles: {supported}"  
265 - )  
266 - return profile  
267 -  
268 -  
269 -def _get_analysis_schema(  
270 - analysis_kind: str,  
271 - *,  
272 - category_taxonomy_profile: Optional[str] = None,  
273 -) -> AnalysisSchema:  
274 - if analysis_kind == "content":  
275 - return _ANALYSIS_SCHEMAS["content"]  
276 - if analysis_kind == "taxonomy":  
277 - profile = _normalize_category_taxonomy_profile(category_taxonomy_profile)  
278 - return _CATEGORY_TAXONOMY_PROFILE_SCHEMAS[profile]  
279 - raise ValueError(f"Unsupported analysis_kind: {analysis_kind}")  
280 -  
281 -  
282 -def _get_taxonomy_attribute_field_map(  
283 - category_taxonomy_profile: Optional[str] = None,  
284 -) -> Tuple[Tuple[str, str], ...]:  
285 - profile = _normalize_category_taxonomy_profile(category_taxonomy_profile)  
286 - return _CATEGORY_TAXONOMY_PROFILE_ATTRIBUTE_FIELD_MAPS[profile]  
287 -  
288 -  
289 -def _normalize_enrichment_scopes(  
290 - enrichment_scopes: Optional[List[str]] = None,  
291 -) -> Tuple[str, ...]:  
292 - requested = _DEFAULT_ENRICHMENT_SCOPES if not enrichment_scopes else tuple(enrichment_scopes)  
293 - normalized: List[str] = []  
294 - seen = set()  
295 - for enrichment_scope in requested:  
296 - scope = str(enrichment_scope).strip()  
297 - if scope not in {"generic", "category_taxonomy"}:  
298 - raise ValueError(f"Unsupported enrichment_scope: {scope}")  
299 - if scope in seen:  
300 - continue  
301 - seen.add(scope)  
302 - normalized.append(scope)  
303 - return tuple(normalized)  
304 -  
305 -  
306 -def split_multi_value_field(text: Optional[str]) -> List[str]:  
307 - """将 LLM/业务中的多值字符串拆成短语列表(strip 后去空)。"""  
308 - if text is None:  
309 - return []  
310 - s = str(text).strip()  
311 - if not s:  
312 - return []  
313 - return [p.strip() for p in _MULTI_VALUE_FIELD_SPLIT_RE.split(s) if p.strip()]  
314 -  
315 -  
316 -def _append_lang_phrase_map(target: Dict[str, List[str]], lang: str, raw_value: Any) -> None:  
317 - parts = split_multi_value_field(raw_value)  
318 - if not parts:  
319 - return  
320 - existing = target.get(lang) or []  
321 - merged = list(dict.fromkeys([str(x).strip() for x in existing if str(x).strip()] + parts))  
322 - if merged:  
323 - target[lang] = merged  
324 -  
325 -  
326 -def _get_or_create_named_value_entry(  
327 - target: List[Dict[str, Any]],  
328 - name: str,  
329 - *,  
330 - default_value: Optional[Dict[str, Any]] = None,  
331 -) -> Dict[str, Any]:  
332 - for item in target:  
333 - if item.get("name") == name:  
334 - value = item.get("value")  
335 - if isinstance(value, dict):  
336 - return item  
337 - break  
338 -  
339 - entry = {"name": name, "value": default_value or {}}  
340 - target.append(entry)  
341 - return entry  
342 -  
343 -  
344 -def _append_named_lang_phrase_map(  
345 - target: List[Dict[str, Any]],  
346 - name: str,  
347 - lang: str,  
348 - raw_value: Any,  
349 -) -> None:  
350 - entry = _get_or_create_named_value_entry(target, name=name, default_value={})  
351 - _append_lang_phrase_map(entry["value"], lang=lang, raw_value=raw_value)  
352 -  
353 -  
354 -def _get_product_id(product: Dict[str, Any]) -> str:  
355 - return str(product.get("id") or product.get("spu_id") or "").strip()  
356 -  
357 -  
358 -def _get_analysis_field_aliases(field_name: str, schema: AnalysisSchema) -> Tuple[str, ...]:  
359 - return schema.field_aliases.get(field_name, (field_name,))  
360 -  
361 -  
362 -def _get_analysis_field_value(row: Dict[str, Any], field_name: str, schema: AnalysisSchema) -> Any:  
363 - for alias in _get_analysis_field_aliases(field_name, schema):  
364 - if alias in row:  
365 - return row.get(alias)  
366 - return None  
367 -  
368 -  
369 -def _has_meaningful_value(value: Any) -> bool:  
370 - if value is None:  
371 - return False  
372 - if isinstance(value, str):  
373 - return bool(value.strip())  
374 - if isinstance(value, dict):  
375 - return any(_has_meaningful_value(v) for v in value.values())  
376 - if isinstance(value, list):  
377 - return any(_has_meaningful_value(v) for v in value)  
378 - return bool(value)  
379 -  
380 -  
381 -def _make_empty_analysis_result(  
382 - product: Dict[str, Any],  
383 - target_lang: str,  
384 - schema: AnalysisSchema,  
385 - error: Optional[str] = None,  
386 -) -> Dict[str, Any]:  
387 - result = {  
388 - "id": _get_product_id(product),  
389 - "lang": target_lang,  
390 - "title_input": str(product.get("title") or "").strip(),  
391 - }  
392 - for field in schema.result_fields:  
393 - result[field] = ""  
394 - if error:  
395 - result["error"] = error  
396 - return result  
397 -  
398 -  
399 -def _normalize_analysis_result(  
400 - result: Dict[str, Any],  
401 - product: Dict[str, Any],  
402 - target_lang: str,  
403 - schema: AnalysisSchema,  
404 -) -> Dict[str, Any]:  
405 - normalized = _make_empty_analysis_result(product, target_lang, schema)  
406 - if not isinstance(result, dict):  
407 - return normalized  
408 -  
409 - normalized["lang"] = str(result.get("lang") or target_lang).strip() or target_lang  
410 - normalized["title_input"] = str(  
411 - product.get("title") or result.get("title_input") or ""  
412 - ).strip()  
413 -  
414 - for field in schema.result_fields:  
415 - normalized[field] = str(_get_analysis_field_value(result, field, schema) or "").strip()  
416 -  
417 - if result.get("error"):  
418 - normalized["error"] = str(result.get("error"))  
419 - return normalized  
420 -  
421 -  
422 -def _has_meaningful_analysis_content(result: Dict[str, Any], schema: AnalysisSchema) -> bool:  
423 - return any(_has_meaningful_value(result.get(field)) for field in schema.meaningful_fields)  
424 -  
425 -  
426 -def _append_analysis_attributes(  
427 - target: List[Dict[str, Any]],  
428 - row: Dict[str, Any],  
429 - lang: str,  
430 - schema: AnalysisSchema,  
431 - field_map: Tuple[Tuple[str, str], ...],  
432 -) -> None:  
433 - for source_name, output_name in field_map:  
434 - raw = _get_analysis_field_value(row, source_name, schema)  
435 - if not raw:  
436 - continue  
437 - _append_named_lang_phrase_map(  
438 - target,  
439 - name=output_name,  
440 - lang=lang,  
441 - raw_value=raw,  
442 - )  
443 -  
444 -  
445 -def _apply_index_content_row(result: Dict[str, Any], row: Dict[str, Any], lang: str) -> None:  
446 - if not row or row.get("error"):  
447 - return  
448 -  
449 - content_schema = _get_analysis_schema("content")  
450 - anchor_text = str(_get_analysis_field_value(row, "anchor_text", content_schema) or "").strip()  
451 - if anchor_text:  
452 - _append_lang_phrase_map(result["qanchors"], lang=lang, raw_value=anchor_text)  
453 -  
454 - for source_name, output_name in _CONTENT_ANALYSIS_ATTRIBUTE_FIELD_MAP:  
455 - raw = _get_analysis_field_value(row, source_name, content_schema)  
456 - if not raw:  
457 - continue  
458 - _append_named_lang_phrase_map(  
459 - result["enriched_attributes"],  
460 - name=output_name,  
461 - lang=lang,  
462 - raw_value=raw,  
463 - )  
464 - if output_name == "enriched_tags":  
465 - _append_lang_phrase_map(result["enriched_tags"], lang=lang, raw_value=raw)  
466 -  
467 -  
468 -def _apply_index_taxonomy_row(  
469 - result: Dict[str, Any],  
470 - row: Dict[str, Any],  
471 - lang: str,  
472 - *,  
473 - category_taxonomy_profile: Optional[str] = None,  
474 -) -> None:  
475 - if not row or row.get("error"):  
476 - return  
477 -  
478 - _append_analysis_attributes(  
479 - result["enriched_taxonomy_attributes"],  
480 - row=row,  
481 - lang=lang,  
482 - schema=_get_analysis_schema(  
483 - "taxonomy",  
484 - category_taxonomy_profile=category_taxonomy_profile,  
485 - ),  
486 - field_map=_get_taxonomy_attribute_field_map(category_taxonomy_profile),  
487 - )  
488 -  
489 -  
490 -def _normalize_index_content_item(item: Dict[str, Any]) -> Dict[str, str]:  
491 - item_id = _get_product_id(item)  
492 - return {  
493 - "id": item_id,  
494 - "title": str(item.get("title") or "").strip(),  
495 - "brief": str(item.get("brief") or "").strip(),  
496 - "description": str(item.get("description") or "").strip(),  
497 - "image_url": str(item.get("image_url") or "").strip(),  
498 - }  
499 -  
500 -  
501 -def build_index_content_fields(  
502 - items: List[Dict[str, Any]],  
503 - tenant_id: Optional[str] = None,  
504 - enrichment_scopes: Optional[List[str]] = None,  
505 - category_taxonomy_profile: Optional[str] = None,  
506 -) -> List[Dict[str, Any]]:  
507 - """  
508 - High-level entry point: generate content-understanding fields aligned with the ES mapping.
509 -
510 - Each input item must contain:
511 - - `id` or `spu_id`
512 - - `title`
513 - - optional `brief` / `description` / `image_url`
514 - - optional `enrichment_scopes`; by default both `generic` and `category_taxonomy` run
515 - - optional `category_taxonomy_profile`, defaulting to `apparel`
516 -
517 - Each returned item has the structure:
518 - - `id`
519 - - `qanchors`
520 - - `enriched_tags`
521 - - `enriched_attributes`
522 - - `enriched_taxonomy_attributes`
523 - - optional `error`
524 -
525 - Where:
526 - - `qanchors.{lang}` is an array of phrases
527 - - `enriched_tags.{lang}` is an array of tags
528 - """  
529 - requested_enrichment_scopes = _normalize_enrichment_scopes(enrichment_scopes)  
530 - normalized_taxonomy_profile = _normalize_category_taxonomy_profile(category_taxonomy_profile)  
531 - normalized_items = [_normalize_index_content_item(item) for item in items]  
532 - if not normalized_items:  
533 - return []  
534 -  
535 - results_by_id: Dict[str, Dict[str, Any]] = {  
536 - item["id"]: {  
537 - "id": item["id"],  
538 - "qanchors": {},  
539 - "enriched_tags": {},  
540 - "enriched_attributes": [],  
541 - "enriched_taxonomy_attributes": [],  
542 - }  
543 - for item in normalized_items  
544 - }  
545 -  
546 - for lang in _CORE_INDEX_LANGUAGES:  
547 - if "generic" in requested_enrichment_scopes:  
548 - try:  
549 - rows = analyze_products(  
550 - products=normalized_items,  
551 - target_lang=lang,  
552 - batch_size=BATCH_SIZE,  
553 - tenant_id=tenant_id,  
554 - analysis_kind="content",  
555 - category_taxonomy_profile=normalized_taxonomy_profile,  
556 - )  
557 - except Exception as e:  
558 - logger.warning("build_index_content_fields content enrichment failed for lang=%s: %s", lang, e)  
559 - for item in normalized_items:  
560 - results_by_id[item["id"]].setdefault("error", str(e))  
561 - continue  
562 -  
563 - for row in rows or []:  
564 - item_id = str(row.get("id") or "").strip()  
565 - if not item_id or item_id not in results_by_id:  
566 - continue  
567 - if row.get("error"):  
568 - results_by_id[item_id].setdefault("error", row["error"])  
569 - continue  
570 - _apply_index_content_row(results_by_id[item_id], row=row, lang=lang)  
571 -  
572 - if "category_taxonomy" in requested_enrichment_scopes:  
573 - for lang in _CORE_INDEX_LANGUAGES:  
574 - try:  
575 - taxonomy_rows = analyze_products(  
576 - products=normalized_items,  
577 - target_lang=lang,  
578 - batch_size=BATCH_SIZE,  
579 - tenant_id=tenant_id,  
580 - analysis_kind="taxonomy",  
581 - category_taxonomy_profile=normalized_taxonomy_profile,  
582 - )  
583 - except Exception as e:  
584 - logger.warning(  
585 - "build_index_content_fields taxonomy enrichment failed for profile=%s lang=%s: %s",  
586 - normalized_taxonomy_profile,  
587 - lang,  
588 - e,  
589 - )  
590 - for item in normalized_items:  
591 - results_by_id[item["id"]].setdefault("error", str(e))  
592 - continue  
593 -  
594 - for row in taxonomy_rows or []:  
595 - item_id = str(row.get("id") or "").strip()  
596 - if not item_id or item_id not in results_by_id:  
597 - continue  
598 - if row.get("error"):  
599 - results_by_id[item_id].setdefault("error", row["error"])  
600 - continue  
601 - _apply_index_taxonomy_row(  
602 - results_by_id[item_id],  
603 - row=row,  
604 - lang=lang,  
605 - category_taxonomy_profile=normalized_taxonomy_profile,  
606 - )  
607 -  
608 - return [results_by_id[item["id"]] for item in normalized_items]  
609 -  
610 -  
611 -def _normalize_space(text: str) -> str:  
612 - return re.sub(r"\s+", " ", (text or "").strip())  
613 -  
614 -  
615 -def _contains_cjk(text: str) -> bool:  
616 - return bool(re.search(r"[\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff]", text or ""))  
617 -  
618 -  
619 -def _truncate_by_chars(text: str, max_chars: int) -> str:  
620 - return text[:max_chars].strip()  
621 -  
622 -  
623 -def _truncate_by_words(text: str, max_words: int) -> str:  
624 - words = re.findall(r"\S+", text or "")  
625 - return " ".join(words[:max_words]).strip()  
626 -  
627 -  
628 -def _detect_prompt_input_lang(text: str) -> str:  
629 - # 简化处理:包含 CJK 时按中文类文本处理,否则统一按空格分词类语言处理。  
630 - return "zh" if _contains_cjk(text) else "en"  
631 -  
632 -  
633 -def _build_prompt_input_text(product: Dict[str, Any]) -> str:  
634 - """  
635 - 生成真正送入 prompt 的商品文本。  
636 -  
637 - 规则:  
638 - - 默认使用 title  
639 - - 若文本过短,则依次补 brief / description  
640 - - 若文本过长,则按语言粗粒度截断  
641 - """  
642 - fields = [  
643 - _normalize_space(str(product.get("title") or "")),  
644 - _normalize_space(str(product.get("brief") or "")),  
645 - _normalize_space(str(product.get("description") or "")),  
646 - ]  
647 - parts: List[str] = []  
648 -  
649 - def join_parts() -> str:  
650 - return " | ".join(part for part in parts if part).strip()  
651 -  
652 - for field in fields:  
653 - if not field:  
654 - continue  
655 - if field not in parts:  
656 - parts.append(field)  
657 - candidate = join_parts()  
658 - if _detect_prompt_input_lang(candidate) == "zh":  
659 - if len(candidate) >= PROMPT_INPUT_MIN_ZH_CHARS:  
660 - return _truncate_by_chars(candidate, PROMPT_INPUT_MAX_ZH_CHARS)  
661 - else:  
662 - if len(re.findall(r"\S+", candidate)) >= PROMPT_INPUT_MIN_WORDS:  
663 - return _truncate_by_words(candidate, PROMPT_INPUT_MAX_WORDS)  
664 -  
665 - candidate = join_parts()  
666 - if not candidate:  
667 - return ""  
668 - if _detect_prompt_input_lang(candidate) == "zh":  
669 - return _truncate_by_chars(candidate, PROMPT_INPUT_MAX_ZH_CHARS)  
670 - return _truncate_by_words(candidate, PROMPT_INPUT_MAX_WORDS)  
671 -  
672 -  
673 -def _make_analysis_cache_key(  
674 - product: Dict[str, Any],  
675 - target_lang: str,  
676 - analysis_kind: str,  
677 - category_taxonomy_profile: Optional[str] = None,  
678 -) -> str:  
679 - """构造缓存 key,仅由分析类型、prompt 实际输入文本内容与目标语言决定。"""  
680 - schema = _get_analysis_schema(  
681 - analysis_kind,  
682 - category_taxonomy_profile=category_taxonomy_profile,  
683 - )  
684 - prompt_input = _build_prompt_input_text(product)  
685 - h = hashlib.md5(prompt_input.encode("utf-8")).hexdigest()  
686 - prompt_contract = {  
687 - "schema_name": schema.name,  
688 - "cache_version": schema.cache_version,  
689 - "system_message": SYSTEM_MESSAGE,  
690 - "user_instruction_template": USER_INSTRUCTION_TEMPLATE,  
691 - "shared_instruction": schema.shared_instruction,  
692 - "assistant_headers": schema.get_headers(target_lang),  
693 - "result_fields": schema.result_fields,  
694 - "meaningful_fields": schema.meaningful_fields,  
695 - "field_aliases": schema.field_aliases,  
696 - }  
697 - prompt_contract_hash = hashlib.md5(  
698 - json.dumps(prompt_contract, ensure_ascii=False, sort_keys=True).encode("utf-8")  
699 - ).hexdigest()[:12]  
700 - return (  
701 - f"{ANCHOR_CACHE_PREFIX}:{analysis_kind}:{prompt_contract_hash}:"  
702 - f"{target_lang}:{prompt_input[:4]}{h}"  
703 - )  
704 -  
705 -  
706 -def _make_anchor_cache_key(  
707 - product: Dict[str, Any],  
708 - target_lang: str,  
709 -) -> str:  
710 - return _make_analysis_cache_key(product, target_lang, analysis_kind="content")  
711 -  
712 -  
713 -def _get_cached_analysis_result(  
714 - product: Dict[str, Any],  
715 - target_lang: str,  
716 - analysis_kind: str,  
717 - category_taxonomy_profile: Optional[str] = None,  
718 -) -> Optional[Dict[str, Any]]:  
719 - if not _anchor_redis:  
720 - return None  
721 - schema = _get_analysis_schema(  
722 - analysis_kind,  
723 - category_taxonomy_profile=category_taxonomy_profile,  
724 - )  
725 - try:  
726 - key = _make_analysis_cache_key(  
727 - product,  
728 - target_lang,  
729 - analysis_kind,  
730 - category_taxonomy_profile=category_taxonomy_profile,  
731 - )  
732 - raw = _anchor_redis.get(key)  
733 - if not raw:  
734 - return None  
735 - result = _normalize_analysis_result(  
736 - json.loads(raw),  
737 - product=product,  
738 - target_lang=target_lang,  
739 - schema=schema,  
740 - )  
741 - if not _has_meaningful_analysis_content(result, schema):  
742 - return None  
743 - return result  
744 - except Exception as e:  
745 - logger.warning("Failed to get %s analysis cache: %s", analysis_kind, e)  
746 - return None  
747 -  
748 -  
749 -def _get_cached_anchor_result(  
750 - product: Dict[str, Any],  
751 - target_lang: str,  
752 -) -> Optional[Dict[str, Any]]:  
753 - return _get_cached_analysis_result(product, target_lang, analysis_kind="content")  
754 -  
755 -  
756 -def _set_cached_analysis_result(  
757 - product: Dict[str, Any],  
758 - target_lang: str,  
759 - result: Dict[str, Any],  
760 - analysis_kind: str,  
761 - category_taxonomy_profile: Optional[str] = None,  
762 -) -> None:  
763 - if not _anchor_redis:  
764 - return  
765 - schema = _get_analysis_schema(  
766 - analysis_kind,  
767 - category_taxonomy_profile=category_taxonomy_profile,  
768 - )  
769 - try:  
770 - normalized = _normalize_analysis_result(  
771 - result,  
772 - product=product,  
773 - target_lang=target_lang,  
774 - schema=schema,  
775 - )  
776 - if not _has_meaningful_analysis_content(normalized, schema):  
777 - return  
778 - key = _make_analysis_cache_key(  
779 - product,  
780 - target_lang,  
781 - analysis_kind,  
782 - category_taxonomy_profile=category_taxonomy_profile,  
783 - )  
784 - ttl = ANCHOR_CACHE_EXPIRE_DAYS * 24 * 3600  
785 - _anchor_redis.setex(key, ttl, json.dumps(normalized, ensure_ascii=False))  
786 - except Exception as e:  
787 - logger.warning("Failed to set %s analysis cache: %s", analysis_kind, e)  
788 -  
789 -  
790 -def _set_cached_anchor_result(  
791 - product: Dict[str, Any],  
792 - target_lang: str,  
793 - result: Dict[str, Any],  
794 -) -> None:  
795 - _set_cached_analysis_result(product, target_lang, result, analysis_kind="content")  
796 -  
797 -  
798 -def _build_assistant_prefix(headers: List[str]) -> str:  
799 - header_line = "| " + " | ".join(headers) + " |"  
800 - separator_line = "|" + "----|" * len(headers)  
801 - return f"{header_line}\n{separator_line}\n"  
802 -  
803 -  
804 -def _build_shared_context(products: List[Dict[str, str]], schema: AnalysisSchema) -> str:  
805 - shared_context = schema.shared_instruction  
806 - for idx, product in enumerate(products, 1):  
807 - prompt_input = _build_prompt_input_text(product)  
808 - shared_context += f"{idx}. {prompt_input}\n"  
809 - return shared_context  
810 -  
811 -  
812 -def _hash_text(text: str) -> str:  
813 - return hashlib.md5((text or "").encode("utf-8")).hexdigest()[:12]  
814 -  
815 -  
816 -def _mark_shared_context_logged_once(shared_context_key: str) -> bool:  
817 - with _logged_shared_context_lock:  
818 - if shared_context_key in _logged_shared_context_keys:  
819 - _logged_shared_context_keys.move_to_end(shared_context_key)  
820 - return False  
821 -  
822 - _logged_shared_context_keys[shared_context_key] = None  
823 - if len(_logged_shared_context_keys) > LOGGED_SHARED_CONTEXT_CACHE_SIZE:  
824 - _logged_shared_context_keys.popitem(last=False)  
825 - return True  
826 -  
827 -  
828 -def reset_logged_shared_context_keys() -> None:  
829 - """测试辅助:清理已记录的共享 prompt key。"""  
830 - with _logged_shared_context_lock:  
831 - _logged_shared_context_keys.clear()  
832 -  
833 -  
834 -def create_prompt(  
835 - products: List[Dict[str, str]],  
836 - target_lang: str = "zh",  
837 - analysis_kind: str = "content",  
838 - category_taxonomy_profile: Optional[str] = None,  
839 -) -> Tuple[Optional[str], Optional[str], Optional[str]]:  
840 - """根据目标语言创建共享上下文、本地化输出要求和 Partial Mode assistant 前缀。"""  
841 - schema = _get_analysis_schema(  
842 - analysis_kind,  
843 - category_taxonomy_profile=category_taxonomy_profile,  
844 - )  
845 - markdown_table_headers = schema.get_headers(target_lang)  
846 - if not markdown_table_headers:  
847 - logger.warning(  
848 - "Unsupported target_lang for markdown table headers: kind=%s lang=%s",  
849 - analysis_kind,  
850 - target_lang,  
851 - )  
852 - return None, None, None  
853 - shared_context = _build_shared_context(products, schema)  
854 - language_label = SOURCE_LANG_CODE_MAP.get(target_lang, target_lang)  
855 - user_prompt = USER_INSTRUCTION_TEMPLATE.format(language=language_label).strip()  
856 - assistant_prefix = _build_assistant_prefix(markdown_table_headers)  
857 - return shared_context, user_prompt, assistant_prefix  
858 -  
859 -  
860 -def _merge_partial_response(assistant_prefix: str, generated_content: str) -> str:  
861 - """将 Partial Mode 的 assistant 前缀与补全文本拼成完整 markdown。"""  
862 - generated = (generated_content or "").lstrip()  
863 - prefix_lines = [line.strip() for line in assistant_prefix.strip().splitlines()]  
864 - generated_lines = generated.splitlines()  
865 -  
866 - if generated_lines:  
867 - first_line = generated_lines[0].strip()  
868 - if prefix_lines and first_line == prefix_lines[0]:  
869 - generated_lines = generated_lines[1:]  
870 - if generated_lines and len(prefix_lines) > 1 and generated_lines[0].strip() == prefix_lines[1]:  
871 - generated_lines = generated_lines[1:]  
872 - elif len(prefix_lines) > 1 and first_line == prefix_lines[1]:  
873 - generated_lines = generated_lines[1:]  
874 -  
875 - suffix = "\n".join(generated_lines).lstrip("\n")  
876 - if suffix:  
877 - return f"{assistant_prefix}{suffix}"  
878 - return assistant_prefix  
879 -  
880 -  
881 -def call_llm(  
882 - shared_context: str,  
883 - user_prompt: str,  
884 - assistant_prefix: str,  
885 - target_lang: str = "zh",  
886 - analysis_kind: str = "content",  
887 -) -> Tuple[str, str]:  
888 - """调用大模型 API(带重试机制),使用 Partial Mode 强制 markdown 表格前缀。"""  
889 - headers = {  
890 - "Authorization": f"Bearer {API_KEY}",  
891 - "Content-Type": "application/json",  
892 - }  
893 - shared_context_key = _hash_text(shared_context)  
894 - localized_tail_key = _hash_text(f"{target_lang}\n{user_prompt}\n{assistant_prefix}")  
895 - combined_user_prompt = f"{shared_context.rstrip()}\n\n{user_prompt.strip()}"  
896 -  
897 - payload = {  
898 - "model": MODEL_NAME,  
899 - "messages": [  
900 - {  
901 - "role": "system",  
902 - "content": SYSTEM_MESSAGE,  
903 - },  
904 - {  
905 - "role": "user",  
906 - "content": combined_user_prompt,  
907 - },  
908 - {  
909 - "role": "assistant",  
910 - "content": assistant_prefix,  
911 - "partial": True,  
912 - },  
913 - ],  
914 - "temperature": 0.3,  
915 - "top_p": 0.8,  
916 - }  
917 -  
918 - request_data = {  
919 - "headers": {k: v for k, v in headers.items() if k != "Authorization"},  
920 - "payload": payload,  
921 - }  
922 -  
923 - if _mark_shared_context_logged_once(shared_context_key):  
924 - logger.info(f"\n{'=' * 80}")  
925 - logger.info(  
926 - "LLM Shared Context [model=%s, kind=%s, shared_key=%s, chars=%s] (logged once per process key)",  
927 - MODEL_NAME,  
928 - analysis_kind,  
929 - shared_context_key,  
930 - len(shared_context),  
931 - )  
932 - logger.info("\nSystem Message:\n%s", SYSTEM_MESSAGE)  
933 - logger.info("\nShared Context:\n%s", shared_context)  
934 -  
935 - verbose_logger.info(f"\n{'=' * 80}")  
936 - verbose_logger.info(  
937 - "LLM Request [model=%s, kind=%s, lang=%s, shared_key=%s, tail_key=%s]:",  
938 - MODEL_NAME,  
939 - analysis_kind,  
940 - target_lang,  
941 - shared_context_key,  
942 - localized_tail_key,  
943 - )  
944 - verbose_logger.info(json.dumps(request_data, ensure_ascii=False, indent=2))  
945 - verbose_logger.info(f"\nCombined User Prompt:\n{combined_user_prompt}")  
946 - verbose_logger.info(f"\nShared Context:\n{shared_context}")  
947 - verbose_logger.info(f"\nLocalized Requirement:\n{user_prompt}")  
948 - verbose_logger.info(f"\nAssistant Prefix:\n{assistant_prefix}")  
949 -  
950 - logger.info(  
951 - "\nLLM Request Variant [kind=%s, lang=%s, shared_key=%s, tail_key=%s, prompt_chars=%s, prefix_chars=%s]",  
952 - analysis_kind,  
953 - target_lang,  
954 - shared_context_key,  
955 - localized_tail_key,  
956 - len(user_prompt),  
957 - len(assistant_prefix),  
958 - )  
959 - logger.info("\nLocalized Requirement:\n%s", user_prompt)  
960 - logger.info("\nAssistant Prefix:\n%s", assistant_prefix)  
961 -  
962 - # 创建session,禁用代理  
963 - session = requests.Session()  
964 - session.trust_env = False # 忽略系统代理设置  
965 -  
966 - try:  
967 - # 重试机制  
968 - for attempt in range(MAX_RETRIES):  
969 - try:  
970 - response = session.post(  
971 - f"{API_BASE_URL}/chat/completions",  
972 - headers=headers,  
973 - json=payload,  
974 - timeout=REQUEST_TIMEOUT,  
975 - proxies={"http": None, "https": None}, # 明确禁用代理  
976 - )  
977 -  
978 - response.raise_for_status()  
979 - result = response.json()  
980 - usage = result.get("usage") or {}  
981 -  
982 - verbose_logger.info(  
983 - "\nLLM Response [model=%s, kind=%s, lang=%s, shared_key=%s, tail_key=%s]:",  
984 - MODEL_NAME,  
985 - analysis_kind,  
986 - target_lang,  
987 - shared_context_key,  
988 - localized_tail_key,  
989 - )  
990 - verbose_logger.info(json.dumps(result, ensure_ascii=False, indent=2))  
991 -  
992 - generated_content = result["choices"][0]["message"]["content"]  
993 - full_markdown = _merge_partial_response(assistant_prefix, generated_content)  
994 -  
995 - logger.info(  
996 - "\nLLM Response Summary [kind=%s, lang=%s, shared_key=%s, tail_key=%s, generated_chars=%s, completion_tokens=%s, prompt_tokens=%s, total_tokens=%s]",  
997 - analysis_kind,  
998 - target_lang,  
999 - shared_context_key,  
1000 - localized_tail_key,  
1001 - len(generated_content or ""),  
1002 - usage.get("completion_tokens"),  
1003 - usage.get("prompt_tokens"),  
1004 - usage.get("total_tokens"),  
1005 - )  
1006 - logger.info("\nGenerated Content:\n%s", generated_content)  
1007 - logger.info("\nMerged Markdown:\n%s", full_markdown)  
1008 -  
1009 - verbose_logger.info(f"\nGenerated Content:\n{generated_content}")  
1010 - verbose_logger.info(f"\nMerged Markdown:\n{full_markdown}")  
1011 -  
1012 - return full_markdown, json.dumps(result, ensure_ascii=False)  
1013 -  
1014 - except requests.exceptions.ProxyError as e:  
1015 - logger.warning(f"Attempt {attempt + 1}/{MAX_RETRIES}: Proxy error - {str(e)}")  
1016 - if attempt < MAX_RETRIES - 1:  
1017 - logger.info(f"Retrying in {RETRY_DELAY} seconds...")  
1018 - time.sleep(RETRY_DELAY)  
1019 - else:  
1020 - raise  
1021 -  
1022 - except requests.exceptions.RequestException as e:  
1023 - logger.warning(f"Attempt {attempt + 1}/{MAX_RETRIES}: Request error - {str(e)}")  
1024 - if attempt < MAX_RETRIES - 1:  
1025 - logger.info(f"Retrying in {RETRY_DELAY} seconds...")  
1026 - time.sleep(RETRY_DELAY)  
1027 - else:  
1028 - raise  
1029 -  
1030 - except Exception as e:  
1031 - logger.error(f"Unexpected error on attempt {attempt + 1}/{MAX_RETRIES}: {str(e)}")  
1032 - if attempt < MAX_RETRIES - 1:  
1033 - logger.info(f"Retrying in {RETRY_DELAY} seconds...")  
1034 - time.sleep(RETRY_DELAY)  
1035 - else:  
1036 - raise  
1037 -  
1038 - finally:  
1039 - session.close()  
1040 -  
1041 -  
1042 -def parse_markdown_table(  
1043 - markdown_content: str,  
1044 - analysis_kind: str = "content",  
1045 - category_taxonomy_profile: Optional[str] = None,  
1046 -) -> List[Dict[str, str]]:  
1047 - """解析markdown表格内容"""  
1048 - schema = _get_analysis_schema(  
1049 - analysis_kind,  
1050 - category_taxonomy_profile=category_taxonomy_profile,  
1051 - )  
1052 - lines = markdown_content.strip().split("\n")  
1053 - data = []  
1054 - data_started = False  
1055 -  
1056 - for line in lines:  
1057 - line = line.strip()  
1058 - if not line:  
1059 - continue  
1060 -  
1061 - # 表格行处理  
1062 - if line.startswith("|"):  
1063 - # 分隔行(---- 或 :---: 等;允许空格,如 "| ---- | ---- |")  
1064 - sep_chars = line.replace("|", "").strip().replace(" ", "")  
1065 - if sep_chars and set(sep_chars) <= {"-", ":"}:  
1066 - data_started = True  
1067 - continue  
1068 -  
1069 - # 首个表头行:无论语言如何,统一跳过  
1070 - if not data_started:  
1071 - # 等待下一行数据行  
1072 - continue  
1073 -  
1074 - # 解析数据行  
1075 - parts = [p.strip() for p in line.split("|")]  
1076 - if parts and parts[0] == "":  
1077 - parts = parts[1:]  
1078 - if parts and parts[-1] == "":  
1079 - parts = parts[:-1]  
1080 -  
1081 - if len(parts) >= 2:  
1082 - row = {"seq_no": parts[0]}  
1083 - for field_index, field_name in enumerate(schema.result_fields, start=1):  
1084 - cell = parts[field_index] if len(parts) > field_index else ""  
1085 - row[field_name] = _normalize_markdown_table_cell(cell)  
1086 - data.append(row)  
1087 -  
1088 - return data  
1089 -  
1090 -  
1091 -def _log_parsed_result_quality(  
1092 - batch_data: List[Dict[str, str]],  
1093 - parsed_results: List[Dict[str, str]],  
1094 - target_lang: str,  
1095 - batch_num: int,  
1096 - analysis_kind: str,  
1097 - category_taxonomy_profile: Optional[str] = None,  
1098 -) -> None:  
1099 - schema = _get_analysis_schema(  
1100 - analysis_kind,  
1101 - category_taxonomy_profile=category_taxonomy_profile,  
1102 - )  
1103 - expected = len(batch_data)  
1104 - actual = len(parsed_results)  
1105 - if actual != expected:  
1106 - logger.warning(  
1107 - "Parsed row count mismatch for kind=%s batch=%s lang=%s: expected=%s actual=%s",  
1108 - analysis_kind,  
1109 - batch_num,  
1110 - target_lang,  
1111 - expected,  
1112 - actual,  
1113 - )  
1114 -  
1115 - if not schema.quality_fields:  
1116 - logger.info(  
1117 - "Parsed Quality Summary [kind=%s, batch=%s, lang=%s]: rows=%s/%s",  
1118 - analysis_kind,  
1119 - batch_num,  
1120 - target_lang,  
1121 - actual,  
1122 - expected,  
1123 - )  
1124 - return  
1125 -  
1126 - missing_summary = ", ".join(  
1127 - f"missing_{field}="  
1128 - f"{sum(1 for item in parsed_results if not str(item.get(field) or '').strip())}"  
1129 - for field in schema.quality_fields  
1130 - )  
1131 - logger.info(  
1132 - "Parsed Quality Summary [kind=%s, batch=%s, lang=%s]: rows=%s/%s, %s",  
1133 - analysis_kind,  
1134 - batch_num,  
1135 - target_lang,  
1136 - actual,  
1137 - expected,  
1138 - missing_summary,  
1139 - )  
1140 -  
1141 -  
1142 -def process_batch(  
1143 - batch_data: List[Dict[str, str]],  
1144 - batch_num: int,  
1145 - target_lang: str = "zh",  
1146 - analysis_kind: str = "content",  
1147 - category_taxonomy_profile: Optional[str] = None,  
1148 -) -> List[Dict[str, Any]]:  
1149 - """处理一个批次的数据"""  
1150 - schema = _get_analysis_schema(  
1151 - analysis_kind,  
1152 - category_taxonomy_profile=category_taxonomy_profile,  
1153 - )  
1154 - logger.info(f"\n{'#' * 80}")  
1155 - logger.info(  
1156 - "Processing Batch %s (%s items, kind=%s)",  
1157 - batch_num,  
1158 - len(batch_data),  
1159 - analysis_kind,  
1160 - )  
1161 -  
1162 - # 创建提示词  
1163 - shared_context, user_prompt, assistant_prefix = create_prompt(  
1164 - batch_data,  
1165 - target_lang=target_lang,  
1166 - analysis_kind=analysis_kind,  
1167 - category_taxonomy_profile=category_taxonomy_profile,  
1168 - )  
1169 -  
1170 - # 如果提示词创建失败(例如不支持的 target_lang),本次批次整体失败,不再继续调用 LLM  
1171 - if shared_context is None or user_prompt is None or assistant_prefix is None:  
1172 - logger.error(  
1173 - "Failed to create prompt for batch %s, kind=%s, target_lang=%s; "  
1174 - "marking entire batch as failed without calling LLM",  
1175 - batch_num,  
1176 - analysis_kind,  
1177 - target_lang,  
1178 - )  
1179 - return [  
1180 - _make_empty_analysis_result(  
1181 - item,  
1182 - target_lang,  
1183 - schema,  
1184 - error=f"prompt_creation_failed: unsupported target_lang={target_lang}",  
1185 - )  
1186 - for item in batch_data  
1187 - ]  
1188 -  
1189 - # 调用LLM  
1190 - try:  
1191 - raw_response, full_response_json = call_llm(  
1192 - shared_context,  
1193 - user_prompt,  
1194 - assistant_prefix,  
1195 - target_lang=target_lang,  
1196 - analysis_kind=analysis_kind,  
1197 - )  
1198 -  
1199 - # 解析结果  
1200 - parsed_results = parse_markdown_table(  
1201 - raw_response,  
1202 - analysis_kind=analysis_kind,  
1203 - category_taxonomy_profile=category_taxonomy_profile,  
1204 - )  
1205 - _log_parsed_result_quality(  
1206 - batch_data,  
1207 - parsed_results,  
1208 - target_lang,  
1209 - batch_num,  
1210 - analysis_kind,  
1211 - category_taxonomy_profile,  
1212 - )  
1213 -  
1214 - logger.info(f"\nParsed Results ({len(parsed_results)} items):")  
1215 - logger.info(json.dumps(parsed_results, ensure_ascii=False, indent=2))  
1216 -  
1217 - # 映射回原始ID  
1218 - results_with_ids = []  
1219 - for i, parsed_item in enumerate(parsed_results):  
1220 - if i < len(batch_data):  
1221 - source_product = batch_data[i]  
1222 - result = _normalize_analysis_result(  
1223 - parsed_item,  
1224 - product=source_product,  
1225 - target_lang=target_lang,  
1226 - schema=schema,  
1227 - )  
1228 - results_with_ids.append(result)  
1229 - logger.info(  
1230 - "Mapped: kind=%s seq=%s -> original_id=%s",  
1231 - analysis_kind,  
1232 - parsed_item.get("seq_no"),  
1233 - source_product.get("id"),  
1234 - )  
1235 -  
1236 - # 保存批次 JSON 日志到独立文件  
1237 - batch_log = {  
1238 - "batch_num": batch_num,  
1239 - "analysis_kind": analysis_kind,  
1240 - "timestamp": datetime.now().isoformat(),  
1241 - "input_products": batch_data,  
1242 - "raw_response": raw_response,  
1243 - "full_response_json": full_response_json,  
1244 - "parsed_results": parsed_results,  
1245 - "final_results": results_with_ids,  
1246 - }  
1247 -  
1248 - # 并发写 batch json 日志时,保证文件名唯一避免覆盖  
1249 - batch_call_id = uuid.uuid4().hex[:12]  
1250 - batch_log_file = (  
1251 - LOG_DIR  
1252 - / f"batch_{analysis_kind}_{batch_num:04d}_{timestamp}_{batch_call_id}.json"  
1253 - )  
1254 - with open(batch_log_file, "w", encoding="utf-8") as f:  
1255 - json.dump(batch_log, f, ensure_ascii=False, indent=2)  
1256 -  
1257 - logger.info(f"Batch log saved to: {batch_log_file}")  
1258 -  
1259 - return results_with_ids  
1260 -  
1261 - except Exception as e:  
1262 - logger.error(f"Error processing batch {batch_num}: {str(e)}", exc_info=True)  
1263 - # 返回空结果,保持ID映射  
1264 - return [  
1265 - _make_empty_analysis_result(item, target_lang, schema, error=str(e))  
1266 - for item in batch_data  
1267 - ]  
1268 -  
1269 -  
1270 -def analyze_products(  
1271 - products: List[Dict[str, str]],  
1272 - target_lang: str = "zh",  
1273 - batch_size: Optional[int] = None,  
1274 - tenant_id: Optional[str] = None,  
1275 - analysis_kind: str = "content",  
1276 - category_taxonomy_profile: Optional[str] = None,  
1277 -) -> List[Dict[str, Any]]:  
1278 - """  
1279 - 库调用入口:根据输入+语言,返回锚文本及各维度信息。  
1280 -  
1281 - Args:  
1282 - products: [{"id": "...", "title": "..."}]  
1283 - target_lang: 输出语言  
1284 - batch_size: 批大小,默认使用全局 BATCH_SIZE  
1285 - """  
1286 - if not API_KEY:  
1287 - raise RuntimeError("DASHSCOPE_API_KEY is not set, cannot call LLM")  
1288 -  
1289 - if not products:  
1290 - return []  
1291 -  
1292 - _get_analysis_schema(  
1293 - analysis_kind,  
1294 - category_taxonomy_profile=category_taxonomy_profile,  
1295 - )  
1296 - results_by_index: List[Optional[Dict[str, Any]]] = [None] * len(products)  
1297 - uncached_items: List[Tuple[int, Dict[str, str]]] = []  
1298 -  
1299 - for idx, product in enumerate(products):  
1300 - title = str(product.get("title") or "").strip()  
1301 - if not title:  
1302 - uncached_items.append((idx, product))  
1303 - continue  
1304 -  
1305 - cached = _get_cached_analysis_result(  
1306 - product,  
1307 - target_lang,  
1308 - analysis_kind,  
1309 - category_taxonomy_profile=category_taxonomy_profile,  
1310 - )  
1311 - if cached:  
1312 - logger.info(  
1313 - f"[analyze_products] Cache hit for title='{title[:50]}...', "  
1314 - f"kind={analysis_kind}, lang={target_lang}"  
1315 - )  
1316 - results_by_index[idx] = cached  
1317 - continue  
1318 -  
1319 - uncached_items.append((idx, product))  
1320 -  
1321 - if not uncached_items:  
1322 - return [item for item in results_by_index if item is not None]  
1323 -  
1324 - # call_llm 一次处理上限固定为 BATCH_SIZE(默认 20):  
1325 - # - 尽可能攒批处理;  
1326 - # - 即便调用方传入更大的 batch_size,也会自动按上限拆批。  
1327 - req_bs = BATCH_SIZE if batch_size is None else int(batch_size)  
1328 - bs = max(1, min(req_bs, BATCH_SIZE))  
1329 - total_batches = (len(uncached_items) + bs - 1) // bs  
1330 -  
1331 - batch_jobs: List[Tuple[int, List[Tuple[int, Dict[str, str]]], List[Dict[str, str]]]] = []  
1332 - for i in range(0, len(uncached_items), bs):  
1333 - batch_num = i // bs + 1  
1334 - batch_slice = uncached_items[i : i + bs]  
1335 - batch = [item for _, item in batch_slice]  
1336 - batch_jobs.append((batch_num, batch_slice, batch))  
1337 -  
1338 - # 只有一个批次时走串行,减少线程池创建开销与日志/日志文件的不可控交织  
1339 - if total_batches <= 1 or CONTENT_UNDERSTANDING_MAX_WORKERS <= 1:  
1340 - for batch_num, batch_slice, batch in batch_jobs:  
1341 - logger.info(  
1342 - f"[analyze_products] Processing batch {batch_num}/{total_batches}, "  
1343 - f"size={len(batch)}, kind={analysis_kind}, target_lang={target_lang}"  
1344 - )  
1345 - batch_results = process_batch(  
1346 - batch,  
1347 - batch_num=batch_num,  
1348 - target_lang=target_lang,  
1349 - analysis_kind=analysis_kind,  
1350 - category_taxonomy_profile=category_taxonomy_profile,  
1351 - )  
1352 -  
1353 - for (original_idx, product), item in zip(batch_slice, batch_results):  
1354 - results_by_index[original_idx] = item  
1355 - title_input = str(item.get("title_input") or "").strip()  
1356 - if not title_input:  
1357 - continue  
1358 - if item.get("error"):  
1359 - # 不缓存错误结果,避免放大临时故障  
1360 - continue  
1361 - try:  
1362 - _set_cached_analysis_result(  
1363 - product,  
1364 - target_lang,  
1365 - item,  
1366 - analysis_kind,  
1367 - category_taxonomy_profile=category_taxonomy_profile,  
1368 - )  
1369 - except Exception:  
1370 - # 已在内部记录 warning  
1371 - pass  
1372 - else:  
1373 - max_workers = min(CONTENT_UNDERSTANDING_MAX_WORKERS, len(batch_jobs))  
1374 - logger.info(  
1375 - "[analyze_products] Using ThreadPoolExecutor for uncached batches: "  
1376 - "max_workers=%s, total_batches=%s, bs=%s, kind=%s, target_lang=%s",  
1377 - max_workers,  
1378 - total_batches,  
1379 - bs,  
1380 - analysis_kind,  
1381 - target_lang,  
1382 - )  
1383 -  
1384 - # 只把“LLM 调用 + markdown 解析”放到线程里;Redis get/set 保持在主线程,避免并发写入带来额外风险。  
1385 - # 注意:线程池是模块级单例,因此这里的 max_workers 主要用于日志语义(实际并发受单例池上限约束)。  
1386 - executor = _get_content_understanding_executor()  
1387 - future_by_batch_num: Dict[int, Any] = {}  
1388 - for batch_num, _batch_slice, batch in batch_jobs:  
1389 - future_by_batch_num[batch_num] = executor.submit(  
1390 - process_batch,  
1391 - batch,  
1392 - batch_num=batch_num,  
1393 - target_lang=target_lang,  
1394 - analysis_kind=analysis_kind,  
1395 - category_taxonomy_profile=category_taxonomy_profile,  
1396 - )  
1397 -  
1398 - # 按 batch_num 回填,确保输出稳定(results_by_index 是按原始 input index 映射的)  
1399 - for batch_num, batch_slice, _batch in batch_jobs:  
1400 - batch_results = future_by_batch_num[batch_num].result()  
1401 - for (original_idx, product), item in zip(batch_slice, batch_results):  
1402 - results_by_index[original_idx] = item  
1403 - title_input = str(item.get("title_input") or "").strip()  
1404 - if not title_input:  
1405 - continue  
1406 - if item.get("error"):  
1407 - # 不缓存错误结果,避免放大临时故障  
1408 - continue  
1409 - try:  
1410 - _set_cached_analysis_result(  
1411 - product,  
1412 - target_lang,  
1413 - item,  
1414 - analysis_kind,  
1415 - category_taxonomy_profile=category_taxonomy_profile,  
1416 - )  
1417 - except Exception:  
1418 - # 已在内部记录 warning  
1419 - pass  
1420 -  
1421 - return [item for item in results_by_index if item is not None]  
indexer/product_enrich_prompts.py deleted
@@ -1,849 +0,0 @@ @@ -1,849 +0,0 @@
1 -#!/usr/bin/env python3  
2 -  
3 -from typing import Any, Dict, Tuple  
4 -  
5 -SYSTEM_MESSAGE = (  
6 - "You are an e-commerce product annotator. "  
7 - "Continue the provided assistant Markdown table prefix. "  
8 - "Do not repeat or modify the prefix, and do not add explanations outside the table."  
9 -)  
10 -  
11 -SHARED_ANALYSIS_INSTRUCTION = """Analyze each input product text and fill these columns:  
12 -  
13 -1. Product title: a natural, localized product name based on the input text  
14 -2. Category path: a concise category hierarchy from broad to specific, separated by ">"  
15 -3. Fine-grained tags: concise tags for style, features, design details, function, or standout selling points  
16 -4. Target audience: gender, age group, body type, or suitable users when clearly implied  
17 -5. Usage scene: likely occasions, settings, or use cases  
18 -6. Applicable season: relevant season(s) based on the product text  
19 -7. Key attributes: core product attributes and specifications. Depending on the item type, this may include fit, silhouette, length, sleeve type, neckline, waistline, closure, pattern, design details, structure, or other relevant attribute dimensions  
20 -8. Material description: material, fabric, texture, or construction description  
21 -9. Functional features: practical or performance-related functions such as stretch, breathability, warmth, support, storage, protection, or ease of wear  
22 -10. Anchor text: a search-oriented keyword string covering product type, category intent, attributes, design cues, usage scenarios, and strong shopping phrases  
23 -  
24 -Rules:  
25 -- Keep the input order and row count exactly the same.  
26 -- Infer only from the provided input product text; if uncertain, prefer concise and broadly correct ecommerce wording.  
27 -- Keep category paths concise and use ">" as the separator.  
28 -- For columns with multiple values, the localized output requirement will define the delimiter.  
29 -  
30 -Input product list:  
31 -"""  
32 -  
33 -USER_INSTRUCTION_TEMPLATE = """Please strictly return a Markdown table following the given columns in the specified language. For any column containing multiple values, separate them with commas. Do not add any other explanation.  
34 -Language: {language}"""  
35 -  
36 -def _taxonomy_field(  
37 - key: str,  
38 - label: str,  
39 - description: str,  
40 - zh_label: str | None = None,  
41 -) -> Dict[str, str]:  
42 - return {  
43 - "key": key,  
44 - "label": label,  
45 - "description": description,  
46 - "zh_label": zh_label or label,  
47 - }  
48 -  
49 -  
50 -def _build_taxonomy_shared_instruction(profile_label: str, fields: Tuple[Dict[str, str], ...]) -> str:  
51 - lines = [  
52 - f"Analyze each input product text and fill the columns below using a {profile_label} attribute taxonomy.",  
53 - "",  
54 - "Output columns:",  
55 - ]  
56 - for idx, field in enumerate(fields, start=1):  
57 - lines.append(f"{idx}. {field['label']}: {field['description']}")  
58 - lines.extend(  
59 - [  
60 - "",  
61 - "Rules:",  
62 - "- Keep the same row order and row count as input.",  
63 - "- Leave blank if not applicable, unmentioned, or unsupported.",  
64 - "- Use concise, standardized ecommerce wording.",  
65 - "- If multiple values, separate with commas.",  
66 - "",  
67 - "Input product list:",  
68 - ]  
69 - )  
70 - return "\n".join(lines)  
71 -  
72 -  
73 -def _make_taxonomy_profile(  
74 - profile_label: str,  
75 - fields: Tuple[Dict[str, str], ...],  
76 -) -> Dict[str, Any]:  
77 - headers = {  
78 - "en": ["No.", *[field["label"] for field in fields]],  
79 - "zh": ["序号", *[field["zh_label"] for field in fields]],  
80 - }  
81 - return {  
82 - "profile_label": profile_label,  
83 - "fields": fields,  
84 - "shared_instruction": _build_taxonomy_shared_instruction(profile_label, fields),  
85 - "markdown_table_headers": headers,  
86 - }  
87 -  
88 -  
89 -APPAREL_TAXONOMY_FIELDS = (  
90 - _taxonomy_field("product_type", "Product Type", "concise ecommerce apparel category label, not a full marketing title", "品类"),  
91 - _taxonomy_field("target_gender", "Target Gender", "intended gender only if clearly implied", "目标性别"),  
92 - _taxonomy_field("age_group", "Age Group", "only if clearly implied, e.g. adults, kids, teens, toddlers, babies", "年龄段"),  
93 - _taxonomy_field("season", "Season", "season(s) or all-season suitability only if supported", "适用季节"),  
94 - _taxonomy_field("fit", "Fit", "body closeness, e.g. slim, regular, relaxed, oversized, fitted", "版型"),  
95 - _taxonomy_field("silhouette", "Silhouette", "overall garment shape, e.g. straight, A-line, boxy, tapered, bodycon, wide-leg", "廓形"),  
96 - _taxonomy_field("neckline", "Neckline", "neckline type when applicable, e.g. crew neck, V-neck, hooded, collared, square neck", "领型"),  
97 - _taxonomy_field("sleeve_length_type", "Sleeve Length Type", "sleeve length only, e.g. sleeveless, short sleeve, long sleeve, three-quarter sleeve", "袖长类型"),  
98 - _taxonomy_field("sleeve_style", "Sleeve Style", "sleeve design only, e.g. puff sleeve, raglan sleeve, batwing sleeve, bell sleeve", "袖型"),  
99 - _taxonomy_field("strap_type", "Strap Type", "strap design when applicable, e.g. spaghetti strap, wide strap, halter strap, adjustable strap", "肩带设计"),  
100 - _taxonomy_field("rise_waistline", "Rise / Waistline", "waist placement when applicable, e.g. high rise, mid rise, low rise, empire waist", "腰型"),  
101 - _taxonomy_field("leg_shape", "Leg Shape", "for bottoms only, e.g. straight leg, wide leg, flare leg, tapered leg, skinny leg", "裤型"),  
102 - _taxonomy_field("skirt_shape", "Skirt Shape", "for skirts only, e.g. A-line, pleated, pencil, mermaid", "裙型"),  
103 - _taxonomy_field("length_type", "Length Type", "design length only, not size, e.g. cropped, regular, longline, mini, midi, maxi, ankle length, full length", "长度类型"),  
104 - _taxonomy_field("closure_type", "Closure Type", "fastening method when applicable, e.g. zipper, button, drawstring, elastic waist, hook-and-loop", "闭合方式"),  
105 - _taxonomy_field("design_details", "Design Details", "construction or visual details, e.g. ruched, ruffled, pleated, cut-out, layered, distressed, split hem", "设计细节"),  
106 - _taxonomy_field("fabric", "Fabric", "fabric type only, e.g. denim, knit, chiffon, jersey, fleece, cotton twill", "面料"),  
107 - _taxonomy_field("material_composition", "Material Composition", "fiber content or blend only if stated, e.g. cotton, polyester, spandex, linen blend, 95% cotton 5% elastane", "成分"),  
108 - _taxonomy_field("fabric_properties", "Fabric Properties", "inherent fabric traits, e.g. stretch, breathable, lightweight, soft-touch, water-resistant", "面料特性"),  
109 - _taxonomy_field("clothing_features", "Clothing Features", "product features, e.g. lined, reversible, hooded, packable, padded, pocketed", "服装特征"),  
110 - _taxonomy_field("functional_benefits", "Functional Benefits", "wearer benefits, e.g. moisture-wicking, thermal insulation, UV protection, easy care, supportive compression", "功能"),  
111 - _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),  
112 - _taxonomy_field("color_family", "Color Family", "normalized broad retail color group, e.g. black, white, blue, green, red, pink, beige, brown, gray", "色系"),  
113 - _taxonomy_field("print_pattern", "Print / Pattern", "surface pattern when applicable, e.g. solid, striped, plaid, floral, graphic, animal print", "印花 / 图案"),  
114 - _taxonomy_field("occasion_end_use", "Occasion / End Use", "likely use occasion only if supported, e.g. office, casual wear, streetwear, lounge, workout, outdoor", "适用场景"),  
115 - _taxonomy_field("style_aesthetic", "Style Aesthetic", "overall style only if supported, e.g. minimalist, streetwear, athleisure, smart casual, romantic, playful", "风格"),  
116 -)  
117 -  
118 -THREE_C_TAXONOMY_FIELDS = (  
119 - _taxonomy_field("product_type", "Product Type", "concise 3C accessory or peripheral category label", "品类"),  
120 - _taxonomy_field("compatible_device", "Compatible Device / Model", "supported device family, series, model, or form factor when clearly stated", "适配设备 / 型号"),  
121 - _taxonomy_field("connectivity", "Connectivity", "connection method such as wired, wireless, Bluetooth, Wi-Fi, NFC, or 2.4G", "连接方式"),  
122 - _taxonomy_field("interface_port_type", "Interface / Port Type", "relevant connector or port, e.g. USB-C, Lightning, HDMI, AUX, RJ45", "接口 / 端口类型"),  
123 - _taxonomy_field("power_charging", "Power Source / Charging", "charging or power mode, e.g. battery powered, fast charging, rechargeable, plug-in", "供电 / 充电方式"),  
124 - _taxonomy_field("key_features", "Key Features", "primary hardware features such as noise cancelling, foldable, magnetic, backlit, waterproof", "关键特征"),  
125 - _taxonomy_field("material_finish", "Material / Finish", "main material or exterior finish when supported", "材质 / 表面处理"),  
126 - _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),  
127 - _taxonomy_field("pack_size", "Pack Size", "unit count or bundle size when stated", "包装规格"),  
128 - _taxonomy_field("use_case", "Use Case", "intended usage such as travel, office, gaming, car, charging, streaming", "使用场景"),  
129 -)  
130 -  
131 -BAGS_TAXONOMY_FIELDS = (  
132 - _taxonomy_field("product_type", "Product Type", "concise bag category such as backpack, tote bag, crossbody bag, luggage, or wallet", "品类"),  
133 - _taxonomy_field("target_gender", "Target Gender", "intended gender only if clearly implied", "目标性别"),  
134 - _taxonomy_field("carry_style", "Carry Style", "how the bag is worn or carried, e.g. handheld, shoulder, crossbody, backpack", "携带方式"),  
135 - _taxonomy_field("size_capacity", "Size / Capacity", "size tier or capacity when supported, e.g. mini, large capacity, 20L", "尺寸 / 容量"),  
136 - _taxonomy_field("material", "Material", "main bag material such as leather, nylon, canvas, PU, straw", "材质"),  
137 - _taxonomy_field("closure_type", "Closure Type", "bag closure such as zipper, flap, buckle, drawstring, magnetic snap", "闭合方式"),  
138 - _taxonomy_field("structure_compartments", "Structure / Compartments", "organizational structure such as multi-pocket, laptop sleeve, card slots, expandable", "结构 / 分层"),  
139 - _taxonomy_field("strap_handle_type", "Strap / Handle Type", "strap or handle design such as chain strap, top handle, adjustable strap", "肩带 / 提手类型"),  
140 - _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),  
141 - _taxonomy_field("occasion_end_use", "Occasion / End Use", "likely use such as commute, travel, evening, school, casual", "适用场景"),  
142 -)  
143 -  
144 -PET_SUPPLIES_TAXONOMY_FIELDS = (  
145 - _taxonomy_field("product_type", "Product Type", "concise pet supplies category label", "品类"),  
146 - _taxonomy_field("pet_type", "Pet Type", "target pet such as dog, cat, bird, fish, hamster", "宠物类型"),  
147 - _taxonomy_field("breed_size", "Breed Size", "pet size or breed size when stated, e.g. small breed, large dogs", "体型 / 品种大小"),  
148 - _taxonomy_field("life_stage", "Life Stage", "pet age stage when supported, e.g. puppy, kitten, adult, senior", "成长阶段"),  
149 - _taxonomy_field("material_ingredients", "Material / Ingredients", "main material or ingredient composition when supported", "材质 / 成分"),  
150 - _taxonomy_field("flavor_scent", "Flavor / Scent", "flavor or scent when applicable", "口味 / 气味"),  
151 - _taxonomy_field("key_features", "Key Features", "primary attributes such as interactive, leak-proof, orthopedic, washable, elevated", "关键特征"),  
152 - _taxonomy_field("functional_benefits", "Functional Benefits", "benefits such as dental care, calming, digestion support, joint support", "功能"),  
153 - _taxonomy_field("size_capacity", "Size / Capacity", "size, count, or net content when stated", "尺寸 / 容量"),  
154 - _taxonomy_field("use_scenario", "Use Scenario", "usage such as feeding, training, grooming, travel, indoor play", "使用场景"),  
155 -)  
156 -  
157 -ELECTRONICS_TAXONOMY_FIELDS = (  
158 - _taxonomy_field("product_type", "Product Type", "concise electronics device or component category label", "品类"),  
159 - _taxonomy_field("device_category", "Device Category / Compatibility", "supported platform, component class, or compatible device family when stated", "设备类别 / 兼容性"),  
160 - _taxonomy_field("power_voltage", "Power / Voltage", "power, voltage, wattage, or battery spec when supported", "功率 / 电压"),  
161 - _taxonomy_field("connectivity", "Connectivity", "connection method such as wired, Bluetooth, Wi-Fi, RF, or smart app control", "连接方式"),  
162 - _taxonomy_field("interface_port_type", "Interface / Port Type", "relevant port or interface such as USB-C, AC plug type, HDMI, SATA", "接口 / 端口类型"),  
163 - _taxonomy_field("capacity_storage", "Capacity / Storage", "capacity or storage spec such as 256GB, 2TB, 5000mAh", "容量 / 存储"),  
164 - _taxonomy_field("key_features", "Key Features", "main product features such as touch control, HD display, noise reduction, smart control", "关键特征"),  
165 - _taxonomy_field("material_finish", "Material / Finish", "main housing material or finish when supported", "材质 / 表面处理"),  
166 - _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),  
167 - _taxonomy_field("use_case", "Use Case", "intended use such as home entertainment, office, charging, security, repair", "使用场景"),  
168 -)  
169 -  
170 -OUTDOOR_TAXONOMY_FIELDS = (  
171 - _taxonomy_field("product_type", "Product Type", "concise outdoor gear category label", "品类"),  
172 - _taxonomy_field("activity_type", "Activity Type", "primary outdoor activity such as camping, hiking, fishing, climbing, travel", "活动类型"),  
173 - _taxonomy_field("season_weather", "Season / Weather", "season or weather suitability when supported", "适用季节 / 天气"),  
174 - _taxonomy_field("material", "Material", "main material such as aluminum, ripstop nylon, stainless steel, EVA", "材质"),  
175 - _taxonomy_field("capacity_size", "Capacity / Size", "size, length, or capacity when stated", "容量 / 尺寸"),  
176 - _taxonomy_field("protection_resistance", "Protection / Resistance", "resistance or protection such as waterproof, UV resistant, windproof", "防护 / 耐受性"),  
177 - _taxonomy_field("key_features", "Key Features", "primary gear attributes such as foldable, lightweight, insulated, non-slip", "关键特征"),  
178 - _taxonomy_field("portability_packability", "Portability / Packability", "carry or storage trait such as collapsible, compact, ultralight, packable", "便携 / 收纳性"),  
179 - _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),  
180 - _taxonomy_field("use_scenario", "Use Scenario", "likely use setting such as campsite, trail, survival kit, beach, picnic", "使用场景"),  
181 -)  
182 -  
183 -HOME_APPLIANCES_TAXONOMY_FIELDS = (  
184 - _taxonomy_field("product_type", "Product Type", "concise home appliance category label", "品类"),  
185 - _taxonomy_field("appliance_category", "Appliance Category", "functional class such as kitchen appliance, cleaning appliance, personal care appliance", "家电类别"),  
186 - _taxonomy_field("power_voltage", "Power / Voltage", "wattage, voltage, plug type, or power supply when supported", "功率 / 电压"),  
187 - _taxonomy_field("capacity_coverage", "Capacity / Coverage", "capacity or coverage metric such as 1.5L, 20L, 40sqm", "容量 / 覆盖范围"),  
188 - _taxonomy_field("control_method", "Control Method", "operation method such as touch, knob, remote, app control", "控制方式"),  
189 - _taxonomy_field("installation_type", "Installation Type", "setup style such as countertop, handheld, portable, wall-mounted, built-in", "安装方式"),  
190 - _taxonomy_field("key_features", "Key Features", "main product features such as timer, steam, HEPA filter, self-cleaning", "关键特征"),  
191 - _taxonomy_field("material_finish", "Material / Finish", "main material or exterior finish when supported", "材质 / 表面处理"),  
192 - _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),  
193 - _taxonomy_field("use_scenario", "Use Scenario", "intended use such as cooking, cleaning, grooming, cooling, air treatment", "使用场景"),  
194 -)  
195 -  
196 -HOME_LIVING_TAXONOMY_FIELDS = (  
197 - _taxonomy_field("product_type", "Product Type", "concise home and living category label", "品类"),  
198 - _taxonomy_field("room_placement", "Room / Placement", "intended room or placement such as bedroom, kitchen, bathroom, desktop", "适用空间 / 摆放位置"),  
199 - _taxonomy_field("material", "Material", "main material such as wood, ceramic, cotton, glass, metal", "材质"),  
200 - _taxonomy_field("style", "Style", "home style such as modern, farmhouse, minimalist, boho, Nordic", "风格"),  
201 - _taxonomy_field("size_dimensions", "Size / Dimensions", "size or dimensions when stated", "尺寸 / 规格"),  
202 - _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),  
203 - _taxonomy_field("pattern_finish", "Pattern / Finish", "surface pattern or finish such as solid, marble, matte, ribbed", "图案 / 表面处理"),  
204 - _taxonomy_field("key_features", "Key Features", "main product features such as stackable, washable, blackout, space-saving", "关键特征"),  
205 - _taxonomy_field("assembly_installation", "Assembly / Installation", "assembly or installation trait when supported", "组装 / 安装"),  
206 - _taxonomy_field("use_scenario", "Use Scenario", "intended use such as storage, dining, decor, sleep, organization", "使用场景"),  
207 -)  
208 -  
209 -WIGS_TAXONOMY_FIELDS = (  
210 - _taxonomy_field("product_type", "Product Type", "concise wig or hairpiece category label", "品类"),  
211 - _taxonomy_field("hair_material", "Hair Material", "hair material such as human hair, synthetic fiber, heat-resistant fiber", "发丝材质"),  
212 - _taxonomy_field("hair_texture", "Hair Texture", "texture or curl pattern such as straight, body wave, curly, kinky", "发质纹理"),  
213 - _taxonomy_field("hair_length", "Hair Length", "hair length when stated", "发长"),  
214 - _taxonomy_field("hair_color", "Hair Color", "specific hair color or blend when available", "发色"),  
215 - _taxonomy_field("cap_construction", "Cap Construction", "cap type such as full lace, lace front, glueless, U part", "帽网结构"),  
216 - _taxonomy_field("lace_area_part_type", "Lace Area / Part Type", "lace size or part style such as 13x4 lace, middle part, T part", "蕾丝面积 / 分缝类型"),  
217 - _taxonomy_field("density_volume", "Density / Volume", "hair density or fullness when supported", "密度 / 发量"),  
218 - _taxonomy_field("style_bang_type", "Style / Bang Type", "style cue such as bob, pixie, layered, with bangs", "款式 / 刘海类型"),  
219 - _taxonomy_field("occasion_end_use", "Occasion / End Use", "intended use such as daily wear, cosplay, protective style, party", "适用场景"),  
220 -)  
221 -  
222 -BEAUTY_TAXONOMY_FIELDS = (  
223 - _taxonomy_field("product_type", "Product Type", "concise beauty or cosmetics category label", "品类"),  
224 - _taxonomy_field("target_area", "Target Area", "target area such as face, lips, eyes, nails, hair, body", "适用部位"),  
225 - _taxonomy_field("skin_hair_type", "Skin Type / Hair Type", "suitable skin or hair type when supported", "肤质 / 发质"),  
226 - _taxonomy_field("finish_effect", "Finish / Effect", "cosmetic finish or effect such as matte, dewy, volumizing, brightening", "妆效 / 效果"),  
227 - _taxonomy_field("key_ingredients", "Key Ingredients", "notable ingredients when stated", "关键成分"),  
228 - _taxonomy_field("shade_color", "Shade / Color", "specific shade or color when available", "色号 / 颜色"),  
229 - _taxonomy_field("scent", "Scent", "fragrance or scent only when supported", "香味"),  
230 - _taxonomy_field("formulation", "Formulation", "product form such as cream, serum, powder, gel, stick", "剂型 / 形态"),  
231 - _taxonomy_field("functional_benefits", "Functional Benefits", "benefits such as hydration, anti-aging, long-wear, repair, sun protection", "功能"),  
232 - _taxonomy_field("use_scenario", "Use Scenario", "intended use such as daily routine, salon, travel, evening makeup", "使用场景"),  
233 -)  
234 -  
235 -ACCESSORIES_TAXONOMY_FIELDS = (  
236 - _taxonomy_field("product_type", "Product Type", "concise accessory category label such as necklace, watch, belt, hat, or sunglasses", "品类"),  
237 - _taxonomy_field("target_gender", "Target Gender", "intended gender only if clearly implied", "目标性别"),  
238 - _taxonomy_field("material", "Material", "main material such as alloy, leather, stainless steel, acetate, fabric", "材质"),  
239 - _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),  
240 - _taxonomy_field("pattern_finish", "Pattern / Finish", "surface treatment or style finish such as polished, textured, braided, rhinestone", "图案 / 表面处理"),  
241 - _taxonomy_field("closure_fastening", "Closure / Fastening", "fastening method when applicable", "闭合 / 固定方式"),  
242 - _taxonomy_field("size_fit", "Size / Fit", "size or fit information such as adjustable, one size, 42mm", "尺寸 / 适配"),  
243 - _taxonomy_field("style", "Style", "style cue such as minimalist, vintage, statement, sporty", "风格"),  
244 - _taxonomy_field("occasion_end_use", "Occasion / End Use", "likely use such as daily wear, formal, party, travel, sun protection", "适用场景"),  
245 - _taxonomy_field("set_pack_size", "Set / Pack Size", "set count or pack size when stated", "套装 / 规格"),  
246 -)  
247 -  
248 -TOYS_TAXONOMY_FIELDS = (  
249 - _taxonomy_field("product_type", "Product Type", "concise toy category label", "品类"),  
250 - _taxonomy_field("age_group", "Age Group", "intended age group when clearly implied", "年龄段"),  
251 - _taxonomy_field("character_theme", "Character / Theme", "licensed character, theme, or play theme when supported", "角色 / 主题"),  
252 - _taxonomy_field("material", "Material", "main toy material such as plush, plastic, wood, silicone", "材质"),  
253 - _taxonomy_field("power_source", "Power Source", "battery, rechargeable, wind-up, or non-powered when supported", "供电方式"),  
254 - _taxonomy_field("interactive_features", "Interactive Features", "interactive functions such as sound, lights, remote control, motion", "互动功能"),  
255 - _taxonomy_field("educational_play_value", "Educational / Play Value", "play value such as STEM, pretend play, sensory, puzzle solving", "教育 / 可玩性"),  
256 - _taxonomy_field("piece_count_size", "Piece Count / Size", "piece count or size when stated", "件数 / 尺寸"),  
257 - _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),  
258 - _taxonomy_field("use_scenario", "Use Scenario", "intended use such as indoor play, bath time, party favor, outdoor play", "使用场景"),  
259 -)  
260 -  
261 -SHOES_TAXONOMY_FIELDS = (  
262 - _taxonomy_field("product_type", "Product Type", "concise footwear category label", "品类"),  
263 - _taxonomy_field("target_gender", "Target Gender", "intended gender only if clearly implied", "目标性别"),  
264 - _taxonomy_field("age_group", "Age Group", "only if clearly implied", "年龄段"),  
265 - _taxonomy_field("closure_type", "Closure Type", "fastening method such as lace-up, slip-on, buckle, hook-and-loop", "闭合方式"),  
266 - _taxonomy_field("toe_shape", "Toe Shape", "toe shape when applicable, e.g. round toe, pointed toe, open toe", "鞋头形状"),  
267 - _taxonomy_field("heel_sole_type", "Heel Height / Sole Type", "heel or sole profile such as flat, block heel, wedge, platform, thick sole", "跟高 / 鞋底类型"),  
268 - _taxonomy_field("upper_material", "Upper Material", "main upper material such as leather, knit, canvas, mesh", "鞋面材质"),  
269 - _taxonomy_field("lining_insole_material", "Lining / Insole Material", "lining or insole material when supported", "里料 / 鞋垫材质"),  
270 - _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),  
271 - _taxonomy_field("occasion_end_use", "Occasion / End Use", "likely use such as running, casual, office, hiking, formal", "适用场景"),  
272 -)  
273 -  
274 -SPORTS_TAXONOMY_FIELDS = (  
275 - _taxonomy_field("product_type", "Product Type", "concise sports product category label", "品类"),  
276 - _taxonomy_field("sport_activity", "Sport / Activity", "primary sport or activity such as fitness, yoga, basketball, cycling, swimming", "运动 / 活动"),  
277 - _taxonomy_field("skill_level", "Skill Level", "target user level when supported, e.g. beginner, training, professional", "适用水平"),  
278 - _taxonomy_field("material", "Material", "main material such as EVA, carbon fiber, neoprene, latex", "材质"),  
279 - _taxonomy_field("size_capacity", "Size / Capacity", "size, weight, resistance level, or capacity when stated", "尺寸 / 容量"),  
280 - _taxonomy_field("protection_support", "Protection / Support", "support or protection function such as ankle support, shock absorption, impact protection", "防护 / 支撑"),  
281 - _taxonomy_field("key_features", "Key Features", "main features such as anti-slip, adjustable, foldable, quick-dry", "关键特征"),  
282 - _taxonomy_field("power_source", "Power Source", "battery, electric, or non-powered when applicable", "供电方式"),  
283 - _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),  
284 - _taxonomy_field("use_scenario", "Use Scenario", "intended use such as gym, home workout, field training, competition", "使用场景"),  
285 -)  
286 -  
287 -OTHERS_TAXONOMY_FIELDS = (  
288 - _taxonomy_field("product_type", "Product Type", "concise product category label, not a full marketing title", "品类"),  
289 - _taxonomy_field("product_category", "Product Category", "broader retail grouping when the specific product type is narrow", "商品类别"),  
290 - _taxonomy_field("target_user", "Target User", "intended user, audience, or recipient when clearly implied", "适用人群"),  
291 - _taxonomy_field("material_ingredients", "Material / Ingredients", "main material or ingredients when supported", "材质 / 成分"),  
292 - _taxonomy_field("key_features", "Key Features", "primary product attributes or standout features", "关键特征"),  
293 - _taxonomy_field("functional_benefits", "Functional Benefits", "practical benefits or performance advantages when supported", "功能"),  
294 - _taxonomy_field("size_capacity", "Size / Capacity", "size, count, weight, or capacity when stated", "尺寸 / 容量"),  
295 - _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),  
296 - _taxonomy_field("style_theme", "Style / Theme", "overall style, design theme, or visual direction when supported", "风格 / 主题"),  
297 - _taxonomy_field("use_scenario", "Use Scenario", "likely use occasion or application setting when supported", "使用场景"),  
298 -)  
299 -  
300 -CATEGORY_TAXONOMY_PROFILES: Dict[str, Dict[str, Any]] = {  
301 - "apparel": _make_taxonomy_profile(  
302 - "apparel",  
303 - APPAREL_TAXONOMY_FIELDS,  
304 - ),  
305 - "3c": _make_taxonomy_profile(  
306 - "3C",  
307 - THREE_C_TAXONOMY_FIELDS,  
308 - ),  
309 - "bags": _make_taxonomy_profile(  
310 - "bags",  
311 - BAGS_TAXONOMY_FIELDS,  
312 - ),  
313 - "pet_supplies": _make_taxonomy_profile(  
314 - "pet supplies",  
315 - PET_SUPPLIES_TAXONOMY_FIELDS,  
316 - ),  
317 - "electronics": _make_taxonomy_profile(  
318 - "electronics",  
319 - ELECTRONICS_TAXONOMY_FIELDS,  
320 - ),  
321 - "outdoor": _make_taxonomy_profile(  
322 - "outdoor products",  
323 - OUTDOOR_TAXONOMY_FIELDS,  
324 - ),  
325 - "home_appliances": _make_taxonomy_profile(  
326 - "home appliances",  
327 - HOME_APPLIANCES_TAXONOMY_FIELDS,  
328 - ),  
329 - "home_living": _make_taxonomy_profile(  
330 - "home and living",  
331 - HOME_LIVING_TAXONOMY_FIELDS,  
332 - ),  
333 - "wigs": _make_taxonomy_profile(  
334 - "wigs",  
335 - WIGS_TAXONOMY_FIELDS,  
336 - ),  
337 - "beauty": _make_taxonomy_profile(  
338 - "beauty and cosmetics",  
339 - BEAUTY_TAXONOMY_FIELDS,  
340 - ),  
341 - "accessories": _make_taxonomy_profile(  
342 - "accessories",  
343 - ACCESSORIES_TAXONOMY_FIELDS,  
344 - ),  
345 - "toys": _make_taxonomy_profile(  
346 - "toys",  
347 - TOYS_TAXONOMY_FIELDS,  
348 - ),  
349 - "shoes": _make_taxonomy_profile(  
350 - "shoes",  
351 - SHOES_TAXONOMY_FIELDS,  
352 - ),  
353 - "sports": _make_taxonomy_profile(  
354 - "sports products",  
355 - SPORTS_TAXONOMY_FIELDS,  
356 - ),  
357 - "others": _make_taxonomy_profile(  
358 - "general merchandise",  
359 - OTHERS_TAXONOMY_FIELDS,  
360 - ),  
361 -}  
362 -  
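How callers consumed this registry is not shown in the diff; as a hedged sketch, a prompt was presumably assembled from a profile's `shared_instruction`, a numbered product list, and `USER_INSTRUCTION_TEMPLATE` from the top of the file. The `build_prompt` helper and its fallback to the `"others"` profile are assumptions for illustration, and the profile bodies are toy stand-ins:

```python
from typing import Any, Dict

USER_INSTRUCTION_TEMPLATE = (
    "Please strictly return a Markdown table following the given columns in the "
    "specified language. For any column containing multiple values, separate them "
    "with commas. Do not add any other explanation.\nLanguage: {language}"
)

# Toy stand-ins; the real profiles carried full field tuples and headers.
CATEGORY_TAXONOMY_PROFILES: Dict[str, Dict[str, Any]] = {
    "wigs": {"shared_instruction": "Analyze each input product text ... Input product list:"},
    "others": {"shared_instruction": "Analyze each input product text ... Input product list:"},
}

def build_prompt(category: str, language: str, products: list[str]) -> str:
    # Hypothetical assembly: unknown categories fall back to the generic profile.
    profile = CATEGORY_TAXONOMY_PROFILES.get(category, CATEGORY_TAXONOMY_PROFILES["others"])
    body = "\n".join(f"{i}. {p}" for i, p in enumerate(products, start=1))
    return (profile["shared_instruction"] + "\n" + body + "\n\n"
            + USER_INSTRUCTION_TEMPLATE.format(language=language))

prompt = build_prompt("wigs", "zh", ["13x4 lace front bob wig"])
```

Numbering the input rows matters because the shared instruction requires the model to keep the same row order and count, and the `No.` column in the output table echoes these indices.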
363 -TAXONOMY_SHARED_ANALYSIS_INSTRUCTION = CATEGORY_TAXONOMY_PROFILES["apparel"]["shared_instruction"]  
364 -TAXONOMY_MARKDOWN_TABLE_HEADERS_EN = CATEGORY_TAXONOMY_PROFILES["apparel"]["markdown_table_headers"]["en"]  
365 -TAXONOMY_LANGUAGE_MARKDOWN_TABLE_HEADERS: Dict[str, list[str]] = CATEGORY_TAXONOMY_PROFILES["apparel"]["markdown_table_headers"]  
366 -  
367 -LANGUAGE_MARKDOWN_TABLE_HEADERS: Dict[str, list[str]] = {  
368 - "en": [  
369 - "No.",  
370 - "Product title",  
371 - "Category path",  
372 - "Fine-grained tags",  
373 - "Target audience",  
374 - "Usage scene",  
375 - "Season",  
376 - "Key attributes",  
377 - "Material",  
378 - "Features",  
379 - "Anchor text"  
380 - ],  
381 - "zh": [  
382 - "序号",  
383 - "商品标题",  
384 - "品类路径",  
385 - "细分标签",  
386 - "适用人群",  
387 - "使用场景",  
388 - "适用季节",  
389 - "关键属性",  
390 - "材质说明",  
391 - "功能特点",  
392 - "锚文本"  
393 - ],  
394 - "zh_tw": [  
395 - "序號",  
396 - "商品標題",  
397 - "品類路徑",  
398 - "細分標籤",  
399 - "適用人群",  
400 - "使用場景",  
401 - "適用季節",  
402 - "關鍵屬性",  
403 - "材質說明",  
404 - "功能特點",  
405 - "錨文本"  
406 - ],  
407 - "ru": [  
408 - "№",  
409 - "Название товара",  
410 - "Путь категории",  
411 - "Детализированные теги",  
412 - "Целевая аудитория",  
413 - "Сценарий использования",  
414 - "Сезон",  
415 - "Ключевые атрибуты",  
416 - "Материал",  
417 - "Особенности",  
418 - "Анкорный текст"  
419 - ],  
420 - "ja": [  
421 - "番号",  
422 - "商品タイトル",  
423 - "カテゴリパス",  
424 - "詳細タグ",  
425 - "対象ユーザー",  
426 - "利用シーン",  
427 - "季節",  
428 - "主要属性",  
429 - "素材",  
430 - "機能特徴",  
431 - "アンカーテキスト"  
432 - ],  
433 - "ko": [  
434 - "번호",  
435 - "상품 제목",  
436 - "카테고리 경로",  
437 - "세부 태그",  
438 - "대상 고객",  
439 - "사용 장면",  
440 - "계절",  
441 - "핵심 속성",  
442 - "소재",  
443 - "기능 특징",  
444 - "앵커 텍스트"  
445 - ],  
446 - "es": [  
447 - "N.º",  
448 - "Titulo del producto",  
449 - "Ruta de categoria",  
450 - "Etiquetas detalladas",  
451 - "Publico objetivo",  
452 - "Escenario de uso",  
453 - "Temporada",  
454 - "Atributos clave",  
455 - "Material",  
456 - "Caracteristicas",  
457 - "Texto ancla"  
458 - ],  
459 - "fr": [  
460 - "N°",  
461 - "Titre du produit",  
462 - "Chemin de categorie",  
463 - "Etiquettes detaillees",  
464 - "Public cible",  
465 - "Scenario d'utilisation",  
466 - "Saison",  
467 - "Attributs cles",  
468 - "Matiere",  
469 - "Caracteristiques",  
470 - "Texte d'ancrage"  
471 - ],  
472 - "pt": [  
473 - "Nº",  
474 - "Titulo do produto",  
475 - "Caminho da categoria",  
476 - "Tags detalhadas",  
477 - "Publico-alvo",  
478 - "Cenario de uso",  
479 - "Estacao",  
480 - "Atributos principais",  
481 - "Material",  
482 - "Caracteristicas",  
483 - "Texto ancora"  
484 - ],  
485 - "de": [  
486 - "Nr.",  
487 - "Produkttitel",  
488 - "Kategoriepfad",  
489 - "Detaillierte Tags",  
490 - "Zielgruppe",  
491 - "Nutzungsszenario",  
492 - "Saison",  
493 - "Wichtige Attribute",  
494 - "Material",  
495 - "Funktionen",  
496 - "Ankertext"  
497 - ],  
498 - "it": [  
499 - "N.",  
500 - "Titolo del prodotto",  
501 - "Percorso categoria",  
502 - "Tag dettagliati",  
503 - "Pubblico target",  
504 - "Scenario d'uso",  
505 - "Stagione",  
506 - "Attributi chiave",  
507 - "Materiale",  
508 - "Caratteristiche",  
509 - "Testo ancora"  
510 - ],  
511 - "th": [  
512 - "ลำดับ",  
513 - "ชื่อสินค้า",  
514 - "เส้นทางหมวดหมู่",  
515 - "แท็กย่อย",  
516 - "กลุ่มเป้าหมาย",  
517 - "สถานการณ์การใช้งาน",  
518 - "ฤดูกาล",  
519 - "คุณสมบัติสำคัญ",  
520 - "วัสดุ",  
521 - "คุณสมบัติการใช้งาน",  
522 - "แองเคอร์เท็กซ์"  
523 - ],  
524 - "vi": [  
525 - "STT",  
526 - "Tieu de san pham",  
527 - "Duong dan danh muc",  
528 - "The chi tiet",  
529 - "Doi tuong phu hop",  
530 - "Boi canh su dung",  
531 - "Mua phu hop",  
532 - "Thuoc tinh chinh",  
533 - "Chat lieu",  
534 - "Tinh nang",  
535 - "Van ban neo"  
536 - ],  
537 - "id": [  
538 - "No.",  
539 - "Judul produk",  
540 - "Jalur kategori",  
541 - "Tag terperinci",  
542 - "Target pengguna",  
543 - "Skenario penggunaan",  
544 - "Musim",  
545 - "Atribut utama",  
546 - "Bahan",  
547 - "Fitur",  
548 - "Teks jangkar"  
549 - ],  
550 - "ms": [  
551 - "No.",  
552 - "Tajuk produk",  
553 - "Laluan kategori",  
554 - "Tag terperinci",  
555 - "Sasaran pengguna",  
556 - "Senario penggunaan",  
557 - "Musim",  
558 - "Atribut utama",  
559 - "Bahan",  
560 - "Ciri-ciri",  
561 - "Teks sauh"  
562 - ],  
563 - "ar": [  
564 - "الرقم",  
565 - "عنوان المنتج",  
566 - "مسار الفئة",  
567 - "الوسوم التفصيلية",  
568 - "الفئة المستهدفة",  
569 - "سيناريو الاستخدام",  
570 - "الموسم",  
571 - "السمات الرئيسية",  
572 - "المادة",  
573 - "الميزات",  
574 - "نص الربط"  
575 - ],  
576 - "hi": [  
577 - "क्रमांक",  
578 - "उत्पाद शीर्षक",  
579 - "श्रेणी पथ",  
580 - "विस्तृत टैग",  
581 - "लक्षित उपभोक्ता",  
582 - "उपयोग परिदृश्य",  
583 - "मौसम",  
584 - "मुख्य गुण",  
585 - "सामग्री",  
586 - "विशेषताएं",  
587 - "एंकर टेक्स्ट"  
588 - ],  
589 - "he": [  
590 - "מס׳",  
591 - "כותרת המוצר",  
592 - "נתיב קטגוריה",  
593 - "תגיות מפורטות",  
594 - "קהל יעד",  
595 - "תרחיש שימוש",  
596 - "עונה",  
597 - "מאפיינים מרכזיים",  
598 - "חומר",  
599 - "תכונות",  
600 - "טקסט עוגן"  
601 - ],  
602 - "my": [  
603 - "အမှတ်စဉ်",  
604 - "ကုန်ပစ္စည်းခေါင်းစဉ်",  
605 - "အမျိုးအစားလမ်းကြောင်း",  
606 - "အသေးစိတ်တဂ်များ",  
607 - "ပစ်မှတ်အသုံးပြုသူ",  
608 - "အသုံးပြုမှုအခြေအနေ",  
609 - "ရာသီ",  
610 - "အဓိကဂုဏ်သတ္တိများ",  
611 - "ပစ္စည်း",  
612 - "လုပ်ဆောင်ချက်များ",  
613 - "အန်ကာစာသား"  
614 - ],  
615 - "ta": [  
616 - "எண்",  
617 - "தயாரிப்பு தலைப்பு",  
618 - "வகை பாதை",  
619 - "விரிவான குறிச்சொற்கள்",  
620 - "இலக்கு பயனர்கள்",  
621 - "பயன்பாட்டு நிலை",  
622 - "பருவம்",  
623 - "முக்கிய பண்புகள்",  
624 - "பொருள்",  
625 - "அம்சங்கள்",  
626 - "ஆங்கர் உரை"  
627 - ],  
628 - "ur": [  
629 - "نمبر",  
630 - "پروڈکٹ عنوان",  
631 - "زمرہ راستہ",  
632 - "تفصیلی ٹیگز",  
633 - "ہدف صارفین",  
634 - "استعمال کا منظر",  
635 - "موسم",  
636 - "کلیدی خصوصیات",  
637 - "مواد",  
638 - "فیچرز",  
639 - "اینکر ٹیکسٹ"  
640 - ],  
641 - "bn": [  
642 - "ক্রম",  
643 - "পণ্যের শিরোনাম",  
644 - "শ্রেণি পথ",  
645 - "বিস্তারিত ট্যাগ",  
646 - "লক্ষ্য ব্যবহারকারী",  
647 - "ব্যবহারের দৃশ্য",  
648 - "মৌসুম",  
649 - "মূল বৈশিষ্ট্য",  
650 - "উপাদান",  
651 - "ফিচার",  
652 - "অ্যাঙ্কর টেক্সট"  
653 - ],  
654 - "pl": [  
655 - "Nr",  
656 - "Tytul produktu",  
657 - "Sciezka kategorii",  
658 - "Szczegolowe tagi",  
659 - "Grupa docelowa",  
660 - "Scenariusz uzycia",  
661 - "Sezon",  
662 - "Kluczowe atrybuty",  
663 - "Material",  
664 - "Cechy",  
665 - "Tekst kotwicy"  
666 - ],  
667 - "nl": [  
668 - "Nr.",  
669 - "Producttitel",  
670 - "Categoriepad",  
671 - "Gedetailleerde tags",  
672 - "Doelgroep",  
673 - "Gebruikscontext",  
674 - "Seizoen",  
675 - "Belangrijke kenmerken",  
676 - "Materiaal",  
677 - "Functies",  
678 - "Ankertekst"  
679 - ],  
680 - "ro": [  
681 - "Nr.",  
682 - "Titlul produsului",  
683 - "Calea categoriei",  
684 - "Etichete detaliate",  
685 - "Public tinta",  
686 - "Scenariu de utilizare",  
687 - "Sezon",  
688 - "Atribute cheie",  
689 - "Material",  
690 - "Caracteristici",  
691 - "Text ancora"  
692 - ],  
693 - "tr": [  
694 - "No.",  
695 - "Urun basligi",  
696 - "Kategori yolu",  
697 - "Ayrintili etiketler",  
698 - "Hedef kitle",  
699 - "Kullanim senaryosu",  
700 - "Sezon",  
701 - "Temel ozellikler",  
702 - "Malzeme",  
703 - "Ozellikler",  
704 - "Capa metni"  
705 - ],  
706 - "km": [  
707 - "ល.រ",  
708 - "ចំណងជើងផលិតផល",  
709 - "ផ្លូវប្រភេទ",  
710 - "ស្លាកលម្អិត",  
711 - "ក្រុមអ្នកប្រើគោលដៅ",  
712 - "សេណារីយ៉ូប្រើប្រាស់",  
713 - "រដូវកាល",  
714 - "លក្ខណៈសម្បត្តិសំខាន់",  
715 - "សម្ភារៈ",  
716 - "មុខងារ",  
717 - "អត្ថបទអង់ក័រ"  
718 - ],  
719 - "lo": [  
720 - "ລຳດັບ",  
721 - "ຊື່ສິນຄ້າ",  
722 - "ເສັ້ນທາງໝວດໝູ່",  
723 - "ແທັກລະອຽດ",  
724 - "ກຸ່ມເປົ້າໝາຍ",  
725 - "ສະຖານະການໃຊ້ງານ",  
726 - "ລະດູການ",  
727 - "ຄຸນລັກສະນະສຳຄັນ",  
728 - "ວັດສະດຸ",  
729 - "ຄຸນສົມບັດ",  
730 - "ຂໍ້ຄວາມອັງເຄີ"  
731 - ],  
732 - "yue": [  
733 - "序號",  
734 - "商品標題",  
735 - "品類路徑",  
736 - "細分類標籤",  
737 - "適用人群",  
738 - "使用場景",  
739 - "適用季節",  
740 - "關鍵屬性",  
741 - "材質說明",  
742 - "功能特點",  
743 - "錨文本"  
744 - ],  
745 - "cs": [  
746 - "C.",  
747 - "Nazev produktu",  
748 - "Cesta kategorie",  
749 - "Podrobne stitky",  
750 - "Cilova skupina",  
751 - "Scenar pouziti",  
752 - "Sezona",  
753 - "Klicove atributy",  
754 - "Material",  
755 - "Vlastnosti",  
756 - "Kotvici text"  
757 - ],  
758 - "el": [  
759 - "Α/Α",  
760 - "Τίτλος προϊόντος",  
761 - "Διαδρομή κατηγορίας",  
762 - "Αναλυτικές ετικέτες",  
763 - "Κοινό-στόχος",  
764 - "Σενάριο χρήσης",  
765 - "Εποχή",  
766 - "Βασικά χαρακτηριστικά",  
767 - "Υλικό",  
768 - "Λειτουργίες",  
769 - "Κείμενο άγκυρας"  
770 - ],  
771 - "sv": [  
772 - "Nr",  
773 - "Produkttitel",  
774 - "Kategorisokvag",  
775 - "Detaljerade taggar",  
776 - "Malgrupp",  
777 - "Anvandningsscenario",  
778 - "Sasong",  
779 - "Viktiga attribut",  
780 - "Material",  
781 - "Funktioner",  
782 - "Ankartext"  
783 - ],  
784 - "hu": [  
785 - "Sorszam",  
786 - "Termekcim",  
787 - "Kategoriavonal",  
788 - "Reszletes cimkek",  
789 - "Celcsoport",  
790 - "Hasznalati helyzet",  
791 - "Evszak",  
792 - "Fo jellemzok",  
793 - "Anyag",  
794 - "Funkciok",  
795 - "Horgonyszoveg"  
796 - ],  
797 - "da": [  
798 - "Nr.",  
799 - "Produkttitel",  
800 - "Kategoristi",  
801 - "Detaljerede tags",  
802 - "Malgruppe",  
803 - "Brugsscenarie",  
804 - "Saeson",  
805 - "Nogleattributter",  
806 - "Materiale",  
807 - "Funktioner",  
808 - "Ankertekst"  
809 - ],  
810 - "fi": [  
811 - "Nro",  
812 - "Tuotteen nimi",  
813 - "Kategoriapolku",  
814 - "Yksityiskohtaiset tunnisteet",  
815 - "Kohdeyleiso",  
816 - "Kayttotilanne",  
817 - "Kausi",  
818 - "Keskeiset ominaisuudet",  
819 - "Materiaali",  
820 - "Ominaisuudet",  
821 - "Ankkuriteksti"  
822 - ],  
823 - "uk": [  
824 - "№",  
825 - "Назва товару",  
826 - "Шлях категорії",  
827 - "Детальні теги",  
828 - "Цільова аудиторія",  
829 - "Сценарій використання",  
830 - "Сезон",  
831 - "Ключові атрибути",  
832 - "Матеріал",  
833 - "Особливості",  
834 - "Анкорний текст"  
835 - ],  
836 - "bg": [  
837 - "№",  
838 - "Заглавие на продукта",  
839 - "Път на категорията",  
840 - "Подробни тагове",  
841 - "Целева аудитория",  
842 - "Сценарий на употреба",  
843 - "Сезон",  
844 - "Ключови атрибути",  
845 - "Материал",  
846 - "Характеристики",  
847 - "Анкор текст"  
848 - ]  
849 -}  
indexer/product_enrich模块说明.md deleted
@@ -1,173 +0,0 @@
1 -# 内容富化模块说明  
2 -  
3 -本文说明商品内容富化模块的职责、入口、输出结构,以及当前 taxonomy profile 的设计约束。  
4 -  
5 -## 1. 模块目标  
6 -  
7 -内容富化模块负责基于商品文本调用 LLM,生成以下索引字段:  
8 -  
9 -- `qanchors`  
10 -- `enriched_tags`  
11 -- `enriched_attributes`  
12 -- `enriched_taxonomy_attributes`  
13 -  
14 -模块追求的设计原则:  
15 -  
16 -- 单一职责:只负责内容理解与结构化输出,不负责 CSV 读写  
17 -- 输出对齐 ES mapping:返回结构可直接写入 `search_products`  
18 -- 配置化扩展:taxonomy profile 通过数据配置扩展,而不是散落条件分支  
19 -- 代码精简:只面向正常使用方式,避免为了不合理调用堆叠补丁逻辑  
20 -  
21 -## 2. 主要文件  
22 -  
23 -- [product_enrich.py](/data/saas-search/indexer/product_enrich.py)  
24 - 运行时主逻辑,负责批处理、缓存、prompt 组装、LLM 调用、markdown 解析、输出整理  
25 -- [product_enrich_prompts.py](/data/saas-search/indexer/product_enrich_prompts.py)  
26 - prompt 模板与 taxonomy profile 配置  
27 -- [document_transformer.py](/data/saas-search/indexer/document_transformer.py)  
28 - 在内部索引构建链路中调用内容富化模块,把结果回填到 ES doc  
29 -- [taxonomy.md](/data/saas-search/indexer/taxonomy.md)  
30 - taxonomy 设计说明与字段清单  
31 -  
32 -## 3. 对外入口  
33 -  
34 -### 3.1 Python 入口  
35 -  
36 -核心入口:  
37 -  
38 -```python  
39 -build_index_content_fields(  
40 - items,  
41 - tenant_id=None,  
42 - enrichment_scopes=None,  
43 - category_taxonomy_profile=None,  
44 -)  
45 -```  
46 -  
47 -输入最小要求:  
48 -  
49 -- `id` 或 `spu_id`  
50 -- `title`  
51 -  
52 -可选输入:  
53 -  
54 -- `brief`  
55 -- `description`  
56 -- `image_url`  
57 -  
58 -关键参数:  
59 -  
60 -- `enrichment_scopes`  
61 - 可选 `generic`、`category_taxonomy`  
62 -- `category_taxonomy_profile`  
63 - taxonomy profile;默认 `apparel`  
64 -  
65 -### 3.2 HTTP 入口  
66 -  
67 -API 路由:  
68 -  
69 -- `POST /indexer/enrich-content`  
70 -  
71 -对应文档:  
72 -  
73 -- [搜索API对接指南-05-索引接口(Indexer)](/data/saas-search/docs/搜索API对接指南-05-索引接口(Indexer).md)  
74 -- [搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation)](/data/saas-search/docs/搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md)  
75 -  
76 -## 4. 输出结构  
77 -  
78 -返回结果与 ES mapping 对齐:  
79 -  
80 -```json  
81 -{  
82 - "id": "223167",  
83 - "qanchors": {  
84 - "zh": ["短袖T恤", "纯棉"],  
85 - "en": ["t-shirt", "cotton"]  
86 - },  
87 - "enriched_tags": {  
88 - "zh": ["短袖", "纯棉"],  
89 - "en": ["short sleeve", "cotton"]  
90 - },  
91 - "enriched_attributes": [  
92 - {  
93 - "name": "enriched_tags",  
94 - "value": {  
95 - "zh": ["短袖", "纯棉"],  
96 - "en": ["short sleeve", "cotton"]  
97 - }  
98 - }  
99 - ],  
100 - "enriched_taxonomy_attributes": [  
101 - {  
102 - "name": "Product Type",  
103 - "value": {  
104 - "zh": ["T恤"],  
105 - "en": ["t-shirt"]  
106 - }  
107 - }  
108 - ]  
109 -}  
110 -```  
111 -  
112 -说明:  
113 -  
114 -- `generic` 部分固定输出核心索引语言 `zh`、`en`  
115 -- `taxonomy` 部分同样统一输出 `zh`、`en`  
116 -  
117 -## 5. Taxonomy profile  
118 -  
119 -当前支持:  
120 -  
121 -- `apparel`  
122 -- `3c`  
123 -- `bags`  
124 -- `pet_supplies`  
125 -- `electronics`  
126 -- `outdoor`  
127 -- `home_appliances`  
128 -- `home_living`  
129 -- `wigs`  
130 -- `beauty`  
131 -- `accessories`  
132 -- `toys`  
133 -- `shoes`  
134 -- `sports`  
135 -- `others`  
136 -  
137 -统一约束:  
138 -  
139 -- 所有 profile 都返回 `zh` + `en`  
140 -- profile 只决定 taxonomy 字段集合,不再决定输出语言  
141 -- 所有 profile 都配置中英文字段名,prompt/header 结构保持一致  
142 -  
143 -## 6. 内部索引链路的当前约束  
144 -  
145 -在内部 ES 文档构建链路里,`document_transformer` 当前调用内容富化时,taxonomy profile 暂时固定传:  
146 -  
147 -```python  
148 -category_taxonomy_profile="apparel"  
149 -```  
150 -  
151 -这是一种显式、可控、代码更干净的临时策略。  
152 -  
153 -当前代码里已保留 TODO:  
154 -  
155 -- 后续从数据库读取租户真实所属行业  
156 -- 再用该行业替换固定的 `apparel`  
157 -  
158 -当前不做“根据商品类目文本自动猜 profile”的隐式逻辑,避免增加冗余代码与不必要的不确定性。  
159 -  
160 -## 7. 缓存与批处理  
161 -  
162 -缓存键由以下信息共同决定:  
163 -  
164 -- `analysis_kind`  
165 -- `target_lang`  
166 -- prompt/schema 版本指纹  
167 -- prompt 实际输入文本  
168 -  
169 -批处理规则:  
170 -  
171 -- 单次 LLM 调用最多 20 条  
172 -- 上层允许传更大批次,模块内部自动拆批  
173 -- uncached batch 可并发执行  
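The deleted doc above fixed the batching contract the module's unit tests relied on: at most 20 items per LLM call, larger caller batches split internally, uncached batches run concurrently. A rough sketch of that rule, for readers migrating the capability elsewhere (`split_batches` and `analyze_concurrently` are illustrative names, not the deleted module's actual API):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List

MAX_LLM_BATCH = 20  # documented per-call cap of the removed module


def split_batches(items: List[Dict[str, Any]], batch_size: int) -> List[List[Dict[str, Any]]]:
    # Callers may pass any batch_size; the effective size is clamped to [1, MAX_LLM_BATCH],
    # so 45 items with batch_size=200 become batches of 20 + 20 + 5.
    size = max(1, min(batch_size, MAX_LLM_BATCH))
    return [items[i:i + size] for i in range(0, len(items), size)]


def analyze_concurrently(
    items: List[Dict[str, Any]],
    batch_size: int,
    process_batch: Callable[[List[Dict[str, Any]]], List[Dict[str, Any]]],
) -> List[Dict[str, Any]]:
    # Uncached batches may execute concurrently; pool.map still yields results
    # in submission order, so the flattened output matches the input order.
    batches = split_batches(items, batch_size)
    results: List[Dict[str, Any]] = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        for batch_result in pool.map(process_batch, batches):
            results.extend(batch_result)
    return results
```

This mirrors what the deleted `tests/test_process_products_batching.py` asserted: batch sizes of `[20, 20, 5]` for 45 items, and a minimum effective batch size of 1 when `batch_size=0` is passed.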
indexer/spu_transformer.py
@@ -220,7 +220,6 @@ class SPUTransformer:
220 logger.info(f"Grouped options into {len(option_groups)} SPU groups") 220 logger.info(f"Grouped options into {len(option_groups)} SPU groups")
221 221
222 documents: List[Dict[str, Any]] = [] 222 documents: List[Dict[str, Any]] = []
223 - doc_spu_rows: List[pd.Series] = []  
224 skipped_count = 0 223 skipped_count = 0
225 error_count = 0 224 error_count = 0
226 225
@@ -244,11 +243,9 @@ class SPUTransformer:
244 spu_row=spu_row, 243 spu_row=spu_row,
245 skus=skus, 244 skus=skus,
246 options=options, 245 options=options,
247 - fill_llm_attributes=False,  
248 ) 246 )
249 if doc: 247 if doc:
250 documents.append(doc) 248 documents.append(doc)
251 - doc_spu_rows.append(spu_row)  
252 else: 249 else:
253 skipped_count += 1 250 skipped_count += 1
254 logger.warning(f"SPU {spu_id} transformation returned None, skipped") 251 logger.warning(f"SPU {spu_id} transformation returned None, skipped")
@@ -256,13 +253,6 @@ class SPUTransformer:
256 error_count += 1 253 error_count += 1
257 logger.error(f"Error transforming SPU {spu_id}: {e}", exc_info=True) 254 logger.error(f"Error transforming SPU {spu_id}: {e}", exc_info=True)
258 255
259 - # 批量填充 LLM 字段(尽量攒批,每次最多 20 条;失败仅 warning,不影响主流程)  
260 - try:  
261 - if documents and doc_spu_rows:  
262 - self.document_transformer.fill_llm_attributes_batch(documents, doc_spu_rows)  
263 - except Exception as e:  
264 - logger.warning("Batch LLM fill failed in transform_batch: %s", e)  
265 -  
266 logger.info(f"Transformation complete:") 256 logger.info(f"Transformation complete:")
267 logger.info(f" - Total SPUs: {len(spu_df)}") 257 logger.info(f" - Total SPUs: {len(spu_df)}")
268 logger.info(f" - Successfully transformed: {len(documents)}") 258 logger.info(f" - Successfully transformed: {len(documents)}")
@@ -270,5 +260,3 @@ class SPUTransformer:
270 logger.info(f" - Errors: {error_count}") 260 logger.info(f" - Errors: {error_count}")
271 261
272 return documents 262 return documents
273 -  
274 -  
scripts/debug/trace_indexer_calls.sh
@@ -66,7 +66,7 @@ echo ""
66 echo " - Indexer 内部会调用:" 66 echo " - Indexer 内部会调用:"
67 echo " - Text Embedding 服务 (${EMBEDDING_TEXT_PORT}): POST /embed/text" 67 echo " - Text Embedding 服务 (${EMBEDDING_TEXT_PORT}): POST /embed/text"
68 echo " - Image Embedding 服务 (${EMBEDDING_IMAGE_PORT}): POST /embed/image" 68 echo " - Image Embedding 服务 (${EMBEDDING_IMAGE_PORT}): POST /embed/image"
69 -echo " - Qwen API: dashscope.aliyuncs.com (翻译、LLM 分析)" 69 +echo " - Translation 服务 / 翻译后端(按当前配置)"
70 echo " - MySQL: 商品数据" 70 echo " - MySQL: 商品数据"
71 echo " - Elasticsearch: 写入索引" 71 echo " - Elasticsearch: 写入索引"
72 echo "" 72 echo ""
scripts/redis/redis_cache_health_check.py
@@ -2,7 +2,7 @@
2 """ 2 """
3 缓存状态巡检脚本 3 缓存状态巡检脚本
4 4
5 -按「缓存类型」维度(embedding / translation / anchors)查看: 5 +按「缓存类型」维度(embedding / translation)查看:
6 - 估算 key 数量 6 - 估算 key 数量
7 - TTL 分布(采样) 7 - TTL 分布(采样)
8 - 近期活跃 key(按 IDLETIME 近似) 8 - 近期活跃 key(按 IDLETIME 近似)
@@ -10,12 +10,12 @@
10 10
11 使用示例: 11 使用示例:
12 12
13 - # 默认:检查已知类缓存,使用 env_config 中的 Redis 配置 13 + # 默认:检查已知类缓存,使用 env_config 中的 Redis 配置
14 python scripts/redis/redis_cache_health_check.py 14 python scripts/redis/redis_cache_health_check.py
15 15
16 # 只看某一类缓存 16 # 只看某一类缓存
17 python scripts/redis/redis_cache_health_check.py --type embedding 17 python scripts/redis/redis_cache_health_check.py --type embedding
18 - python scripts/redis/redis_cache_health_check.py --type translation anchors 18 + python scripts/redis/redis_cache_health_check.py --type translation
19 19
20 # 自定义前缀(pattern),不限定缓存类型 20 # 自定义前缀(pattern),不限定缓存类型
21 python scripts/redis/redis_cache_health_check.py --pattern "mycache:*" 21 python scripts/redis/redis_cache_health_check.py --pattern "mycache:*"
@@ -27,7 +27,6 @@
27 from __future__ import annotations 27 from __future__ import annotations
28 28
29 import argparse 29 import argparse
30 -import json  
31 import sys 30 import sys
32 from collections import defaultdict 31 from collections import defaultdict
33 from dataclasses import dataclass 32 from dataclasses import dataclass
@@ -54,7 +53,7 @@ class CacheTypeConfig:
54 53
55 54
56 def _load_known_cache_types() -> Dict[str, CacheTypeConfig]: 55 def _load_known_cache_types() -> Dict[str, CacheTypeConfig]:
57 - """根据当前配置装配三种已知缓存类型及其前缀 pattern。""" 56 + """根据当前配置装配仓库内仍在使用的缓存类型及其前缀 pattern。"""
58 cache_types: Dict[str, CacheTypeConfig] = {} 57 cache_types: Dict[str, CacheTypeConfig] = {}
59 58
60 # embedding 缓存:prefix 来自 REDIS_CONFIG['embedding_cache_prefix'](默认 embedding) 59 # embedding 缓存:prefix 来自 REDIS_CONFIG['embedding_cache_prefix'](默认 embedding)
@@ -72,14 +71,6 @@ def _load_known_cache_types() -> Dict[str, CacheTypeConfig]:
72 description="翻译结果缓存(translation/service.py)", 71 description="翻译结果缓存(translation/service.py)",
73 ) 72 )
74 73
75 - # anchors 缓存:prefix 来自 REDIS_CONFIG['anchor_cache_prefix'](若存在),否则 product_anchors  
76 - anchor_prefix = REDIS_CONFIG.get("anchor_cache_prefix", "product_anchors")  
77 - cache_types["anchors"] = CacheTypeConfig(  
78 - name="anchors",  
79 - pattern=f"{anchor_prefix}:*",  
80 - description="商品内容理解缓存(indexer/product_enrich.py,anchors/语义属性/tags)",  
81 - )  
82 -  
83 return cache_types 74 return cache_types
84 75
85 76
@@ -162,23 +153,6 @@ def decode_value_preview(
162 except Exception: 153 except Exception:
163 return f"<binary {len(raw_value)} bytes>" 154 return f"<binary {len(raw_value)} bytes>"
164 155
165 - # anchors: JSON dict  
166 - if cache_type == "anchors":  
167 - try:  
168 - text = raw_value.decode("utf-8", errors="replace")  
169 - obj = json.loads(text)  
170 - if isinstance(obj, dict):  
171 - brief = {  
172 - k: obj.get(k)  
173 - for k in ["id", "lang", "title_input", "title", "category_path", "anchor_text"]  
174 - if k in obj  
175 - }  
176 - return "json " + json.dumps(brief, ensure_ascii=False)[:200]  
177 - # 其他情况简单截断  
178 - return "json " + text[:200]  
179 - except Exception:  
180 - return raw_value.decode("utf-8", errors="replace")[:200]  
181 -  
182 # translation: 纯字符串 156 # translation: 纯字符串
183 if cache_type == "translation": 157 if cache_type == "translation":
184 try: 158 try:
@@ -308,8 +282,8 @@ def main() -> None:
308 "--type", 282 "--type",
309 dest="types", 283 dest="types",
310 nargs="+", 284 nargs="+",
311 - choices=["embedding", "translation", "anchors"],  
312 - help="指定要检查的缓存类型(默认:三种全部)", 285 + choices=["embedding", "translation"],
  286 + help="指定要检查的缓存类型(默认:两种全部)",
313 ) 287 )
314 parser.add_argument( 288 parser.add_argument(
315 "--pattern", 289 "--pattern",
scripts/redis/redis_cache_prefix_stats.py
@@ -15,7 +15,7 @@ python scripts/redis/redis_cache_prefix_stats.py --all-db
15 统计指定数据库: 15 统计指定数据库:
16 python scripts/redis/redis_cache_prefix_stats.py --db 1 16 python scripts/redis/redis_cache_prefix_stats.py --db 1
17 17
18 -只统计以下三种前缀: 18 +只统计若干常见前缀:
19 python scripts/redis/redis_cache_prefix_stats.py --prefix trans embedding product 19 python scripts/redis/redis_cache_prefix_stats.py --prefix trans embedding product
20 20
21 统计所有数据库的指定前缀: 21 统计所有数据库的指定前缀:
tests/ci/test_service_api_contracts.py
@@ -342,162 +342,15 @@ def test_indexer_build_docs_from_db_contract(indexer_client: TestClient):
342 assert data["docs"][0]["spu_id"] == "1001" 342 assert data["docs"][0]["spu_id"] == "1001"
343 343
344 344
345 -def test_indexer_enrich_content_contract(indexer_client: TestClient, monkeypatch):  
346 - import indexer.product_enrich as process_products  
347 -  
348 - def _fake_build_index_content_fields(  
349 - items: List[Dict[str, str]],  
350 - tenant_id: str | None = None,  
351 - enrichment_scopes: List[str] | None = None,  
352 - category_taxonomy_profile: str = "apparel",  
353 - ):  
354 - assert tenant_id == "162"  
355 - assert enrichment_scopes == ["generic", "category_taxonomy"]  
356 - assert category_taxonomy_profile == "apparel"  
357 - return [  
358 - {  
359 - "id": p["spu_id"],  
360 - "qanchors": {  
361 - "zh": [f"zh-anchor-{p['spu_id']}"],  
362 - "en": [f"en-anchor-{p['spu_id']}"],  
363 - },  
364 - "enriched_tags": {"zh": ["tag1", "tag2"], "en": ["tag1", "tag2"]},  
365 - "enriched_attributes": [  
366 - {"name": "enriched_tags", "value": {"zh": ["tag1"], "en": ["tag1"]}},  
367 - ],  
368 - "enriched_taxonomy_attributes": [  
369 - {"name": "Product Type", "value": {"zh": ["T恤"], "en": ["t-shirt"]}},  
370 - ],  
371 - }  
372 - for p in items  
373 - ]  
374 -  
375 - monkeypatch.setattr(process_products, "build_index_content_fields", _fake_build_index_content_fields)  
376 -  
377 - response = indexer_client.post(  
378 - "/indexer/enrich-content",  
379 - json={  
380 - "tenant_id": "162",  
381 - "enrichment_scopes": ["generic", "category_taxonomy"],  
382 - "category_taxonomy_profile": "apparel",  
383 - "items": [  
384 - {"spu_id": "1001", "title": "T-shirt"},  
385 - {"spu_id": "1002", "title": "Toy"},  
386 - ],  
387 - },  
388 - )  
389 - assert response.status_code == 200  
390 - data = response.json()  
391 - assert data["tenant_id"] == "162"  
392 - assert data["enrichment_scopes"] == ["generic", "category_taxonomy"]  
393 - assert data["category_taxonomy_profile"] == "apparel"  
394 - assert data["total"] == 2  
395 - assert len(data["results"]) == 2  
396 - assert data["results"][0]["spu_id"] == "1001"  
397 - assert data["results"][0]["qanchors"]["zh"] == ["zh-anchor-1001"]  
398 - assert data["results"][0]["qanchors"]["en"] == ["en-anchor-1001"]  
399 - assert data["results"][0]["enriched_tags"]["zh"] == ["tag1", "tag2"]  
400 - assert data["results"][0]["enriched_tags"]["en"] == ["tag1", "tag2"]  
401 - assert data["results"][0]["enriched_attributes"][0] == {  
402 - "name": "enriched_tags",  
403 - "value": {"zh": ["tag1"], "en": ["tag1"]},  
404 - }  
405 - assert data["results"][0]["enriched_taxonomy_attributes"][0] == {  
406 - "name": "Product Type",  
407 - "value": {"zh": ["T恤"], "en": ["t-shirt"]},  
408 - }  
409 -  
410 -  
411 -def test_indexer_enrich_content_contract_accepts_deprecated_analysis_kinds(indexer_client: TestClient, monkeypatch):  
412 - import indexer.product_enrich as process_products  
413 -  
414 - seen: Dict[str, Any] = {}  
415 -  
416 - def _fake_build_index_content_fields(  
417 - items: List[Dict[str, str]],  
418 - tenant_id: str | None = None,  
419 - enrichment_scopes: List[str] | None = None,  
420 - category_taxonomy_profile: str = "apparel",  
421 - ):  
422 - seen["tenant_id"] = tenant_id  
423 - seen["enrichment_scopes"] = enrichment_scopes  
424 - seen["category_taxonomy_profile"] = category_taxonomy_profile  
425 - return [  
426 - {  
427 - "id": items[0]["spu_id"],  
428 - "qanchors": {},  
429 - "enriched_tags": {},  
430 - "enriched_attributes": [],  
431 - "enriched_taxonomy_attributes": [],  
432 - }  
433 - ]  
434 -  
435 - monkeypatch.setattr(process_products, "build_index_content_fields", _fake_build_index_content_fields)  
436 - 345 +def test_indexer_enrich_content_route_removed(indexer_client: TestClient):
437 response = indexer_client.post( 346 response = indexer_client.post(
438 "/indexer/enrich-content", 347 "/indexer/enrich-content",
439 json={ 348 json={
440 "tenant_id": "162", 349 "tenant_id": "162",
441 - "analysis_kinds": ["taxonomy"],  
442 "items": [{"spu_id": "1001", "title": "T-shirt"}], 350 "items": [{"spu_id": "1001", "title": "T-shirt"}],
443 }, 351 },
444 ) 352 )
445 -  
446 - assert response.status_code == 200  
447 - data = response.json()  
448 - assert seen == {  
449 - "tenant_id": "162",  
450 - "enrichment_scopes": ["category_taxonomy"],  
451 - "category_taxonomy_profile": "apparel",  
452 - }  
453 - assert data["enrichment_scopes"] == ["category_taxonomy"]  
454 - assert data["category_taxonomy_profile"] == "apparel"  
455 -  
456 -  
457 -def test_indexer_enrich_content_contract_supports_non_apparel_taxonomy_profiles(indexer_client: TestClient, monkeypatch):  
458 - import indexer.product_enrich as process_products  
459 -  
460 - def _fake_build_index_content_fields(  
461 - items: List[Dict[str, str]],  
462 - tenant_id: str | None = None,  
463 - enrichment_scopes: List[str] | None = None,  
464 - category_taxonomy_profile: str = "apparel",  
465 - ):  
466 - assert tenant_id == "162"  
467 - assert enrichment_scopes == ["category_taxonomy"]  
468 - assert category_taxonomy_profile == "toys"  
469 - return [  
470 - {  
471 - "id": items[0]["spu_id"],  
472 - "qanchors": {},  
473 - "enriched_tags": {},  
474 - "enriched_attributes": [],  
475 - "enriched_taxonomy_attributes": [  
476 - {"name": "Product Type", "value": {"en": ["doll set"]}},  
477 - {"name": "Age Group", "value": {"en": ["kids"]}},  
478 - ],  
479 - }  
480 - ]  
481 -  
482 - monkeypatch.setattr(process_products, "build_index_content_fields", _fake_build_index_content_fields)  
483 -  
484 - response = indexer_client.post(  
485 - "/indexer/enrich-content",  
486 - json={  
487 - "tenant_id": "162",  
488 - "enrichment_scopes": ["category_taxonomy"],  
489 - "category_taxonomy_profile": "toys",  
490 - "items": [{"spu_id": "1001", "title": "Toy"}],  
491 - },  
492 - )  
493 -  
494 - assert response.status_code == 200  
495 - data = response.json()  
496 - assert data["category_taxonomy_profile"] == "toys"  
497 - assert data["results"][0]["enriched_taxonomy_attributes"] == [  
498 - {"name": "Product Type", "value": {"en": ["doll set"]}},  
499 - {"name": "Age Group", "value": {"en": ["kids"]}},  
500 - ] 353 + assert response.status_code == 404
501 354
502 355
503 def test_indexer_documents_contract(indexer_client: TestClient): 356 def test_indexer_documents_contract(indexer_client: TestClient):
@@ -614,17 +467,6 @@ def test_indexer_build_docs_from_db_validation_max_spu_ids(indexer_client: TestC
614 assert response.status_code == 400 467 assert response.status_code == 400
615 468
616 469
617 -def test_indexer_enrich_content_validation_max_items(indexer_client: TestClient):  
618 - response = indexer_client.post(  
619 - "/indexer/enrich-content",  
620 - json={  
621 - "tenant_id": "162",  
622 - "items": [{"spu_id": str(i), "title": "x"} for i in range(51)],  
623 - },  
624 - )  
625 - assert response.status_code == 400  
626 -  
627 -  
628 def test_indexer_documents_validation_max_spu_ids(indexer_client: TestClient): 470 def test_indexer_documents_validation_max_spu_ids(indexer_client: TestClient):
629 """POST /indexer/documents: 400 when spu_ids > 100.""" 471 """POST /indexer/documents: 400 when spu_ids > 100."""
630 response = indexer_client.post( 472 response = indexer_client.post(
tests/test_llm_enrichment_batch_fill.py deleted
@@ -1,72 +0,0 @@
1 -from __future__ import annotations  
2 -  
3 -from typing import Any, Dict, List  
4 -  
5 -import pandas as pd  
6 -  
7 -from indexer.document_transformer import SPUDocumentTransformer  
8 -  
9 -  
10 -def test_fill_llm_attributes_batch_uses_product_enrich_helper(monkeypatch):  
11 - seen_calls: List[Dict[str, Any]] = []  
12 -  
13 - def _fake_build_index_content_fields(items, tenant_id=None, category_taxonomy_profile=None):  
14 - seen_calls.append(  
15 - {  
16 - "n": len(items),  
17 - "tenant_id": tenant_id,  
18 - "category_taxonomy_profile": category_taxonomy_profile,  
19 - }  
20 - )  
21 - return [  
22 - {  
23 - "id": item["id"],  
24 - "qanchors": {  
25 - "zh": [f"zh-anchor-{item['id']}"],  
26 - "en": [f"en-anchor-{item['id']}"],  
27 - },  
28 - "enriched_tags": {"zh": ["t1", "t2"], "en": ["t1", "t2"]},  
29 - "enriched_attributes": [  
30 - {"name": "tags", "value": {"zh": ["t1"], "en": ["t1"]}},  
31 - ],  
32 - "enriched_taxonomy_attributes": [  
33 - {"name": "Product Type", "value": {"zh": ["连衣裙"], "en": ["dress"]}},  
34 - ],  
35 - }  
36 - for item in items  
37 - ]  
38 -  
39 - import indexer.document_transformer as doc_tr  
40 -  
41 - monkeypatch.setattr(doc_tr, "build_index_content_fields", _fake_build_index_content_fields)  
42 -  
43 - transformer = SPUDocumentTransformer(  
44 - category_id_to_name={},  
45 - searchable_option_dimensions=[],  
46 - tenant_config={"index_languages": ["zh", "en"], "primary_language": "zh"},  
47 - translator=None,  
48 - encoder=None,  
49 - enable_title_embedding=False,  
50 - image_encoder=None,  
51 - enable_image_embedding=False,  
52 - )  
53 -  
54 - docs: List[Dict[str, Any]] = []  
55 - rows: List[pd.Series] = []  
56 - for i in range(45):  
57 - docs.append({"tenant_id": "162", "spu_id": str(i)})  
58 - rows.append(pd.Series({"id": i, "title": f"title-{i}"}))  
59 -  
60 - transformer.fill_llm_attributes_batch(docs, rows)  
61 -  
62 - assert seen_calls == [{"n": 45, "tenant_id": "162", "category_taxonomy_profile": "apparel"}]  
63 -  
64 - assert docs[0]["qanchors"]["zh"] == ["zh-anchor-0"]  
65 - assert docs[0]["qanchors"]["en"] == ["en-anchor-0"]  
66 - assert docs[0]["enriched_tags"]["zh"] == ["t1", "t2"]  
67 - assert docs[0]["enriched_tags"]["en"] == ["t1", "t2"]  
68 - assert {"name": "tags", "value": {"zh": ["t1"], "en": ["t1"]}} in docs[0]["enriched_attributes"]  
69 - assert {  
70 - "name": "Product Type",  
71 - "value": {"zh": ["连衣裙"], "en": ["dress"]},  
72 - } in docs[0]["enriched_taxonomy_attributes"]  
tests/test_process_products_batching.py deleted
@@ -1,104 +0,0 @@
1 -from __future__ import annotations  
2 -  
3 -from typing import Any, Dict, List  
4 -  
5 -import indexer.product_enrich as process_products  
6 -  
7 -  
8 -def _mk_products(n: int) -> List[Dict[str, str]]:  
9 - return [{"id": str(i), "title": f"title-{i}"} for i in range(n)]  
10 -  
11 -  
12 -def test_analyze_products_caps_batch_size_to_20(monkeypatch):  
13 - monkeypatch.setattr(process_products, "API_KEY", "fake-key")  
14 - seen_batch_sizes: List[int] = []  
15 -  
16 - def _fake_process_batch(  
17 - batch_data: List[Dict[str, str]],  
18 - batch_num: int,  
19 - target_lang: str = "zh",  
20 - analysis_kind: str = "content",  
21 - category_taxonomy_profile=None,  
22 - ):  
23 - assert analysis_kind == "content"  
24 - assert category_taxonomy_profile is None  
25 - seen_batch_sizes.append(len(batch_data))  
26 - return [  
27 - {  
28 - "id": item["id"],  
29 - "lang": target_lang,  
30 - "title_input": item["title"],  
31 - "title": "",  
32 - "category_path": "",  
33 - "tags": "",  
34 - "target_audience": "",  
35 - "usage_scene": "",  
36 - "season": "",  
37 - "key_attributes": "",  
38 - "material": "",  
39 - "features": "",  
40 - "anchor_text": "",  
41 - }  
42 - for item in batch_data  
43 - ]  
44 -  
45 - monkeypatch.setattr(process_products, "process_batch", _fake_process_batch)  
46 - monkeypatch.setattr(process_products, "_set_cached_analysis_result", lambda *args, **kwargs: None)  
47 -  
48 - out = process_products.analyze_products(  
49 - products=_mk_products(45),  
50 - target_lang="zh",  
51 - batch_size=200,  
52 - tenant_id="162",  
53 - )  
54 -  
55 - assert len(out) == 45  
56 - # 并发执行时 batch 调用顺序可能变化,因此校验“批大小集合”而不是严格顺序  
57 - assert sorted(seen_batch_sizes) == [5, 20, 20]  
58 -  
59 -  
60 -def test_analyze_products_uses_min_batch_size_1(monkeypatch):  
61 - monkeypatch.setattr(process_products, "API_KEY", "fake-key")  
62 - seen_batch_sizes: List[int] = []  
63 -  
64 - def _fake_process_batch(  
65 - batch_data: List[Dict[str, str]],  
66 - batch_num: int,  
67 - target_lang: str = "zh",  
68 - analysis_kind: str = "content",  
69 - category_taxonomy_profile=None,  
70 - ):  
71 - assert analysis_kind == "content"  
72 - assert category_taxonomy_profile is None  
73 - seen_batch_sizes.append(len(batch_data))  
74 - return [  
75 - {  
76 - "id": item["id"],  
77 - "lang": target_lang,  
78 - "title_input": item["title"],  
79 - "title": "",  
80 - "category_path": "",  
81 - "tags": "",  
82 - "target_audience": "",  
83 - "usage_scene": "",  
84 - "season": "",  
85 - "key_attributes": "",  
86 - "material": "",  
87 - "features": "",  
88 - "anchor_text": "",  
89 - }  
90 - for item in batch_data  
91 - ]  
92 -  
93 - monkeypatch.setattr(process_products, "process_batch", _fake_process_batch)  
94 - monkeypatch.setattr(process_products, "_set_cached_analysis_result", lambda *args, **kwargs: None)  
95 -  
96 - out = process_products.analyze_products(  
97 - products=_mk_products(3),  
98 - target_lang="zh",  
99 - batch_size=0,  
100 - tenant_id="162",  
101 - )  
102 -  
103 - assert len(out) == 3  
104 - assert seen_batch_sizes == [1, 1, 1]  
tests/test_product_enrich_partial_mode.py deleted
@@ -1,736 +0,0 @@
1 -from __future__ import annotations  
2 -  
3 -import importlib.util  
4 -import io  
5 -import json  
6 -import logging  
7 -import sys  
8 -import types  
9 -from pathlib import Path  
10 -from unittest import mock  
11 -  
12 -  
13 -def _load_product_enrich_module():  
14 - if "dotenv" not in sys.modules:  
15 - fake_dotenv = types.ModuleType("dotenv")  
16 - fake_dotenv.load_dotenv = lambda *args, **kwargs: None  
17 - sys.modules["dotenv"] = fake_dotenv  
18 -  
19 - if "redis" not in sys.modules:  
20 - fake_redis = types.ModuleType("redis")  
21 -  
22 - class _FakeRedisClient:  
23 - def __init__(self, *args, **kwargs):  
24 - pass  
25 -  
26 - def ping(self):  
27 - return True  
28 -  
29 - fake_redis.Redis = _FakeRedisClient  
30 - sys.modules["redis"] = fake_redis  
31 -  
32 - repo_root = Path(__file__).resolve().parents[1]  
33 - if str(repo_root) not in sys.path:  
34 - sys.path.insert(0, str(repo_root))  
35 -  
36 - module_path = repo_root / "indexer" / "product_enrich.py"  
37 - spec = importlib.util.spec_from_file_location("product_enrich_under_test", module_path)  
38 - module = importlib.util.module_from_spec(spec)  
39 - assert spec and spec.loader  
40 - spec.loader.exec_module(module)  
41 - return module  
42 -  
43 -  
44 -product_enrich = _load_product_enrich_module()  
45 -  
46 -  
47 -def _attach_stream(logger_obj: logging.Logger):  
48 - stream = io.StringIO()  
49 - handler = logging.StreamHandler(stream)  
50 - handler.setFormatter(logging.Formatter("%(message)s"))  
51 - logger_obj.addHandler(handler)  
52 - return stream, handler  
53 -  
54 -  
55 -def test_create_prompt_splits_shared_context_and_localized_tail():  
56 - products = [  
57 - {"id": "1", "title": "dress"},  
58 - {"id": "2", "title": "linen shirt"},  
59 - ]  
60 -  
61 - shared_zh, user_zh, prefix_zh = product_enrich.create_prompt(products, target_lang="zh")  
62 - shared_en, user_en, prefix_en = product_enrich.create_prompt(products, target_lang="en")  
63 -  
64 - assert shared_zh == shared_en  
65 - assert "Analyze each input product text" in shared_zh  
66 - assert "1. dress" in shared_zh  
67 - assert "2. linen shirt" in shared_zh  
68 - assert "Product list" not in user_zh  
69 - assert "Product list" not in user_en  
70 - assert "specified language" in user_zh  
71 - assert "Language: Chinese" in user_zh  
72 - assert "Language: English" in user_en  
73 - assert prefix_zh.startswith("| 序号 | 商品标题 | 品类路径 |")  
74 - assert prefix_en.startswith("| No. | Product title | Category path |")  
75 -  
76 -  
77 -def test_create_prompt_supports_taxonomy_analysis_kind():  
78 - products = [{"id": "1", "title": "linen dress"}]  
79 -  
80 - shared_zh, user_zh, prefix_zh = product_enrich.create_prompt(  
81 - products,  
82 - target_lang="zh",  
83 - analysis_kind="taxonomy",  
84 - )  
85 - shared_fr, user_fr, prefix_fr = product_enrich.create_prompt(  
86 - products,  
87 - target_lang="fr",  
88 - analysis_kind="taxonomy",  
89 - )  
90 -  
91 - assert "apparel attribute taxonomy" in shared_zh  
92 - assert "1. linen dress" in shared_zh  
93 - assert "Language: Chinese" in user_zh  
94 - assert "Language: French" in user_fr  
95 - assert prefix_zh.startswith("| 序号 | 品类 | 目标性别 |")  
96 - assert prefix_fr.startswith("| No. | Product Type | Target Gender |")  
97 -  
98 -  
99 -def test_call_llm_logs_shared_context_once_and_verbose_contains_full_requests():
100 -    payloads = []
101 -    response_bodies = [
102 -        {
103 -            "choices": [
104 -                {
105 -                    "message": {
106 -                        "content": (
107 -                            "| 1 | 连衣裙 | 女装>连衣裙 | 法式,收腰 | 年轻女性 | "
108 -                            "通勤,约会 | 春季,夏季 | 中长款 | 聚酯纤维 | 透气 | "
109 -                            "修身显瘦 | 法式收腰连衣裙 |\n"
110 -                        )
111 -                    }
112 -                }
113 -            ],
114 -            "usage": {"prompt_tokens": 120, "completion_tokens": 45, "total_tokens": 165},
115 -        },
116 -        {
117 -            "choices": [
118 -                {
119 -                    "message": {
120 -                        "content": (
121 -                            "| 1 | Dress | Women>Dress | French,Waisted | Young women | "
122 -                            "Commute,Date | Spring,Summer | Midi | Polyester | Breathable | "
123 -                            "Slim fit | French waisted dress |\n"
124 -                        )
125 -                    }
126 -                }
127 -            ],
128 -            "usage": {"prompt_tokens": 118, "completion_tokens": 43, "total_tokens": 161},
129 -        },
130 -    ]
131 -
132 -    class _FakeResponse:
133 -        def __init__(self, body):
134 -            self.body = body
135 -
136 -        def raise_for_status(self):
137 -            return None
138 -
139 -        def json(self):
140 -            return self.body
141 -
142 -    class _FakeSession:
143 -        trust_env = True
144 -
145 -        def post(self, url, headers=None, json=None, timeout=None, proxies=None):
146 -            del url, headers, timeout, proxies
147 -            payloads.append(json)
148 -            return _FakeResponse(response_bodies[len(payloads) - 1])
149 -
150 -        def close(self):
151 -            return None
152 -
153 -    product_enrich.reset_logged_shared_context_keys()
154 -    main_stream, main_handler = _attach_stream(product_enrich.logger)
155 -    verbose_stream, verbose_handler = _attach_stream(product_enrich.verbose_logger)
156 -
157 -    try:
158 -        with mock.patch.object(product_enrich, "API_KEY", "fake-key"), mock.patch.object(
159 -            product_enrich.requests,
160 -            "Session",
161 -            lambda: _FakeSession(),
162 -        ):
163 -            zh_shared, zh_user, zh_prefix = product_enrich.create_prompt(
164 -                [{"id": "1", "title": "dress"}],
165 -                target_lang="zh",
166 -            )
167 -            en_shared, en_user, en_prefix = product_enrich.create_prompt(
168 -                [{"id": "1", "title": "dress"}],
169 -                target_lang="en",
170 -            )
171 -
172 -            zh_markdown, zh_raw = product_enrich.call_llm(
173 -                zh_shared,
174 -                zh_user,
175 -                zh_prefix,
176 -                target_lang="zh",
177 -            )
178 -            en_markdown, en_raw = product_enrich.call_llm(
179 -                en_shared,
180 -                en_user,
181 -                en_prefix,
182 -                target_lang="en",
183 -            )
184 -    finally:
185 -        product_enrich.logger.removeHandler(main_handler)
186 -        product_enrich.verbose_logger.removeHandler(verbose_handler)
187 -
188 -    assert zh_shared == en_shared
189 -    assert len(payloads) == 2
190 -    assert len(payloads[0]["messages"]) == 3
191 -    assert payloads[0]["messages"][1]["role"] == "user"
192 -    assert "1. dress" in payloads[0]["messages"][1]["content"]
193 -    assert "Language: Chinese" in payloads[0]["messages"][1]["content"]
194 -    assert "Language: English" in payloads[1]["messages"][1]["content"]
195 -    assert payloads[0]["messages"][-1]["partial"] is True
196 -    assert payloads[1]["messages"][-1]["partial"] is True
197 -
198 -    main_log = main_stream.getvalue()
199 -    verbose_log = verbose_stream.getvalue()
200 -
201 -    assert main_log.count("LLM Shared Context") == 1
202 -    assert main_log.count("LLM Request Variant") == 2
203 -    assert "Localized Requirement" in main_log
204 -    assert "Shared Context" in main_log
205 -
206 -    assert verbose_log.count("LLM Request [model=") == 2
207 -    assert verbose_log.count("LLM Response [model=") == 2
208 -    assert '"partial": true' in verbose_log
209 -    assert "Combined User Prompt" in verbose_log
210 -    assert "French waisted dress" in verbose_log
211 -    assert "法式收腰连衣裙" in verbose_log
212 -
213 -    assert zh_markdown.startswith(zh_prefix)
214 -    assert en_markdown.startswith(en_prefix)
215 -    assert json.loads(zh_raw)["usage"]["total_tokens"] == 165
216 -    assert json.loads(en_raw)["usage"]["total_tokens"] == 161
217 -  
218 -  
219 -def test_process_batch_reads_result_and_validates_expected_fields():
220 -    merged_markdown = """| 序号 | 商品标题 | 品类路径 | 细分标签 | 适用人群 | 使用场景 | 适用季节 | 关键属性 | 材质说明 | 功能特点 | 锚文本 |
221 -|----|----|----|----|----|----|----|----|----|----|----|
222 -| 1 | 法式连衣裙 | 女装>连衣裙 | 法式,收腰 | 年轻女性 | 通勤,约会 | 春季,夏季 | 中长款 | 聚酯纤维 | 透气 | 法式收腰连衣裙 |
223 -"""
224 -
225 -    with mock.patch.object(
226 -        product_enrich,
227 -        "call_llm",
228 -        return_value=(merged_markdown, json.dumps({"choices": [{"message": {"content": "stub"}}]})),
229 -    ):
230 -        results = product_enrich.process_batch(
231 -            [{"id": "sku-1", "title": "dress"}],
232 -            batch_num=1,
233 -            target_lang="zh",
234 -        )
235 -
236 -    assert len(results) == 1
237 -    row = results[0]
238 -    assert row["id"] == "sku-1"
239 -    assert row["lang"] == "zh"
240 -    assert row["title_input"] == "dress"
241 -    assert row["title"] == "法式连衣裙"
242 -    assert row["category_path"] == "女装>连衣裙"
243 -    assert row["tags"] == "法式,收腰"
244 -    assert row["target_audience"] == "年轻女性"
245 -    assert row["usage_scene"] == "通勤,约会"
246 -    assert row["season"] == "春季,夏季"
247 -    assert row["key_attributes"] == "中长款"
248 -    assert row["material"] == "聚酯纤维"
249 -    assert row["features"] == "透气"
250 -    assert row["anchor_text"] == "法式收腰连衣裙"
251 -  
252 -  
253 -def test_process_batch_reads_taxonomy_result_with_schema_specific_fields():
254 -    merged_markdown = """| 序号 | 品类 | 目标性别 | 年龄段 | 适用季节 | 版型 | 廓形 | 领型 | 袖长类型 | 袖型 | 肩带设计 | 腰型 | 裤型 | 裙型 | 长度类型 | 闭合方式 | 设计细节 | 面料 | 成分 | 面料特性 | 服装特征 | 功能 | 主颜色 | 色系 | 印花 / 图案 | 适用场景 | 风格 |
255 -|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
256 -| 1 | 连衣裙 | 女 | 成人 | 春季,夏季 | 修身 | A字 | V领 | 无袖 | | 细肩带 | 高腰 | | A字裙 | 中长款 | 拉链 | 褶皱 | 梭织 | 聚酯纤维,氨纶 | 轻薄,透气 | 有内衬 | 易打理 | 酒红色 | 红色 | 纯色 | 约会,度假 | 浪漫 |
257 -"""
258 -
259 -    with mock.patch.object(
260 -        product_enrich,
261 -        "call_llm",
262 -        return_value=(merged_markdown, json.dumps({"choices": [{"message": {"content": "stub"}}]})),
263 -    ):
264 -        results = product_enrich.process_batch(
265 -            [{"id": "sku-1", "title": "dress"}],
266 -            batch_num=1,
267 -            target_lang="zh",
268 -            analysis_kind="taxonomy",
269 -        )
270 -
271 -    assert len(results) == 1
272 -    row = results[0]
273 -    assert row["id"] == "sku-1"
274 -    assert row["lang"] == "zh"
275 -    assert row["title_input"] == "dress"
276 -    assert row["product_type"] == "连衣裙"
277 -    assert row["target_gender"] == "女"
278 -    assert row["age_group"] == "成人"
279 -    assert row["sleeve_length_type"] == "无袖"
280 -    assert row["material_composition"] == "聚酯纤维,氨纶"
281 -    assert row["occasion_end_use"] == "约会,度假"
282 -    assert row["style_aesthetic"] == "浪漫"
283 -  
284 -  
285 -def test_analyze_products_uses_product_level_cache_across_batch_requests():
286 -    cache_store = {}
287 -    process_calls = []
288 -
289 -    def _cache_key(product, target_lang):
290 -        return (
291 -            target_lang,
292 -            product.get("title", ""),
293 -            product.get("brief", ""),
294 -            product.get("description", ""),
295 -            product.get("image_url", ""),
296 -        )
297 -
298 -    def fake_get_cached_analysis_result(
299 -        product,
300 -        target_lang,
301 -        analysis_kind="content",
302 -        category_taxonomy_profile=None,
303 -    ):
304 -        assert analysis_kind == "content"
305 -        assert category_taxonomy_profile is None
306 -        return cache_store.get(_cache_key(product, target_lang))
307 -
308 -    def fake_set_cached_analysis_result(
309 -        product,
310 -        target_lang,
311 -        result,
312 -        analysis_kind="content",
313 -        category_taxonomy_profile=None,
314 -    ):
315 -        assert analysis_kind == "content"
316 -        assert category_taxonomy_profile is None
317 -        cache_store[_cache_key(product, target_lang)] = result
318 -
319 -    def fake_process_batch(
320 -        batch_data,
321 -        batch_num,
322 -        target_lang="zh",
323 -        analysis_kind="content",
324 -        category_taxonomy_profile=None,
325 -    ):
326 -        assert analysis_kind == "content"
327 -        assert category_taxonomy_profile is None
328 -        process_calls.append(
329 -            {
330 -                "batch_num": batch_num,
331 -                "target_lang": target_lang,
332 -                "titles": [item["title"] for item in batch_data],
333 -            }
334 -        )
335 -        return [
336 -            {
337 -                "id": item["id"],
338 -                "lang": target_lang,
339 -                "title_input": item["title"],
340 -                "title": f"normalized:{item['title']}",
341 -                "category_path": "cat",
342 -                "tags": "tags",
343 -                "target_audience": "audience",
344 -                "usage_scene": "scene",
345 -                "season": "season",
346 -                "key_attributes": "attrs",
347 -                "material": "material",
348 -                "features": "features",
349 -                "anchor_text": f"anchor:{item['title']}",
350 -            }
351 -            for item in batch_data
352 -        ]
353 -
354 -    products = [
355 -        {"id": "1", "title": "dress"},
356 -        {"id": "2", "title": "shirt"},
357 -    ]
358 -
359 -    with mock.patch.object(product_enrich, "API_KEY", "fake-key"), mock.patch.object(
360 -        product_enrich,
361 -        "_get_cached_analysis_result",
362 -        side_effect=fake_get_cached_analysis_result,
363 -    ), mock.patch.object(
364 -        product_enrich,
365 -        "_set_cached_analysis_result",
366 -        side_effect=fake_set_cached_analysis_result,
367 -    ), mock.patch.object(
368 -        product_enrich,
369 -        "process_batch",
370 -        side_effect=fake_process_batch,
371 -    ):
372 -        first = product_enrich.analyze_products(
373 -            [products[0]],
374 -            target_lang="zh",
375 -            tenant_id="170",
376 -        )
377 -        second = product_enrich.analyze_products(
378 -            products,
379 -            target_lang="zh",
380 -            tenant_id="999",
381 -        )
382 -        third = product_enrich.analyze_products(
383 -            products,
384 -            target_lang="zh",
385 -            tenant_id="170",
386 -        )
387 -
388 -    assert [row["title_input"] for row in first] == ["dress"]
389 -    assert [row["title_input"] for row in second] == ["dress", "shirt"]
390 -    assert [row["title_input"] for row in third] == ["dress", "shirt"]
391 -
392 -    assert process_calls == [
393 -        {"batch_num": 1, "target_lang": "zh", "titles": ["dress"]},
394 -        {"batch_num": 1, "target_lang": "zh", "titles": ["shirt"]},
395 -    ]
396 -    assert second[0]["anchor_text"] == "anchor:dress"
397 -    assert second[1]["anchor_text"] == "anchor:shirt"
398 -    assert third[0]["anchor_text"] == "anchor:dress"
399 -    assert third[1]["anchor_text"] == "anchor:shirt"
400 -  
401 -  
402 -def test_analyze_products_reuses_cached_content_with_current_product_identity():
403 -    cached_result = {
404 -        "id": "1165",
405 -        "lang": "zh",
406 -        "title_input": "old-title",
407 -        "title": "法式连衣裙",
408 -        "category_path": "女装>连衣裙",
409 -        "enriched_tags": "法式,收腰",
410 -        "target_audience": "年轻女性",
411 -        "usage_scene": "通勤,约会",
412 -        "season": "春季,夏季",
413 -        "key_attributes": "中长款",
414 -        "material": "聚酯纤维",
415 -        "features": "透气",
416 -        "anchor_text": "法式收腰连衣裙",
417 -    }
418 -    products = [{"id": "69960", "title": "dress"}]
419 -
420 -    with mock.patch.object(product_enrich, "API_KEY", "fake-key"), mock.patch.object(
421 -        product_enrich,
422 -        "_get_cached_analysis_result",
423 -        wraps=lambda product, target_lang, analysis_kind="content", category_taxonomy_profile=None: product_enrich._normalize_analysis_result(
424 -            cached_result,
425 -            product=product,
426 -            target_lang=target_lang,
427 -            schema=product_enrich._get_analysis_schema("content"),
428 -        ),
429 -    ), mock.patch.object(
430 -        product_enrich,
431 -        "process_batch",
432 -        side_effect=AssertionError("process_batch should not be called on cache hit"),
433 -    ):
434 -        result = product_enrich.analyze_products(
435 -            products,
436 -            target_lang="zh",
437 -            tenant_id="170",
438 -        )
439 -
440 -    assert result == [
441 -        {
442 -            "id": "69960",
443 -            "lang": "zh",
444 -            "title_input": "dress",
445 -            "title": "法式连衣裙",
446 -            "category_path": "女装>连衣裙",
447 -            "tags": "法式,收腰",
448 -            "target_audience": "年轻女性",
449 -            "usage_scene": "通勤,约会",
450 -            "season": "春季,夏季",
451 -            "key_attributes": "中长款",
452 -            "material": "聚酯纤维",
453 -            "features": "透气",
454 -            "anchor_text": "法式收腰连衣裙",
455 -        }
456 -    ]
457 -  
458 -  
459 -def test_build_index_content_fields_maps_internal_tags_to_enriched_tags_output():
460 -    def fake_analyze_products(
461 -        products,
462 -        target_lang="zh",
463 -        batch_size=None,
464 -        tenant_id=None,
465 -        analysis_kind="content",
466 -        category_taxonomy_profile=None,
467 -    ):
468 -        if analysis_kind == "taxonomy":
469 -            assert category_taxonomy_profile == "apparel"
470 -            return [
471 -                {
472 -                    "id": products[0]["id"],
473 -                    "lang": target_lang,
474 -                    "title_input": products[0]["title"],
475 -                    "product_type": f"{target_lang}-dress",
476 -                    "target_gender": f"{target_lang}-women",
477 -                    "age_group": "",
478 -                    "season": f"{target_lang}-summer",
479 -                    "fit": "",
480 -                    "silhouette": "",
481 -                    "neckline": "",
482 -                    "sleeve_length_type": "",
483 -                    "sleeve_style": "",
484 -                    "strap_type": "",
485 -                    "rise_waistline": "",
486 -                    "leg_shape": "",
487 -                    "skirt_shape": "",
488 -                    "length_type": "",
489 -                    "closure_type": "",
490 -                    "design_details": "",
491 -                    "fabric": "",
492 -                    "material_composition": "",
493 -                    "fabric_properties": "",
494 -                    "clothing_features": "",
495 -                    "functional_benefits": "",
496 -                    "color": "",
497 -                    "color_family": "",
498 -                    "print_pattern": "",
499 -                    "occasion_end_use": "",
500 -                    "style_aesthetic": "",
501 -                }
502 -            ]
503 -        return [
504 -            {
505 -                "id": products[0]["id"],
506 -                "lang": target_lang,
507 -                "title_input": products[0]["title"],
508 -                "title": products[0]["title"],
509 -                "category_path": "玩具>滑行玩具",
510 -                "tags": f"{target_lang}-tag1,{target_lang}-tag2",
511 -                "target_audience": f"{target_lang}-audience",
512 -                "usage_scene": "",
513 -                "season": "",
514 -                "key_attributes": "",
515 -                "material": "",
516 -                "features": "",
517 -                "anchor_text": f"{target_lang}-anchor",
518 -            }
519 -        ]
520 -
521 -    with mock.patch.object(
522 -        product_enrich,
523 -        "analyze_products",
524 -        side_effect=fake_analyze_products,
525 -    ):
526 -        result = product_enrich.build_index_content_fields(
527 -            items=[{"spu_id": "69960", "title": "dress"}],
528 -            tenant_id="170",
529 -        )
530 -
531 -    assert result == [
532 -        {
533 -            "id": "69960",
534 -            "qanchors": {"zh": ["zh-anchor"], "en": ["en-anchor"]},
535 -            "enriched_tags": {"zh": ["zh-tag1", "zh-tag2"], "en": ["en-tag1", "en-tag2"]},
536 -            "enriched_attributes": [
537 -                {
538 -                    "name": "enriched_tags",
539 -                    "value": {
540 -                        "zh": ["zh-tag1", "zh-tag2"],
541 -                        "en": ["en-tag1", "en-tag2"],
542 -                    },
543 -                },
544 -                {"name": "target_audience", "value": {"zh": ["zh-audience"], "en": ["en-audience"]}},
545 -            ],
546 -            "enriched_taxonomy_attributes": [
547 -                {
548 -                    "name": "Product Type",
549 -                    "value": {"zh": ["zh-dress"], "en": ["en-dress"]},
550 -                },
551 -                {
552 -                    "name": "Target Gender",
553 -                    "value": {"zh": ["zh-women"], "en": ["en-women"]},
554 -                },
555 -                {
556 -                    "name": "Season",
557 -                    "value": {"zh": ["zh-summer"], "en": ["en-summer"]},
558 -                },
559 -            ],
560 -        }
561 -    ]
562 -def test_build_index_content_fields_non_apparel_taxonomy_returns_en_only():
563 -    seen_calls = []
564 -
565 -    def fake_analyze_products(
566 -        products,
567 -        target_lang="zh",
568 -        batch_size=None,
569 -        tenant_id=None,
570 -        analysis_kind="content",
571 -        category_taxonomy_profile=None,
572 -    ):
573 -        seen_calls.append((analysis_kind, target_lang, category_taxonomy_profile, tuple(p["id"] for p in products)))
574 -        if analysis_kind == "taxonomy":
575 -            assert category_taxonomy_profile == "toys"
576 -            assert target_lang == "en"
577 -            return [
578 -                {
579 -                    "id": products[0]["id"],
580 -                    "lang": "en",
581 -                    "title_input": products[0]["title"],
582 -                    "product_type": "doll set",
583 -                    "age_group": "kids",
584 -                    "character_theme": "",
585 -                    "material": "",
586 -                    "power_source": "",
587 -                    "interactive_features": "",
588 -                    "educational_play_value": "",
589 -                    "piece_count_size": "",
590 -                    "color": "",
591 -                    "use_scenario": "",
592 -                }
593 -            ]
594 -
595 -        return [
596 -            {
597 -                "id": product["id"],
598 -                "lang": target_lang,
599 -                "title_input": product["title"],
600 -                "title": product["title"],
601 -                "category_path": "",
602 -                "tags": f"{target_lang}-tag",
603 -                "target_audience": "",
604 -                "usage_scene": "",
605 -                "season": "",
606 -                "key_attributes": "",
607 -                "material": "",
608 -                "features": "",
609 -                "anchor_text": f"{target_lang}-anchor",
610 -            }
611 -            for product in products
612 -        ]
613 -
614 -    with mock.patch.object(product_enrich, "analyze_products", side_effect=fake_analyze_products):
615 -        result = product_enrich.build_index_content_fields(
616 -            items=[{"spu_id": "2", "title": "toy"}],
617 -            tenant_id="170",
618 -            category_taxonomy_profile="toys",
619 -        )
620 -
621 -    assert result == [
622 -        {
623 -            "id": "2",
624 -            "qanchors": {"zh": ["zh-anchor"], "en": ["en-anchor"]},
625 -            "enriched_tags": {"zh": ["zh-tag"], "en": ["en-tag"]},
626 -            "enriched_attributes": [
627 -                {
628 -                    "name": "enriched_tags",
629 -                    "value": {
630 -                        "zh": ["zh-tag"],
631 -                        "en": ["en-tag"],
632 -                    },
633 -                }
634 -            ],
635 -            "enriched_taxonomy_attributes": [
636 -                {"name": "Product Type", "value": {"en": ["doll set"]}},
637 -                {"name": "Age Group", "value": {"en": ["kids"]}},
638 -            ],
639 -        }
640 -    ]
641 -    assert ("taxonomy", "zh", "toys", ("2",)) not in seen_calls
642 -    assert ("taxonomy", "en", "toys", ("2",)) in seen_calls
643 -  
644 -  
645 -def test_anchor_cache_key_depends_on_product_input_not_identifiers():
646 -    product_a = {
647 -        "id": "1",
648 -        "spu_id": "1001",
649 -        "title": "dress",
650 -        "brief": "soft cotton",
651 -        "description": "summer dress",
652 -        "image_url": "https://img/a.jpg",
653 -    }
654 -    product_b = {
655 -        "id": "2",
656 -        "spu_id": "9999",
657 -        "title": "dress",
658 -        "brief": "soft cotton",
659 -        "description": "summer dress",
660 -        "image_url": "https://img/a.jpg",
661 -    }
662 -    product_c = {
663 -        "id": "1",
664 -        "spu_id": "1001",
665 -        "title": "dress",
666 -        "brief": "soft cotton updated",
667 -        "description": "summer dress",
668 -        "image_url": "https://img/a.jpg",
669 -    }
670 -
671 -    key_a = product_enrich._make_anchor_cache_key(product_a, "zh")
672 -    key_b = product_enrich._make_anchor_cache_key(product_b, "zh")
673 -    key_c = product_enrich._make_anchor_cache_key(product_c, "zh")
674 -
675 -    assert key_a == key_b
676 -    assert key_a != key_c
677 -  
678 -  
679 -def test_analysis_cache_key_isolated_by_analysis_kind():
680 -    product = {
681 -        "id": "1",
682 -        "title": "dress",
683 -        "brief": "soft cotton",
684 -        "description": "summer dress",
685 -    }
686 -
687 -    content_key = product_enrich._make_analysis_cache_key(product, "zh", "content")
688 -    taxonomy_key = product_enrich._make_analysis_cache_key(product, "zh", "taxonomy")
689 -
690 -    assert content_key != taxonomy_key
691 -  
692 -  
693 -def test_analysis_cache_key_changes_when_prompt_contract_changes():
694 -    product = {
695 -        "id": "1",
696 -        "title": "dress",
697 -        "brief": "soft cotton",
698 -        "description": "summer dress",
699 -    }
700 -
701 -    original_key = product_enrich._make_analysis_cache_key(product, "zh", "taxonomy")
702 -
703 -    with mock.patch.object(
704 -        product_enrich,
705 -        "USER_INSTRUCTION_TEMPLATE",
706 -        "Please return JSON only. Language: {language}",
707 -    ):
708 -        changed_key = product_enrich._make_analysis_cache_key(product, "zh", "taxonomy")
709 -
710 -    assert original_key != changed_key
711 -  
712 -  
713 -def test_build_prompt_input_text_appends_brief_and_description_for_short_title():
714 -    product = {
715 -        "title": "T恤",
716 -        "brief": "夏季透气纯棉短袖,舒适亲肤",
717 -        "description": "100%棉,圆领版型,适合日常通勤与休闲穿搭。",
718 -    }
719 -
720 -    text = product_enrich._build_prompt_input_text(product)
721 -
722 -    assert text.startswith("T恤")
723 -    assert "夏季透气纯棉短袖" in text
724 -    assert "100%棉" in text
725 -  
726 -  
727 -def test_build_prompt_input_text_truncates_non_cjk_by_words():
728 -    product = {
729 -        "title": "dress",
730 -        "brief": " ".join(f"brief{i}" for i in range(50)),
731 -        "description": " ".join(f"desc{i}" for i in range(50)),
732 -    }
733 -
734 -    text = product_enrich._build_prompt_input_text(product)
735 -
736 -    assert len(text.split()) <= product_enrich.PROMPT_INPUT_MAX_WORDS