Commit a32754685e174673aa2918611e14a7eea7427887 (1 parent: 984f14f9)
Cleaned up the local `/indexer/enrich-content` implementation in this repository, and removed the main indexer pipeline's implicit dependencies on it.

The core code-side changes:

- Deleted `indexer/product_enrich.py`, `indexer/product_enrich_prompts.py`, and the related unit tests.
- Removed the `/indexer/enrich-content` route in [api/routes/indexer.py](/data/saas-search/api/routes/indexer.py:55); this path now returns `404` from the indexer service in this repository, and the corresponding contract test was updated to verify the removal: [tests/ci/test_service_api_contracts.py](/data/saas-search/tests/ci/test_service_api_contracts.py:345).
- Removed the local LLM enrichment logic that auto-filled `qanchors` / `enriched_*` during doc construction in [api/routes/indexer.py](/data/saas-search/api/routes/indexer.py:183), [indexer/document_transformer.py](/data/saas-search/indexer/document_transformer.py:109), [indexer/incremental_service.py](/data/saas-search/indexer/incremental_service.py:587), and [indexer/spu_transformer.py](/data/saas-search/indexer/spu_transformer.py:223). `build-docs` / `reindex` / `index` are now responsible for base document construction only.
- Cleared the legacy implementation's `product_enrich` and anchor-cache configuration surface from [config/schema.py](/data/saas-search/config/schema.py:316), [config/loader.py](/data/saas-search/config/loader.py:824), [config/env_config.py](/data/saas-search/config/env_config.py:37), and [config/config.yaml](/data/saas-search/config/config.yaml:32).

Key documentation was updated in sync, mainly to state explicitly that the capability has been migrated out and this repository no longer generates these fields:

- [README.md](/data/saas-search/README.md:113)
- [docs/搜索API对接指南-00-总览与快速开始.md](</data/saas-search/docs/搜索API对接指南-00-总览与快速开始.md:108>)
- [docs/搜索API对接指南-05-索引接口(Indexer).md](</data/saas-search/docs/搜索API对接指南-05-索引接口(Indexer).md:647>)
- [docs/搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md](</data/saas-search/docs/搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md:441>)
- [docs/工作总结-微服务性能优化与架构.md](</data/saas-search/docs/工作总结-微服务性能优化与架构.md:96>)
- [docs/缓存与Redis使用说明.md](</data/saas-search/docs/缓存与Redis使用说明.md:186>)
- [indexer/README.md](/data/saas-search/indexer/README.md:508)
- [indexer/ANCHORS_AND_SEMANTIC_ATTRIBUTES.md](/data/saas-search/indexer/ANCHORS_AND_SEMANTIC_ATTRIBUTES.md:1)

Verification was done in two steps:

- `python3 -m compileall ...` passed
- `source activate.sh && python -m pytest tests/ci/test_service_api_contracts.py -q` passed, with `31 passed`

What I consider still stale but have left untouched for now is mainly historical-record documentation, which is not part of the current integration contract:

- [docs/issues/issue.md](/data/saas-search/docs/issues/issue.md:295)
- [docs/issues/issue.txt](/data/saas-search/docs/issues/issue.txt:468)
- [docs/issues/issue-2026-03-29-索引修改-done-0330.md](</data/saas-search/docs/issues/issue-2026-03-29-索引修改-done-0330.md:23>)
- [docs/issues/issue-2026-04-04-增加多模态标注-TODO.md](</data/saas-search/docs/issues/issue-2026-04-04-增加多模态标注-TODO.md:1>)

In addition, the workspace already contained `.env` modifications and an untracked `AGENTS.md`; I did not touch them.
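The division of responsibilities after this change (base docs from `build-docs`, enriched fields from the external content-understanding service, merging done by the upstream indexer before writing to ES) can be sketched roughly as follows. The field names (`spu_id`, `qanchors`, `enriched_tags`, `enriched_attributes`, `enriched_taxonomy_attributes`) follow the ES mapping referenced in the docs; the merge helper itself is hypothetical and not part of this repository.

```python
from typing import Any, Dict, List

# Enriched fields that the external content-understanding service produces;
# after this commit they are no longer filled in by build-docs itself.
ENRICHED_FIELDS = (
    "qanchors",
    "enriched_tags",
    "enriched_attributes",
    "enriched_taxonomy_attributes",
)


def merge_enrichment(
    docs: List[Dict[str, Any]],
    enrich_results: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Hypothetical upstream merge step: combine base docs with enrichment
    results keyed by spu_id. Docs without a matching result pass through
    unchanged, matching the 'base doc first, enrich later' flow."""
    by_spu = {r["spu_id"]: r for r in enrich_results}
    merged: List[Dict[str, Any]] = []
    for doc in docs:
        out = dict(doc)  # avoid mutating the input doc
        enriched = by_spu.get(doc.get("spu_id"))
        if enriched is not None:
            for field in ENRICHED_FIELDS:
                if field in enriched:
                    out[field] = enriched[field]
        merged.append(out)
    return merged


# Example: one base doc from build-docs, one result from the external service.
docs = [{"spu_id": "223167", "title": "纯棉短袖T恤"}]
results = [
    {"spu_id": "223167", "qanchors": {"zh": ["短袖T恤"], "en": ["cotton t-shirt"]}}
]
merged = merge_enrichment(docs, results)
```

The merged docs can then be written to ES by the upstream indexer; which side owns retries and partial-failure handling is an integration decision outside this sketch.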
Showing 28 changed files with 82 additions and 4563 deletions.
| ... | ... | @@ -0,0 +1,17 @@ |
| 1 | +# FacetAwareMatching 协作记忆 | |
| 2 | + | |
| 3 | +## 开发原则 | |
| 4 | + | |
| 5 | +默认遵循以下错误处理原则: | |
| 6 | + | |
| 7 | +- 对于代码缺陷、逻辑疏漏、配置或资源缺失、违反统一约定等由自身原因导致的错误,应尽早暴露、快速失败,不做回退或容错处理,以保持代码精简、清晰、统一。 | |
| 8 | +- 对于线上超时、第三方接口异常等不可预见的外部错误,应提供必要的兜底、回退、重试或其他容错措施,以保证系统稳定性和业务连续性。 | |
| 9 | +- 进行功能迭代或重构时,默认直接面向最终方案和最优设计实现,不主动为历史实现、旧数据、过渡状态或遗留调用方式做兼容;优先推动代码回到统一约定和一致模型,避免长期并存的双轨逻辑、分支特判和临时过渡层。 | |
| 10 | + | |
| 11 | +## 落地要求 | |
| 12 | + | |
| 13 | +- 不要用静默吞错、默认值掩盖、隐式降级等方式隐藏内部问题。 | |
| 14 | +- 发现内部前置条件不满足时,应优先抛错、失败并暴露上下文。 | |
| 15 | +- 设计容错逻辑时,应明确区分“内部错误”和“外部错误”,避免把内部问题包装成可忽略事件。 | |
| 16 | +- 新设计一旦确定,应优先整体替换旧约定,而不是通过兼容旧行为来维持表面稳定。 | |
| 17 | +- 除非有明确、必要的外部兼容性约束,否则不要为内部历史包袱保留额外分支。 | ... | ... |
README.md
| ... | ... | @@ -110,7 +110,7 @@ source activate.sh |
| 110 | 110 | | `搜索API对接指南-01-搜索接口.md` | `POST /search/` 请求与响应 | |
| 111 | 111 | | `搜索API对接指南-02-搜索建议与即时搜索.md` | 建议 / 即时搜索 | |
| 112 | 112 | | `搜索API对接指南-03-获取文档.md` | `GET /search/{doc_id}` | |
| 113 | -| `搜索API对接指南-05-索引接口(Indexer).md` | 索引与 `build-docs` / `enrich-content` 等 | | |
| 113 | +| `搜索API对接指南-05-索引接口(Indexer).md` | 索引与 `build-docs` 等(`enrich-content` 已迁出) | | |
| 114 | 114 | | `搜索API对接指南-06-管理接口(Admin).md` | `/admin/*` | |
| 115 | 115 | | `搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md` | 6005/6006/6007/6008 等直连说明 | |
| 116 | 116 | | `搜索API对接指南-08-数据模型与字段速查.md` | 字段与数据模型 | | ... | ... |
api/routes/indexer.py
| ... | ... | @@ -7,7 +7,7 @@ |
| 7 | 7 | import asyncio |
| 8 | 8 | import re |
| 9 | 9 | from fastapi import APIRouter, HTTPException |
| 10 | -from typing import Any, Dict, List, Literal, Optional | |
| 10 | +from typing import Any, Dict, List, Optional | |
| 11 | 11 | from pydantic import BaseModel, Field |
| 12 | 12 | import logging |
| 13 | 13 | from sqlalchemy import text |
| ... | ... | @@ -19,11 +19,6 @@ logger = logging.getLogger(__name__) |
| 19 | 19 | |
| 20 | 20 | router = APIRouter(prefix="/indexer", tags=["indexer"]) |
| 21 | 21 | |
| 22 | -SUPPORTED_CATEGORY_TAXONOMY_PROFILES = ( | |
| 23 | - "apparel, 3c, bags, pet_supplies, electronics, outdoor, " | |
| 24 | - "home_appliances, home_living, wigs, beauty, accessories, toys, shoes, sports, others" | |
| 25 | -) | |
| 26 | - | |
| 27 | 22 | |
| 28 | 23 | class ReindexRequest(BaseModel): |
| 29 | 24 | """全量重建索引请求""" |
| ... | ... | @@ -64,6 +59,7 @@ class BuildDocsRequest(BaseModel): |
| 64 | 59 | 该接口是 Java 等外部索引程序正式使用的“doc 生成接口”: |
| 65 | 60 | - 上游负责:全量 / 增量调度 + 从 MySQL 查询出各表数据 |
| 66 | 61 | - 本模块负责:根据配置和算法,将原始行数据转换为与 mappings/search_products.json 一致的 ES 文档 |
| 62 | + - 注意:已迁出的 `/indexer/enrich-content` 内容理解能力不再由本接口内置生成 | |
| 67 | 63 | """ |
| 68 | 64 | tenant_id: str = Field(..., description="租户 ID,用于加载租户配置、语言策略等") |
| 69 | 65 | items: List[BuildDocItem] = Field(..., description="需要构建 doc 的 SPU 列表(含其 SKUs 和 Options)") |
| ... | ... | @@ -82,55 +78,6 @@ class BuildDocsFromDbRequest(BaseModel): |
| 82 | 78 | spu_ids: List[str] = Field(..., description="需要构建 doc 的 SPU ID 列表") |
| 83 | 79 | |
| 84 | 80 | |
| 85 | -class EnrichContentItem(BaseModel): | |
| 86 | - """单条待生成内容理解字段的商品。""" | |
| 87 | - spu_id: str = Field(..., description="SPU ID") | |
| 88 | - title: str = Field(..., description="商品标题,用于 LLM 分析生成 qanchors / enriched_tags 等") | |
| 89 | - image_url: Optional[str] = Field(None, description="商品主图 URL(预留给多模态/内容理解扩展)") | |
| 90 | - brief: Optional[str] = Field(None, description="商品简介/短描述") | |
| 91 | - description: Optional[str] = Field(None, description="商品详情/长描述") | |
| 92 | - | |
| 93 | - | |
| 94 | -class EnrichContentRequest(BaseModel): | |
| 95 | - """ | |
| 96 | - 内容理解字段生成请求:根据商品标题批量生成通用增强字段与品类 taxonomy 字段。 | |
| 97 | - 供外部 indexer 在自行组织 doc 时调用,与翻译、向量化等微服务并列。 | |
| 98 | - """ | |
| 99 | - tenant_id: str = Field(..., description="租户 ID,用于请求路由与结果归属,不参与缓存键") | |
| 100 | - items: List[EnrichContentItem] = Field(..., description="待分析的 SPU 列表(spu_id + title,可附带 brief/description/image_url)") | |
| 101 | - enrichment_scopes: Optional[List[Literal["generic", "category_taxonomy"]]] = Field( | |
| 102 | - default=None, | |
| 103 | - description=( | |
| 104 | - "要执行的增强范围。" | |
| 105 | - "`generic` 返回 qanchors/enriched_tags/enriched_attributes;" | |
| 106 | - "`category_taxonomy` 返回 enriched_taxonomy_attributes。" | |
| 107 | - "默认两者都执行。" | |
| 108 | - ), | |
| 109 | - ) | |
| 110 | - category_taxonomy_profile: str = Field( | |
| 111 | - "apparel", | |
| 112 | - description=( | |
| 113 | - "品类 taxonomy profile。默认 `apparel`。" | |
| 114 | - f"当前支持:{SUPPORTED_CATEGORY_TAXONOMY_PROFILES}。" | |
| 115 | - "其中除 `apparel` 外,其余 profile 的 taxonomy 输出仅返回 `en`。" | |
| 116 | - ), | |
| 117 | - ) | |
| 118 | - analysis_kinds: Optional[List[Literal["content", "taxonomy"]]] = Field( | |
| 119 | - default=None, | |
| 120 | - description="Deprecated alias of enrichment_scopes. `content` -> `generic`, `taxonomy` -> `category_taxonomy`.", | |
| 121 | - ) | |
| 122 | - | |
| 123 | - def resolved_enrichment_scopes(self) -> List[str]: | |
| 124 | - if self.enrichment_scopes: | |
| 125 | - return list(self.enrichment_scopes) | |
| 126 | - if self.analysis_kinds: | |
| 127 | - mapped = [] | |
| 128 | - for item in self.analysis_kinds: | |
| 129 | - mapped.append("generic" if item == "content" else "category_taxonomy") | |
| 130 | - return mapped | |
| 131 | - return ["generic", "category_taxonomy"] | |
| 132 | - | |
| 133 | - | |
| 134 | 81 | @router.post("/reindex") |
| 135 | 82 | async def reindex_all(request: ReindexRequest): |
| 136 | 83 | """ |
| ... | ... | @@ -239,8 +186,9 @@ async def build_docs(request: BuildDocsRequest): |
| 239 | 186 | |
| 240 | 187 | 使用场景: |
| 241 | 188 | - 上游(例如 Java 索引程序)已经从 MySQL 查询出了 SPU / SKU / Option 等原始行数据 |
| 242 | - - 希望复用本项目的全部“索引富化”能力(多语言、翻译、向量、规格聚合等) | |
| 189 | + - 希望复用本项目当前保留的“索引构建”能力(多语言、翻译、向量、规格聚合等) | |
| 243 | 190 | - 只需要拿到与 `mappings/search_products.json` 一致的 doc 列表,由上游自行写入 ES |
| 191 | + - 如需 `qanchors` / `enriched_attributes` / `enriched_taxonomy_attributes`,请由外部内容理解服务生成后再自行合并 | |
| 244 | 192 | """ |
| 245 | 193 | try: |
| 246 | 194 | if not request.items: |
| ... | ... | @@ -260,7 +208,6 @@ async def build_docs(request: BuildDocsRequest): |
| 260 | 208 | import pandas as pd |
| 261 | 209 | |
| 262 | 210 | docs: List[Dict[str, Any]] = [] |
| 263 | - doc_spu_rows: List[pd.Series] = [] | |
| 264 | 211 | failed: List[Dict[str, Any]] = [] |
| 265 | 212 | |
| 266 | 213 | for item in request.items: |
| ... | ... | @@ -276,7 +223,6 @@ async def build_docs(request: BuildDocsRequest): |
| 276 | 223 | spu_row=spu_row, |
| 277 | 224 | skus=skus_df, |
| 278 | 225 | options=options_df, |
| 279 | - fill_llm_attributes=False, | |
| 280 | 226 | ) |
| 281 | 227 | |
| 282 | 228 | if doc is None: |
| ... | ... | @@ -316,7 +262,6 @@ async def build_docs(request: BuildDocsRequest): |
| 316 | 262 | doc["title_embedding"] = emb0.tolist() |
| 317 | 263 | |
| 318 | 264 | docs.append(doc) |
| 319 | - doc_spu_rows.append(spu_row) | |
| 320 | 265 | except Exception as e: |
| 321 | 266 | failed.append( |
| 322 | 267 | { |
| ... | ... | @@ -325,13 +270,6 @@ async def build_docs(request: BuildDocsRequest): |
| 325 | 270 | } |
| 326 | 271 | ) |
| 327 | 272 | |
| 328 | - # 批量填充 LLM 字段(尽量攒批,每次最多 20 条;失败仅 warning,不影响 build-docs 主功能) | |
| 329 | - try: | |
| 330 | - if docs and doc_spu_rows: | |
| 331 | - transformer.fill_llm_attributes_batch(docs, doc_spu_rows) | |
| 332 | - except Exception as e: | |
| 333 | - logger.warning("Batch LLM fill failed in build-docs (tenant_id=%s): %s", request.tenant_id, e) | |
| 334 | - | |
| 335 | 273 | return { |
| 336 | 274 | "tenant_id": request.tenant_id, |
| 337 | 275 | "docs": docs, |
| ... | ... | @@ -476,101 +414,6 @@ async def build_docs_from_db(request: BuildDocsFromDbRequest): |
| 476 | 414 | raise HTTPException(status_code=500, detail=f"Internal server error: {str(e)}") |
| 477 | 415 | |
| 478 | 416 | |
| 479 | -def _run_enrich_content( | |
| 480 | - tenant_id: str, | |
| 481 | - items: List[Dict[str, str]], | |
| 482 | - enrichment_scopes: Optional[List[str]] = None, | |
| 483 | - category_taxonomy_profile: str = "apparel", | |
| 484 | -) -> List[Dict[str, Any]]: | |
| 485 | - """ | |
| 486 | - 同步执行内容理解,返回与 ES mapping 对齐的字段结构。 | |
| 487 | - 语言策略由 product_enrich 内部统一决定,路由层不参与。 | |
| 488 | - """ | |
| 489 | - from indexer.product_enrich import build_index_content_fields | |
| 490 | - | |
| 491 | - results = build_index_content_fields( | |
| 492 | - items=items, | |
| 493 | - tenant_id=tenant_id, | |
| 494 | - enrichment_scopes=enrichment_scopes, | |
| 495 | - category_taxonomy_profile=category_taxonomy_profile, | |
| 496 | - ) | |
| 497 | - return [ | |
| 498 | - { | |
| 499 | - "spu_id": item["id"], | |
| 500 | - "qanchors": item["qanchors"], | |
| 501 | - "enriched_attributes": item["enriched_attributes"], | |
| 502 | - "enriched_tags": item["enriched_tags"], | |
| 503 | - "enriched_taxonomy_attributes": item["enriched_taxonomy_attributes"], | |
| 504 | - **({"error": item["error"]} if item.get("error") else {}), | |
| 505 | - } | |
| 506 | - for item in results | |
| 507 | - ] | |
| 508 | - | |
| 509 | - | |
| 510 | -@router.post("/enrich-content") | |
| 511 | -async def enrich_content(request: EnrichContentRequest): | |
| 512 | - """ | |
| 513 | - 内容理解字段生成接口:根据商品标题批量生成通用增强字段与品类 taxonomy 字段。 | |
| 514 | - | |
| 515 | - 使用场景: | |
| 516 | - - 外部 indexer 采用「微服务组合」方式自己组织 doc 时,可调用本接口获取 LLM 生成的 | |
| 517 | - 锚文本与语义属性,再与翻译、向量化结果合并写入 ES。 | |
| 518 | - - 与 /indexer/build-docs 解耦,避免 build-docs 因 LLM 耗时过长而阻塞;调用方可 | |
| 519 | - 先拿不含 qanchors/enriched_tags/taxonomy attributes 的 doc,再异步或离线补齐本接口结果后更新 ES。 | |
| 520 | - | |
| 521 | - 实现逻辑与 indexer.product_enrich.build_index_content_fields 一致,支持多语言与 Redis 缓存。 | |
| 522 | - """ | |
| 523 | - try: | |
| 524 | - if not request.items: | |
| 525 | - raise HTTPException(status_code=400, detail="items cannot be empty") | |
| 526 | - if len(request.items) > 50: | |
| 527 | - raise HTTPException( | |
| 528 | - status_code=400, | |
| 529 | - detail="Maximum 50 items per request for enrich-content (LLM batch limit)", | |
| 530 | - ) | |
| 531 | - | |
| 532 | - items_payload = [ | |
| 533 | - { | |
| 534 | - "spu_id": it.spu_id, | |
| 535 | - "title": it.title or "", | |
| 536 | - "brief": it.brief or "", | |
| 537 | - "description": it.description or "", | |
| 538 | - "image_url": it.image_url or "", | |
| 539 | - } | |
| 540 | - for it in request.items | |
| 541 | - ] | |
| 542 | - loop = asyncio.get_event_loop() | |
| 543 | - enrichment_scopes = request.resolved_enrichment_scopes() | |
| 544 | - result = await loop.run_in_executor( | |
| 545 | - None, | |
| 546 | - lambda: _run_enrich_content( | |
| 547 | - tenant_id=request.tenant_id, | |
| 548 | - items=items_payload, | |
| 549 | - enrichment_scopes=enrichment_scopes, | |
| 550 | - category_taxonomy_profile=request.category_taxonomy_profile, | |
| 551 | - ), | |
| 552 | - ) | |
| 553 | - return { | |
| 554 | - "tenant_id": request.tenant_id, | |
| 555 | - "enrichment_scopes": enrichment_scopes, | |
| 556 | - "category_taxonomy_profile": request.category_taxonomy_profile, | |
| 557 | - "results": result, | |
| 558 | - "total": len(result), | |
| 559 | - } | |
| 560 | - except HTTPException: | |
| 561 | - raise | |
| 562 | - except RuntimeError as e: | |
| 563 | - if "DASHSCOPE_API_KEY" in str(e) or "cannot call LLM" in str(e).lower(): | |
| 564 | - raise HTTPException( | |
| 565 | - status_code=503, | |
| 566 | - detail="Content understanding service unavailable: DASHSCOPE_API_KEY not set", | |
| 567 | - ) | |
| 568 | - raise HTTPException(status_code=500, detail=str(e)) | |
| 569 | - except Exception as e: | |
| 570 | - logger.error(f"Error in enrich-content for tenant_id={request.tenant_id}: {e}", exc_info=True) | |
| 571 | - raise HTTPException(status_code=500, detail=f"Internal server error: {str(e)}") | |
| 572 | - | |
| 573 | - | |
| 574 | 417 | @router.post("/documents") |
| 575 | 418 | async def get_documents(request: GetDocumentsRequest): |
| 576 | 419 | """ | ... | ... |
config/config.yaml
| ... | ... | @@ -38,8 +38,6 @@ infrastructure: |
| 38 | 38 | retry_on_timeout: false |
| 39 | 39 | cache_expire_days: 720 |
| 40 | 40 | embedding_cache_prefix: embedding |
| 41 | - anchor_cache_prefix: product_anchors | |
| 42 | - anchor_cache_expire_days: 30 | |
| 43 | 41 | database: |
| 44 | 42 | host: null |
| 45 | 43 | port: 3306 |
| ... | ... | @@ -60,10 +58,6 @@ indexes: [] |
| 60 | 58 | assets: |
| 61 | 59 | query_rewrite_dictionary_path: config/dictionaries/query_rewrite.dict |
| 62 | 60 | |
| 63 | -# Product content understanding (LLM enrich-content) configuration | |
| 64 | -product_enrich: | |
| 65 | - max_workers: 40 | |
| 66 | - | |
| 67 | 61 | # 离线 / Web 相关性评估(scripts/evaluation、eval-web) |
| 68 | 62 | # CLI 未显式传参时使用此处默认值;search_base_url 未配置时自动为 http://127.0.0.1:{runtime.api_port} |
| 69 | 63 | search_evaluation: | ... | ... |
config/env_config.py
| ... | ... | @@ -46,8 +46,6 @@ def _redis_dict() -> Dict[str, Any]: |
| 46 | 46 | "retry_on_timeout": cfg.retry_on_timeout, |
| 47 | 47 | "cache_expire_days": cfg.cache_expire_days, |
| 48 | 48 | "embedding_cache_prefix": cfg.embedding_cache_prefix, |
| 49 | - "anchor_cache_prefix": cfg.anchor_cache_prefix, | |
| 50 | - "anchor_cache_expire_days": cfg.anchor_cache_expire_days, | |
| 51 | 49 | } |
| 52 | 50 | |
| 53 | 51 | ... | ... |
config/loader.py
| ... | ... | @@ -38,7 +38,6 @@ from config.schema import ( |
| 38 | 38 | IndexConfig, |
| 39 | 39 | InfrastructureConfig, |
| 40 | 40 | QueryConfig, |
| 41 | - ProductEnrichConfig, | |
| 42 | 41 | RedisSettings, |
| 43 | 42 | RerankConfig, |
| 44 | 43 | RerankFusionConfig, |
| ... | ... | @@ -260,10 +259,6 @@ class AppConfigLoader: |
| 260 | 259 | runtime_config = self._build_runtime_config() |
| 261 | 260 | infrastructure_config = self._build_infrastructure_config(runtime_config.environment) |
| 262 | 261 | |
| 263 | - product_enrich_raw = raw.get("product_enrich") if isinstance(raw.get("product_enrich"), dict) else {} | |
| 264 | - product_enrich_config = ProductEnrichConfig( | |
| 265 | - max_workers=int(product_enrich_raw.get("max_workers", 40)), | |
| 266 | - ) | |
| 267 | 262 | search_evaluation_config = self._build_search_evaluation_config(raw, runtime_config) |
| 268 | 263 | |
| 269 | 264 | metadata = ConfigMetadata( |
| ... | ... | @@ -275,7 +270,6 @@ class AppConfigLoader: |
| 275 | 270 | app_config = AppConfig( |
| 276 | 271 | runtime=runtime_config, |
| 277 | 272 | infrastructure=infrastructure_config, |
| 278 | - product_enrich=product_enrich_config, | |
| 279 | 273 | search=search_config, |
| 280 | 274 | services=services_config, |
| 281 | 275 | tenants=tenants_config, |
| ... | ... | @@ -288,7 +282,6 @@ class AppConfigLoader: |
| 288 | 282 | return AppConfig( |
| 289 | 283 | runtime=app_config.runtime, |
| 290 | 284 | infrastructure=app_config.infrastructure, |
| 291 | - product_enrich=app_config.product_enrich, | |
| 292 | 285 | search=app_config.search, |
| 293 | 286 | services=app_config.services, |
| 294 | 287 | tenants=app_config.tenants, |
| ... | ... | @@ -838,8 +831,6 @@ class AppConfigLoader: |
| 838 | 831 | retry_on_timeout=os.getenv("REDIS_RETRY_ON_TIMEOUT", "false").strip().lower() == "true", |
| 839 | 832 | cache_expire_days=int(os.getenv("REDIS_CACHE_EXPIRE_DAYS", 360 * 2)), |
| 840 | 833 | embedding_cache_prefix=os.getenv("REDIS_EMBEDDING_CACHE_PREFIX", "embedding"), |
| 841 | - anchor_cache_prefix=os.getenv("REDIS_ANCHOR_CACHE_PREFIX", "product_anchors"), | |
| 842 | - anchor_cache_expire_days=int(os.getenv("REDIS_ANCHOR_CACHE_EXPIRE_DAYS", 30)), | |
| 843 | 834 | ), |
| 844 | 835 | database=DatabaseSettings( |
| 845 | 836 | host=os.getenv("DB_HOST"), | ... | ... |
config/schema.py
| ... | ... | @@ -323,8 +323,6 @@ class RedisSettings: |
| 323 | 323 | retry_on_timeout: bool = False |
| 324 | 324 | cache_expire_days: int = 720 |
| 325 | 325 | embedding_cache_prefix: str = "embedding" |
| 326 | - anchor_cache_prefix: str = "product_anchors" | |
| 327 | - anchor_cache_expire_days: int = 30 | |
| 328 | 326 | |
| 329 | 327 | |
| 330 | 328 | @dataclass(frozen=True) |
| ... | ... | @@ -351,13 +349,6 @@ class InfrastructureConfig: |
| 351 | 349 | |
| 352 | 350 | |
| 353 | 351 | @dataclass(frozen=True) |
| 354 | -class ProductEnrichConfig: | |
| 355 | - """Configuration for LLM-based product content understanding (enrich-content).""" | |
| 356 | - | |
| 357 | - max_workers: int = 40 | |
| 358 | - | |
| 359 | - | |
| 360 | -@dataclass(frozen=True) | |
| 361 | 352 | class RuntimeConfig: |
| 362 | 353 | environment: str = "prod" |
| 363 | 354 | index_namespace: str = "" |
| ... | ... | @@ -430,7 +421,6 @@ class AppConfig: |
| 430 | 421 | |
| 431 | 422 | runtime: RuntimeConfig |
| 432 | 423 | infrastructure: InfrastructureConfig |
| 433 | - product_enrich: ProductEnrichConfig | |
| 434 | 424 | search: SearchConfig |
| 435 | 425 | services: ServicesConfig |
| 436 | 426 | tenants: TenantCatalogConfig | ... | ... |
docs/工作总结-微服务性能优化与架构.md
| ... | ... | @@ -93,19 +93,15 @@ instruction: "Given a shopping query, rank product titles by relevance" |
| 93 | 93 | |
| 94 | 94 | --- |
| 95 | 95 | |
| 96 | -### 5. 内容理解字段(支撑 Suggest) | |
| 96 | +### 5. 内容理解字段(已迁出) | |
| 97 | 97 | |
| 98 | -**能力**:支持根据商品标题批量生成 **qanchors**(锚文本)、**enriched_attributes**、**tags**,供索引与 suggest 使用。 | |
| 98 | +`qanchors`、`enriched_attributes`、`enriched_tags`、`enriched_taxonomy_attributes` 这些字段模型仍保留在索引结构里,`suggestion/builder.py` 等消费侧也仍可继续使用 ES 中已有的 `qanchors`。但字段生成服务与其本地实现已经迁移到独立项目,本仓库不再提供 `/indexer/enrich-content`,也不再在 indexer 构建链路内自动补齐这些字段。 | |
| 99 | 99 | |
| 100 | -**具体内容**: | |
| 101 | -- **接口**:`POST /indexer/enrich-content`(FacetAwareMatching 服务端口 **6001**)。请求体为 `items` 数组,每项含 `spu_id`、`title`(必填)及可选多语言标题等;单次请求最多 **50 条**,建议批量调用。响应 `results` 与 `items` 一一对应,每项含 `spu_id`、`qanchors`(按语言键,如 `qanchors.zh`、`qanchors.en`,逗号分隔短语)、`enriched_attributes`、`tags`。 | |
| 102 | -- **索引侧**:微服务组合方式下,调用方先拿不含 qanchors/tags 的 doc,再调用本接口补齐后写入 ES 的 `qanchors.{lang}` 等字段;索引 transformer(`indexer/document_transformer.py`、`indexer/product_enrich.py`)内也可在构建 doc 时调用内容理解逻辑,写入 `qanchors.{lang}`。 | |
| 103 | -- **Suggest 侧**:`suggestion/builder.py` 从 ES 商品索引读取 `_source: ["id", "spu_id", "title", "qanchors"]`,对 `qanchors.{lang}` 用 `_split_qanchors` 拆成词条,以 `source="qanchor"` 加入候选,排序时 `qanchor` 权重大于纯 title(`add_product("qanchor", ...)`);suggest 配置中 `sources: ["query_log", "qanchor"]` 表示候选来源包含 qanchor。 | |
| 104 | -- **实现与依赖**:内容理解内部使用大模型(需 `DASHSCOPE_API_KEY`),支持多语言与 Redis 缓存(如 `product_anchors`);逻辑与 `indexer/product_enrich` 一致。 | |
| 105 | - | |
| 106 | -**状态**:内容理解字段已接入索引与 suggest 链路;依赖内容理解(qanchors/tags)的**全量数据尚未全部完成一轮**,后续需持续跑满并校验效果。 | |
| 100 | +当前边界: | |
| 107 | 101 | |
| 108 | -详见:`indexer/ANCHORS_AND_SEMANTIC_ATTRIBUTES.md`、`docs/搜索API对接指南-05-索引接口(Indexer).md`(`enrich-content` 等)、`api/routes/indexer.py`(enrich-content 路由)。 | |
| 102 | +- 本仓库负责基础 doc 构建、多语言字段、向量、规格聚合等索引能力。 | |
| 103 | +- 独立内容理解服务负责生成 `qanchors` / `enriched_*`。 | |
| 104 | +- 上游索引程序负责把两侧结果合并后写入 ES。 | |
| 109 | 105 | |
| 110 | 106 | --- |
| 111 | 107 | |
| ... | ... | @@ -145,7 +141,7 @@ instruction: "Given a shopping query, rank product titles by relevance" |
| 145 | 141 | - **增量示例**:`./scripts/build_suggestions.sh 162 --mode incremental --overlap-minutes 30`(按 watermark 增量更新);脚本内部调用 `main.py build-suggestions --tenant-id <id> ...`。 |
| 146 | 142 | - 构建逻辑在 `suggestion/builder.py` 的 `SuggestionIndexBuilder`:从 ES 商品索引(含 `title`、`qanchors`)与查询日志等拉取数据,写入 versioned 建议索引并切换 alias。 |
| 147 | 143 | - **尚未完成的“增量机制”**:指**自动/事件驱动的增量**(如商品变更或日志写入时自动刷新建议索引);当前 incremental 模式为“按 watermark 再跑一次构建”,仍为脚本主动触发,非持续增量流水线。 |
| 148 | -- **依赖**:suggest 候选依赖商品侧 **内容理解字段**(qanchors/tags);`sources: ["query_log", "qanchor"]` 表示候选来自查询日志与 qanchor;当前内容理解未全量跑完一轮,suggest 数据会随全量重建逐步完善。 | |
| 144 | +- **依赖**:suggest 候选依赖商品侧 **内容理解字段**(qanchors/tags);`sources: ["query_log", "qanchor"]` 表示候选来自查询日志与 qanchor。字段生成职责已迁移到独立内容理解服务。 | |
| 149 | 145 | |
| 150 | 146 | 详见:`suggestion/builder.py`、`suggestion/ARCHITECTURE_V2.md`、`main.py`(build-suggestions 子命令)。 |
| 151 | 147 | |
| ... | ... | @@ -241,7 +237,7 @@ cd /data/saas-search |
| 241 | 237 | | **Embedding** | TEI 替代 SentenceTransformers/vLLM 作为文本向量后端,兼顾性能与工程化(Docker、配置化、T4 调优);图片向量由 clip-as-service 承担。 | |
| 242 | 238 | | **Reranker** | vLLM + Qwen3-Reranker-0.6B,针对 T4 做 float16、prefix caching、CUDA 图、按长度分批及 batch/长度参数搜索;高并发场景可选用 DashScope 云后端。 | |
| 243 | 239 | | **翻译** | 因 qwen-mt 限速(RPM≈60),迁移至可配置的 qwen-flash 等方案,支撑在线索引与 query;需金伟侧对索引做流量控制。 | |
| 244 | -| **内容理解** | 提供 qanchors/tags 等字段生成接口,支撑 suggest 与检索增强;全量一轮尚未完全跑满。 | | |
| 240 | +| **内容理解** | 字段模型仍可被检索与 suggest 消费,但生成服务已迁移到独立项目;本仓库不再内置该实现。 | | |
| 245 | 241 | | **架构** | Provider 动态选择翻译;service_ctl 统一监控与拉起;suggest 目前全量脚本触发,增量待做。 | |
| 246 | 242 | | **性能基线** | 向量化扩展性良好;reranker 为整链瓶颈(386 docs 约 0.6 rps);search 约 8 rps;suggest 约 200+ rps。 | |
| 247 | 243 | ... | ... |
docs/搜索API对接指南-00-总览与快速开始.md
| ... | ... | @@ -90,7 +90,6 @@ curl -X POST "http://43.166.252.75:6002/search/" \ |
| 90 | 90 | | 查询文档 | POST | `/indexer/documents` | 查询SPU文档数据(不写入ES) | |
| 91 | 91 | | 构建ES文档(正式对接) | POST | `/indexer/build-docs` | 基于上游提供的 MySQL 行数据构建 ES doc,不写入 ES,供 Java 等调用后自行写入 | |
| 92 | 92 | | 构建ES文档(测试用) | POST | `/indexer/build-docs-from-db` | 仅在测试/调试时使用,根据 `tenant_id + spu_ids` 内部查库并构建 ES doc | |
| 93 | -| 内容理解字段生成 | POST | `/indexer/enrich-content` | 根据商品标题批量生成 qanchors、enriched_attributes、tags,供微服务组合方式使用(独立服务端口 6001) | | |
| 94 | 93 | | 索引健康检查 | GET | `/indexer/health` | 检查索引服务状态 | |
| 95 | 94 | | 健康检查 | GET | `/admin/health` | 服务健康检查 | |
| 96 | 95 | | 获取配置 | GET | `/admin/config` | 获取租户配置 | |
| ... | ... | @@ -104,6 +103,8 @@ curl -X POST "http://43.166.252.75:6002/search/" \ |
| 104 | 103 | | 向量服务(图片) | 6008 | `POST /embed/image` | 图片向量化 | |
| 105 | 104 | | 翻译服务 | 6006 | `POST /translate` | 文本翻译(支持 qwen-mt / llm / deepl / 本地模型) | |
| 106 | 105 | | 重排服务 | 6007 | `POST /rerank` | 检索结果重排 | |
| 107 | -| 内容理解(独立服务) | 6001 | `POST /indexer/enrich-content` | 根据商品标题生成 qanchors、tags 等,供 indexer 微服务组合方式使用 | | |
| 106 | +--- | |
| 107 | + | |
| 108 | +> 注:`/indexer/enrich-content` 已迁移到独立项目,不再由本仓库的 Indexer 服务提供;本仓库保留 `build-docs` / `build-docs-from-db` 等索引构建接口。 | |
| 108 | 109 | |
| 109 | 110 | --- | ... | ... |
docs/搜索API对接指南-05-索引接口(Indexer).md
| ... | ... | @@ -13,7 +13,6 @@ |
| 13 | 13 | | 查询文档 | POST | `/indexer/documents` | 按 SPU ID 列表查询 ES 文档,不写入 ES | |
| 14 | 14 | | 构建 ES 文档(正式) | POST | `/indexer/build-docs` | 由上游提供 MySQL 行数据,返回 ES-ready 文档,不写 ES | |
| 15 | 15 | | 构建 ES 文档(测试) | POST | `/indexer/build-docs-from-db` | 由本服务查库并构建文档,仅测试/调试用 | |
| 16 | -| 内容理解字段生成 | POST | `/indexer/enrich-content` | 根据商品标题批量生成 qanchors、enriched_attributes、tags(供微服务组合方式使用;独立服务端口 6001) | | |
| 17 | 16 | | 索引健康检查 | GET | `/indexer/health` | 检查索引服务与数据库连接状态 | |
| 18 | 17 | |
| 19 | 18 | #### 5.0 支撑外部 indexer 的三种方式 |
| ... | ... | @@ -22,8 +21,8 @@ |
| 22 | 21 | |
| 23 | 22 | | 方式 | 说明 | 适用场景 | |
| 24 | 23 | |------|------|----------| |
| 25 | -| **1)doc 填充接口** | 调用 `POST /indexer/build-docs` 或 `POST /indexer/build-docs-from-db`,由本服务基于 MySQL 行数据构建完整 ES 文档(含多语言、向量、规格等),**不写入 ES**,由调用方自行写入。 | 希望一站式拿到 ES-ready doc,由己方控制写 ES 的时机与索引名。 | | |
| 26 | -| **2)微服务组合** | 单独调用**翻译**、**向量化**、**内容理解字段生成**等接口,由 indexer 程序自己组装 doc 并写入 ES。翻译与向量化为独立微服务(见第 7 节);内容理解为 FacetAwareMatching 独立服务接口 `POST /indexer/enrich-content`(端口 6001)。 | 需要灵活编排、或希望将 LLM/向量等耗时步骤与主链路解耦(如异步补齐 qanchors/tags)。 | | |
| 24 | +| **1)doc 填充接口** | 调用 `POST /indexer/build-docs` 或 `POST /indexer/build-docs-from-db`,由本服务基于 MySQL 行数据构建 ES 文档(含多语言、向量、规格等),**不写入 ES**,由调用方自行写入。 | 希望一站式拿到 ES-ready doc,由己方控制写 ES 的时机与索引名。 | | |
| 25 | +| **2)微服务组合** | 单独调用**翻译**、**向量化**、**外部内容理解服务**等能力,由 indexer 程序自己组装 doc 并写入 ES。翻译与向量化见第 7 节;内容理解字段生成已迁移到独立项目,不再由本仓库维护。 | 需要灵活编排、或希望将 LLM/向量等耗时步骤与主链路解耦(如异步补齐 qanchors/tags)。 | | |
| 27 | 26 | | **3)本服务直接写 ES** | 调用全量索引 `POST /indexer/reindex`、增量索引 `POST /indexer/index`(指定 SPU ID 列表),由本服务从 MySQL 拉数并直接写入 ES。 | 自建运维、联调或不需要由 Java 写 ES 的场景。 | |
| 28 | 27 | |
| 29 | 28 | - **方式 1** 与 **方式 2** 下,ES 的写入方均为外部 indexer(或 Java),职责清晰。 |
| ... | ... | @@ -645,174 +644,20 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \ |
| 645 | 644 | |
| 646 | 645 | 返回结构与 `/indexer/build-docs` 相同,可直接用于对比 ES 实际文档或调试字段映射问题。 |
| 647 | 646 | |
| 648 | -### 5.8 内容理解字段生成接口 | |
| 649 | - | |
| 650 | -- **端点**: `POST /indexer/enrich-content` | |
| 651 | -- **服务**: FacetAwareMatching 独立服务(默认端口 **6001**;由 `/data/FacetAwareMatching/scripts/service_ctl.sh` 管理) | |
| 652 | -- **描述**: 根据商品内容信息批量生成 **qanchors**(锚文本)、**enriched_attributes**(通用语义属性)、**enriched_tags**(细分标签)、**enriched_taxonomy_attributes**(taxonomy 结构化属性),供外部 indexer 在「微服务组合」方式下自行拼装 doc 时使用。请求以 `items[]` 传入商品内容字段(必填/可选见下表)。接口只暴露商品内容输入,语言选择、分析维度与最终字段结构统一由 FacetAwareMatching 的 `product_enrich` 内部决定;当前返回结果与 `search_products` mapping 保持一致。单次请求在线程池中执行,避免阻塞其他接口。 | |
| 653 | - | |
| 654 | -当前支持的 `category_taxonomy_profile`: | |
| 655 | -- `apparel` | |
| 656 | -- `3c` | |
| 657 | -- `bags` | |
| 658 | -- `pet_supplies` | |
| 659 | -- `electronics` | |
| 660 | -- `outdoor` | |
| 661 | -- `home_appliances` | |
| 662 | -- `home_living` | |
| 663 | -- `wigs` | |
| 664 | -- `beauty` | |
| 665 | -- `accessories` | |
| 666 | -- `toys` | |
| 667 | -- `shoes` | |
| 668 | -- `sports` | |
| 669 | -- `others` | |
| 670 | - | |
| 671 | -说明: | |
| 672 | -- 所有 profile 的 `enriched_taxonomy_attributes.value` 都统一返回 `zh` + `en`。 | |
| 673 | -- 外部调用 `/indexer/enrich-content` 时,以请求中的 `category_taxonomy_profile` 为准。 | |
| 674 | -- 若 indexer 内部仍接入内容理解能力,taxonomy profile 请在调用侧显式传入(建议仍以租户行业配置为准)。 | |
| 647 | +### 5.8 内容理解字段生成能力(已迁出) | |
| 675 | 648 | |
| 676 | -#### 请求参数 | |
| 677 | - | |
| 678 | -```json | |
| 679 | -{ | |
| 680 | - "tenant_id": "170", | |
| 681 | - "enrichment_scopes": ["generic", "category_taxonomy"], | |
| 682 | - "category_taxonomy_profile": "apparel", | |
| 683 | - "items": [ | |
| 684 | - { | |
| 685 | - "spu_id": "223167", | |
| 686 | - "title": "纯棉短袖T恤 夏季男装", | |
| 687 | - "brief": "夏季透气纯棉短袖,舒适亲肤", | |
| 688 | - "description": "100%棉,圆领版型,适合日常通勤与休闲穿搭。", | |
| 689 | - "image_url": "https://example.com/images/223167.jpg" | |
| 690 | - }, | |
| 691 | - { | |
| 692 | - "spu_id": "223168", | |
| 693 | - "title": "12PCS Dolls with Bottles", | |
| 694 | - "image_url": "https://example.com/images/223168.jpg" | |
| 695 | - } | |
| 696 | - ] | |
| 697 | -} | |
| 698 | -``` | |
| 649 | +`/indexer/enrich-content` 已迁移到独立项目,本仓库当前的 Indexer 服务(默认端口 `6004`)**不再暴露该接口**,也**不再在** `/indexer/build-docs`、`/indexer/build-docs-from-db`、`/indexer/reindex`、`/indexer/index` 的构建链路里内置生成 `qanchors`、`enriched_attributes`、`enriched_tags`、`enriched_taxonomy_attributes`。 | |
| 699 | 650 | |
| 700 | -| 参数 | 类型 | 必填 | 默认值 | 说明 | | |
| 701 | -|------|------|------|--------|------| | |
| 702 | -| `tenant_id` | string | Y | - | 租户 ID。目前仅用于记录日志,不产生实际作用| | |
| 703 | -| `enrichment_scopes` | array[string] | N | `["generic", "category_taxonomy"]` | 选择要执行的增强范围。`generic` 生成 `qanchors`/`enriched_tags`/`enriched_attributes`,`category_taxonomy` 生成 `enriched_taxonomy_attributes` | | |
| 704 | -| `category_taxonomy_profile` | string | N | `apparel` | 品类 taxonomy profile。支持:`apparel`、`3c`、`bags`、`pet_supplies`、`electronics`、`outdoor`、`home_appliances`、`home_living`、`wigs`、`beauty`、`accessories`、`toys`、`shoes`、`sports`、`others` | | |
| 705 | -| `items` | array | Y | - | 待分析列表;**单次最多 50 条** | | |
| 706 | - | |
| 707 | -`items[]` 字段说明: | |
| 708 | - | |
| 709 | -| 字段 | 类型 | 必填 | 说明 | | |
| 710 | -|------|------|------|------| | |
| 711 | -| `spu_id` | string | Y | SPU ID,用于回填结果;目前仅用于记录日志,不产生实际作用| | |
| 712 | -| `title` | string | Y | 商品标题 | | |
| 713 | -| `image_url` | string | N | 商品主图 URL;当前仅透传,暂未参与 prompt 与缓存键,后续可用于图像/多模态内容理解 | | |
| 714 | -| `brief` | string | N | 商品简介/短描述;当前会参与 prompt 与缓存键 | | |
| 715 | -| `description` | string | N | 商品详情/长描述;当前会参与 prompt 与缓存键 | | |
| 716 | - | |
| 717 | -缓存说明: | |
| 718 | - | |
| 719 | -- 内容缓存按 **增强范围 + taxonomy profile** 拆分;`generic` 与 `category_taxonomy:apparel` 等使用不同缓存命名空间,互不污染、可独立演进。 | |
| 720 | -- 缓存键由 `analysis_kind + target_lang + prompt/schema 版本指纹 + prompt 输入文本 hash` 构成;对 category taxonomy 来说,profile 会进入 schema 标识与版本指纹。 | |
| 721 | -- 当前真正参与 prompt 输入的字段是:`title`、`brief`、`description`;这些字段任一变化,都会落到新的缓存 key。 | |
| 722 | -- `prompt/schema 版本指纹` 会综合 system prompt、shared instruction、localized table headers、result fields、user instruction template 等信息生成;因此只要提示词或输出契约变化,旧缓存会自然失效。 | |
| 723 | -- `tenant_id`、`spu_id` 只用于请求归属与结果回填,不参与缓存键。 | |
| 724 | -- 因此,输入内容与 prompt 契约都不变时可跨请求直接命中缓存;任一一侧变化,都会自然落到新的缓存 key。 | |
| 651 | +当前建议的对接方式: | |
| 725 | 652 | |
| 726 | -语言说明: | |
| 653 | +1. 调用本仓库的 `POST /indexer/build-docs` 或 `POST /indexer/build-docs-from-db` 生成基础 ES 文档。 | |
| 654 | +2. 调用独立内容理解服务生成 `qanchors` / `enriched_*` 字段。 | |
| 655 | +3. 由上游索引程序自行合并字段后写入 ES。 | |
| 727 | 656 | |
| 728 | -- 接口不接受语言控制参数。 | |
| 729 | -- 返回哪些语言、返回哪些语义维度,统一由 `indexer.product_enrich` 内部逻辑决定。 | |
| 730 | -- 当前为了与 `search_products` mapping 对齐,通用增强字段与 taxonomy 字段都统一只返回核心索引语言 `zh`、`en`。 | |
| 657 | +补充说明: | |
| 731 | 658 | |
| 732 | -批量请求建议: | |
| 733 | -- **全量**:强烈建议 尽可能 **20 个 SPU/doc** 攒成一个批次后再请求一次。 | |
| 734 | -- **增量**:可按时效要求设置时间窗口(例如 **5 分钟**),在窗口内尽可能攒到 **20 个**;达到 20 或窗口到期就发送一次请求。 | |
| 735 | -- 允许超过20,服务内部会拆分成小批次逐个处理。也允许小于20,但是将造成费用和耗时的成本上升,特别是每次请求一个doc的情况。 | |
| 736 | - | |
| 737 | -#### 响应格式 | |
| 738 | - | |
| 739 | -```json | |
| 740 | -{ | |
| 741 | - "tenant_id": "170", | |
| 742 | - "enrichment_scopes": ["generic", "category_taxonomy"], | |
| 743 | - "category_taxonomy_profile": "apparel", | |
| 744 | - "total": 2, | |
| 745 | - "results": [ | |
| 746 | - { | |
| 747 | - "spu_id": "223167", | |
| 748 | - "qanchors": { | |
| 749 | - "zh": ["短袖T恤", "纯棉", "男装", "夏季"], | |
| 750 | - "en": ["cotton t-shirt", "short sleeve", "men", "summer"] | |
| 751 | - }, | |
| 752 | - "enriched_tags": { | |
| 753 | - "zh": ["纯棉", "短袖", "男装"], | |
| 754 | - "en": ["cotton", "short sleeve", "men"] | |
| 755 | - }, | |
| 756 | - "enriched_attributes": [ | |
| 757 | - { "name": "enriched_tags", "value": { "zh": "纯棉" } }, | |
| 758 | - { "name": "usage_scene", "value": { "zh": "日常" } }, | |
| 759 | - { "name": "enriched_tags", "value": { "en": "cotton" } } | |
| 760 | - ], | |
| 761 | - "enriched_taxonomy_attributes": [ | |
| 762 | - { "name": "Product Type", "value": { "zh": ["T恤"], "en": ["t-shirt"] } }, | |
| 763 | - { "name": "Target Gender", "value": { "zh": ["男"], "en": ["men"] } }, | |
| 764 | - { "name": "Season", "value": { "zh": ["夏季"], "en": ["summer"] } } | |
| 765 | - ] | |
| 766 | - }, | |
| 767 | - { | |
| 768 | - "spu_id": "223168", | |
| 769 | - "qanchors": { | |
| 770 | - "en": ["dolls", "toys", "12pcs"] | |
| 771 | - }, | |
| 772 | - "enriched_tags": { | |
| 773 | - "en": ["dolls", "toys"] | |
| 774 | - }, | |
| 775 | - "enriched_attributes": [], | |
| 776 | - "enriched_taxonomy_attributes": [] | |
| 777 | - } | |
| 778 | - ] | |
| 779 | -} | |
| 780 | -``` | |
| 781 | - | |
| 782 | -| 字段 | 类型 | 说明 | | |
| 783 | -|------|------|------| | |
| 784 | -| `enrichment_scopes` | array | 实际执行的增强范围列表 | | |
| 785 | -| `category_taxonomy_profile` | string | 实际使用的品类 taxonomy profile | | |
| 786 | -| `results` | array | 与请求 `items` 一一对应,每项含 `spu_id`、`qanchors`、`enriched_attributes`、`enriched_tags`、`enriched_taxonomy_attributes` | | |
| 787 | -| `results[].qanchors` | object | 与 ES `qanchors` 字段同结构,按语言键返回短语数组 | | |
| 788 | -| `results[].enriched_tags` | object | 与 ES `enriched_tags` 字段同结构,按语言键返回标签数组 | | |
| 789 | -| `results[].enriched_attributes` | array | 与 ES `enriched_attributes` nested 字段同结构,每项为 `{ "name", "value": { "zh"?: "...", "en"?: "..." } }` | | |
| 790 | -| `results[].enriched_taxonomy_attributes` | array | 与 ES `enriched_taxonomy_attributes` nested 字段同结构。每项通常为 `{ "name", "value": { "zh"?: [...], "en"?: [...] } }` | | |
| 791 | -| `results[].error` | string | 若该条处理失败(如 LLM 异常),会在此字段返回错误信息 | | |
| 792 | - | |
| 793 | -**错误响应**: | |
| 794 | -- `400`: `items` 为空或超过 50 条 | |
| 795 | -- `503`: 未配置 `DASHSCOPE_API_KEY`,内容理解服务不可用 | |
| 796 | - | |
| 797 | -#### 请求示例 | |
| 798 | - | |
| 799 | -```bash | |
| 800 | -curl -X POST "http://localhost:6001/indexer/enrich-content" \ | |
| 801 | - -H "Content-Type: application/json" \ | |
| 802 | - -d '{ | |
| 803 | - "tenant_id": "163", | |
| 804 | - "enrichment_scopes": ["generic", "category_taxonomy"], | |
| 805 | - "category_taxonomy_profile": "apparel", | |
| 806 | - "items": [ | |
| 807 | - { | |
| 808 | - "spu_id": "223167", | |
| 809 | - "title": "纯棉短袖T恤 夏季男装夏季男装", | |
| 810 | - "brief": "夏季透气纯棉短袖,舒适亲肤", | |
| 811 | - "description": "100%棉,圆领版型,适合日常通勤与休闲穿搭。", | |
| 812 | - "image_url": "https://example.com/images/223167.jpg" | |
| 813 | - } | |
| 814 | - ] | |
| 815 | - }' | |
| 816 | -``` | |
| 659 | +- `search_products` mapping 仍保留上述字段,便于独立内容理解服务继续产出并写入。 | |
| 660 | +- `suggestion` 等消费侧仍可读取 ES 中已有的 `qanchors` 字段;迁移的是“生成实现”,不是字段模型本身。 | |
| 661 | +- 本文档不再维护独立内容理解服务的请求/响应细节,请以对应独立项目的文档为准。 | |
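作为补充:查询侧若要基于保留的 nested 字段做过滤,可以按下面的方式构造子句。字段路径按上文 `enriched_attributes` 的 `{ "name", "value": { "zh"?, "en"? } }` 结构推断,仅为示意,实际以 mapping 定义为准:

```python
from typing import Any, Dict


def enriched_attr_filter(name: str, lang: str, value: str) -> Dict[str, Any]:
    """构造 enriched_attributes 上的 nested term 过滤子句(标准 ES bool/nested DSL)。"""
    return {
        "nested": {
            "path": "enriched_attributes",
            "query": {
                "bool": {
                    "must": [
                        {"term": {"enriched_attributes.name": name}},
                        {"term": {f"enriched_attributes.value.{lang}": value}},
                    ]
                }
            },
        }
    }
```

多个维度组合时,把多个这样的子句放进外层 `bool.filter` 即可。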
| 817 | 662 | |
| 818 | 663 | --- | ... | ... |
docs/搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md
| 1 | 1 | # 搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation) |
| 2 | 2 | |
| 3 | -本篇覆盖向量服务(Embedding)、重排服务(Reranker)、翻译服务(Translation)以及 Indexer 服务内的内容理解字段生成(原文第 7 章)。 | |
| 3 | +本篇覆盖向量服务(Embedding)、重排服务(Reranker)与翻译服务(Translation)。原先收录的 `/indexer/enrich-content` 内容理解接口已迁移到独立项目,不再由本仓库维护。 | |
| 4 | 4 | |
| 5 | 5 | ## 7. 微服务接口(向量、重排、翻译) |
| 6 | 6 | |
| ... | ... | @@ -438,14 +438,8 @@ curl "http://localhost:6006/health" |
| 438 | 438 | } |
| 439 | 439 | ``` |
| 440 | 440 | |
| 441 | -### 7.4 内容理解字段生成(Indexer 服务内) | |
| 441 | +### 7.4 内容理解字段生成(已迁出) | |
| 442 | 442 | |
| 443 | -内容理解字段生成接口部署在 **Indexer 服务**(默认端口 6004)内,与「翻译、向量化」等独立端口微服务并列,供采用**微服务组合**方式的 indexer 调用。 | |
| 444 | - | |
| 445 | -- **Base URL**: Indexer 服务地址,如 `http://localhost:6004` | |
| 446 | -- **路径**: `POST /indexer/enrich-content` | |
| 447 | -- **说明**: 根据商品标题批量生成 `qanchors`、`enriched_attributes`、`enriched_tags`、`enriched_taxonomy_attributes`,用于拼装 ES 文档。支持通过 `enrichment_scopes` 选择执行 `generic` / `category_taxonomy`,并通过 `category_taxonomy_profile` 选择对应大类的 taxonomy prompt/profile;默认执行 `generic + category_taxonomy(apparel)`。当前支持的 taxonomy profile 包括 `apparel`、`3c`、`bags`、`pet_supplies`、`electronics`、`outdoor`、`home_appliances`、`home_living`、`wigs`、`beauty`、`accessories`、`toys`、`shoes`、`sports`、`others`。所有 profile 的 taxonomy 输出都统一返回 `zh` + `en`,`category_taxonomy_profile` 只决定字段集合。内部使用大模型(需配置 `DASHSCOPE_API_KEY`),支持多语言与 Redis 缓存;单次最多 50 条,建议批量调用以提升效率。 | |
| 448 | - | |
| 449 | -请求/响应格式、示例及错误码见 [-05-索引接口(Indexer)](./搜索API对接指南-05-索引接口(Indexer).md#58-内容理解字段生成接口)。 | |
| 443 | +`/indexer/enrich-content` 已迁移到独立项目,不再属于本仓库的微服务接口集合。当前仓库中的 Indexer 服务(`6004`)不再提供该接口;如需 `qanchors` / `enriched_*` 字段,请接入对应独立服务,并与本仓库的 `build-docs` 输出在上游侧自行合并。 | |
| 450 | 444 | |
| 451 | 445 | --- | ... | ... |
docs/缓存与Redis使用说明.md
| ... | ... | @@ -4,7 +4,6 @@ |
| 4 | 4 | |
| 5 | 5 | - **文本向量缓存**(embedding 缓存) |
| 6 | 6 | - **翻译结果缓存**(Qwen-MT 等机器翻译) |
| 7 | -- **商品内容理解缓存**(锚文本 / 语义属性 / 标签) | |
| 8 | 7 | |
| 9 | 8 | 底层连接配置统一来自 `config/env_config.py` 的 `REDIS_CONFIG`: |
| 10 | 9 | |
| ... | ... | @@ -21,8 +20,6 @@ |
| 21 | 20 | |------------|----------|----------------|----------|------| |
| 22 | 21 | | 向量缓存(text/image embedding) | 文本:`{EMBEDDING_CACHE_PREFIX}:embed:norm{0|1}:{text}`;图片:`{EMBEDDING_CACHE_PREFIX}:image:embed:norm{0|1}:{url_or_path}` | **BF16 bytes**(每维 2 字节大端存储),读取后恢复为 `np.float32` | TTL=`REDIS_CONFIG["cache_expire_days"]` 天;访问时滑动过期 | 见 `embeddings/text_encoder.py`、`embeddings/image_encoder.py`、`embeddings/server.py`;前缀由 `REDIS_CONFIG["embedding_cache_prefix"]` 控制 | |
| 23 | 22 | | 翻译结果缓存(translator service) | `trans:{model}:{target_lang}:{source_text[:4]}{sha256(source_text)}` | 机翻后的单条字符串 | TTL=`services.translation.cache.ttl_seconds` 秒;可配置滑动过期 | 见 `translation/service.py` + `config/config.yaml` | |
| 24 | -| 商品内容理解缓存(anchors / 语义属性 / tags) | `{ANCHOR_CACHE_PREFIX}:{tenant_or_global}:{target_lang}:{md5(title)}` | `json.dumps(dict)`,包含 id/title/category/tags/anchor_text 等 | TTL=`ANCHOR_CACHE_EXPIRE_DAYS` 天 | 见 `indexer/product_enrich.py` | | |
| 25 | - | |
| 26 | 23 | 下面按模块详细说明。 |
| 27 | 24 | |
| 28 | 25 | --- |
| ... | ... | @@ -186,69 +183,9 @@ services: |
| 186 | 183 | |
| 187 | 184 | --- |
| 188 | 185 | |
| 189 | -## 4. 商品内容理解缓存(indexer/product_enrich.py) | |
| 190 | - | |
| 191 | -- **代码位置**:`indexer/product_enrich.py` | |
| 192 | -- **用途**:在生成商品锚文本(qanchors)、语义属性、标签等内容理解结果时复用缓存,避免对同一标题重复调用大模型。 | |
| 193 | - | |
| 194 | -### 4.1 Key 设计 | |
| 195 | - | |
| 196 | -- 配置项: | |
| 197 | - - `ANCHOR_CACHE_PREFIX = REDIS_CONFIG.get("anchor_cache_prefix", "product_anchors")` | |
| 198 | - - `ANCHOR_CACHE_EXPIRE_DAYS = int(REDIS_CONFIG.get("anchor_cache_expire_days", 30))` | |
| 199 | -- Key 构造函数:`_make_analysis_cache_key(product, target_lang, analysis_kind)` | |
| 200 | -- 模板: | |
| 201 | - | |
| 202 | -```text | |
| 203 | -{ANCHOR_CACHE_PREFIX}:{analysis_kind}:{prompt_contract_hash}:{target_lang}:{prompt_input_prefix}{md5(prompt_input)} | |
| 204 | -``` | |
| 205 | - | |
| 206 | -- 字段说明: | |
| 207 | - - `ANCHOR_CACHE_PREFIX`:默认 `"product_anchors"`,可通过 `.env` 中的 `REDIS_ANCHOR_CACHE_PREFIX`(若存在)间接配置到 `REDIS_CONFIG`; | |
| 208 | - - `analysis_kind`:分析族,目前至少包括 `content` 与 `taxonomy`,两者缓存隔离; | |
| 209 | - - `prompt_contract_hash`:基于 system prompt、shared instruction、localized headers、result fields、user instruction template、schema cache version 等生成的短 hash;只要提示词或输出契约变化,缓存会自动失效; | |
| 210 | - - `target_lang`:内容理解输出语言,例如 `zh`; | |
| 211 | - - `prompt_input_prefix + md5(prompt_input)`:对真正送入 prompt 的商品文本做前缀 + MD5;当前 prompt 输入来自 `title`、`brief`、`description` 的规范化拼接结果。 | |
| 212 | - | |
| 213 | -设计原则: | |
| 214 | - | |
| 215 | -- 只让**实际影响 LLM 输出**的输入参与 key; | |
| 216 | -- 不让 `tenant_id`、`spu_id` 这类“结果归属信息”污染缓存; | |
| 217 | -- prompt 或 schema 变更时,不依赖人工清理 Redis,也能自然切换到新 key。 | |
| 218 | - | |
| 219 | -### 4.2 Value 与类型 | |
| 220 | - | |
| 221 | -- 类型:`json.dumps(dict, ensure_ascii=False)`。 | |
| 222 | -- 典型结构(简化): | |
| 223 | - | |
| 224 | -```json | |
| 225 | -{ | |
| 226 | - "id": "123", | |
| 227 | - "lang": "zh", | |
| 228 | - "title_input": "原始标题", | |
| 229 | - "title": "归一化后的商品标题", | |
| 230 | - "category_path": "...", | |
| 231 | - "tags": "...", | |
| 232 | - "target_audience": "...", | |
| 233 | - "usage_scene": "...", | |
| 234 | - "anchor_text": "..., ..." | |
| 235 | -} | |
| 236 | -``` | |
| 237 | - | |
| 238 | -- 读取时通过 `json.loads(raw)` 还原为 `Dict[str, Any]`。 | |
| 239 | -- `content` 与 `taxonomy` 的 value 结构会随各自 schema 不同而不同,但都会先通过统一的 normalize 逻辑再写缓存。 | |
| 240 | - | |
| 241 | -### 4.3 过期策略 | |
| 242 | - | |
| 243 | -- TTL:`ttl = ANCHOR_CACHE_EXPIRE_DAYS * 24 * 3600` 秒(默认 30 天); | |
| 244 | -- 写入:`redis.setex(key, ttl, json.dumps(result, ensure_ascii=False))`; | |
| 245 | -- 读取:仅做 `redis.get(key)`,**不做滑动过期**。 | |
| 246 | - | |
| 247 | -### 4.4 调用流程中的位置 | |
| 186 | +## 4. 商品内容理解缓存(已迁出) | |
| 248 | 187 | |
| 249 | -- 单条调用(索引阶段常见)时,`analyze_products()` 会先尝试命中缓存: | |
| 250 | - - 若命中,直接返回缓存结果; | |
| 251 | - - 若 miss,调用 LLM,解析结果后再写入缓存。 | |
| 188 | +本仓库原先存在一套用于 `qanchors` / `enriched_*` 生成的 Redis 缓存实现,但对应内容理解服务已经迁移到独立项目,当前仓库代码中不再读写这类缓存,也不再把它作为运行时能力的一部分维护。 | |
| 252 | 189 | |
| 253 | 190 | --- |
| 254 | 191 | |
| ... | ... | @@ -258,24 +195,24 @@ services: |
| 258 | 195 | |
| 259 | 196 | ### 5.1 redis_cache_health_check.py(缓存健康巡检) |
| 260 | 197 | |
| 261 | -**功能**:按**业务缓存类型**(embedding / translation / anchors)做健康巡检,不扫全库。 | |
| 198 | +**功能**:按**业务缓存类型**(embedding / translation)做健康巡检,不扫全库。 | |
| 262 | 199 | |
| 263 | 200 | - 对每类缓存:SCAN 匹配对应 key 前缀,统计**匹配 key 数量**(受 `--max-scan` 上限约束); |
| 264 | 201 | - **TTL 分布**:对采样 key 统计 `no-expire-or-expired` / `0-1h` / `1h-1d` / `1d-30d` / `>30d`; |
| 265 | 202 | - **近期活跃 key**:从采样中选出 `OBJECT IDLETIME <= 600s` 的 key,用于判断是否有新写入; |
| 266 | -- **样本 key 与 value 预览**:对 embedding 显示 ndarray 信息,对 translation 显示译文片段,对 anchors 显示 JSON 摘要。 | |
| 203 | +- **样本 key 与 value 预览**:对 embedding 显示 ndarray 信息,对 translation 显示译文片段。 | |
| 267 | 204 | |
| 268 | -**适用场景**:日常查看三类缓存是否在增长、TTL 是否合理、是否有近期写入;与「缓存总览表」中的 key 设计一一对应。 | |
| 205 | +**适用场景**:日常查看两类缓存是否在增长、TTL 是否合理、是否有近期写入;与「缓存总览表」中的 key 设计一一对应。 | |
| 269 | 206 | |
| 270 | 207 | **用法示例**: |
| 271 | 208 | |
| 272 | 209 | ```bash |
| 273 | -# 默认:检查 embedding / translation / anchors 三类 | |
| 210 | +# 默认:检查 embedding / translation 两类 | |
| 274 | 211 | python scripts/redis/redis_cache_health_check.py |
| 275 | 212 | |
| 276 | -# 只检查某一类或两类 | |
| 213 | +# 只检查某一类 | |
| 277 | 214 | python scripts/redis/redis_cache_health_check.py --type embedding |
| 278 | -python scripts/redis/redis_cache_health_check.py --type translation anchors | |
| 215 | +python scripts/redis/redis_cache_health_check.py --type translation | |
| 279 | 216 | |
| 280 | 217 | # 按自定义 pattern 检查(不按业务类型) |
| 281 | 218 | python scripts/redis/redis_cache_health_check.py --pattern "mycache:*" |
| ... | ... | @@ -288,7 +225,7 @@ python scripts/redis/redis_cache_health_check.py --sample-size 100 --max-scan 50 |
| 288 | 225 | |
| 289 | 226 | | 参数 | 说明 | 默认 | |
| 290 | 227 | |------|------|------| |
| 291 | -| `--type` | 缓存类型:`embedding` / `translation` / `anchors`,可多选 | 三类都检查 | | |
| 228 | +| `--type` | 缓存类型:`embedding` / `translation`,可多选 | 两类都检查 | | |
| 292 | 229 | | `--pattern` | 自定义 key pattern(如 `mycache:*`),指定后忽略 `--type` | - | |
| 293 | 230 | | `--db` | Redis 数据库编号 | 0 | |
| 294 | 231 | | `--sample-size` | 每类采样的 key 数量 | 50 | |
| ... | ... | @@ -319,7 +256,7 @@ python scripts/redis/redis_cache_prefix_stats.py --all-db |
| 319 | 256 | python scripts/redis/redis_cache_prefix_stats.py --db 1 |
| 320 | 257 | |
| 321 | 258 | # 只统计指定前缀(可多个) |
| 322 | -python scripts/redis/redis_cache_prefix_stats.py --prefix trans embedding product_anchors | |
| 259 | +python scripts/redis/redis_cache_prefix_stats.py --prefix trans embedding | |
| 323 | 260 | |
| 324 | 261 | # 全 DB + 指定前缀 |
| 325 | 262 | python scripts/redis/redis_cache_prefix_stats.py --all-db --prefix trans embedding |
| ... | ... | @@ -369,7 +306,7 @@ python scripts/redis/redis_memory_heavy_keys.py --top 100 |
| 369 | 306 | |
| 370 | 307 | | 需求 | 推荐脚本 | |
| 371 | 308 | |------|----------| |
| 372 | -| 看三类业务缓存(embedding/translation/anchors)的数量、TTL、近期写入、样本 value | `redis_cache_health_check.py` | | |
| 309 | +| 看两类业务缓存(embedding/translation)的数量、TTL、近期写入、样本 value | `redis_cache_health_check.py` | | |
| 373 | 310 | | 看全库或某前缀的 key 条数与内存占比 | `redis_cache_prefix_stats.py` | |
| 374 | 311 | | 找占用内存最多的大 key、分析内存差异 | `redis_memory_heavy_keys.py` | |
| 375 | 312 | ... | ... |
indexer/ANCHORS_AND_SEMANTIC_ATTRIBUTES.md
| 1 | -## qanchors 与 enriched_attributes 设计与索引逻辑说明 | |
| 1 | +# qanchors 与 enriched_* 字段说明 | |
| 2 | 2 | |
| 3 | -本文档详细说明: | |
| 3 | +本文档原先记录本仓库内的内容理解实现细节。自 2026-04 起,这部分生成能力已经迁移到独立项目,本仓库不再维护 `/indexer/enrich-content` 路由,也不再在 indexer 构建链路内自动补齐这些字段。 | |
| 4 | 4 | |
| 5 | -- **锚文本字段 `qanchors.{lang}` 的作用与来源** | |
| 6 | -- **语义属性字段 `enriched_attributes` 的结构、用途与写入流程** | |
| 7 | -- **多语言支持策略(zh / en / de / ru / fr)** | |
| 8 | -- **索引阶段与 LLM 调用的集成方式** | |
| 5 | +当前状态: | |
| 9 | 6 | |
| 10 | -本设计已默认开启,无需额外开关;在上游 LLM 不可用时会自动降级为“无锚点/语义属性”,不影响主索引流程。 | |
| 11 | - | |
| 12 | ---- | |
| 13 | - | |
| 14 | -### 1. 字段设计概览 | |
| 15 | - | |
| 16 | -#### 1.1 `qanchors.{lang}`:面向查询的锚文本 | |
| 17 | - | |
| 18 | -- **Mapping 位置**:`mappings/search_products.json` 中的 `qanchors` 对象。 | |
| 19 | -- **结构**(与 `title.{lang}` 一致): | |
| 20 | - | |
| 21 | -```140:182:/home/tw/saas-search/mappings/search_products.json | |
| 22 | -"qanchors": { | |
| 23 | - "type": "object", | |
| 24 | - "properties": { | |
| 25 | - "zh": { "type": "text", "analyzer": "index_ik", "search_analyzer": "query_ik" }, | |
| 26 | - "en": { "type": "text", "analyzer": "english" }, | |
| 27 | - "de": { "type": "text", "analyzer": "german" }, | |
| 28 | - "ru": { "type": "text", "analyzer": "russian" }, | |
| 29 | - "fr": { "type": "text", "analyzer": "french" }, | |
| 30 | - ... | |
| 31 | - } | |
| 32 | -} | |
| 33 | -``` | |
| 34 | - | |
| 35 | -- **语义**: | |
| 36 | - 用于承载“更接近用户自然搜索行为”的词/短语(query-style anchors),包括: | |
| 37 | - - 品类 + 细分类别表达; | |
| 38 | - - 使用场景(通勤、约会、度假、office outfit 等); | |
| 39 | - - 适用人群(年轻女性、plus size、teen boys 等); | |
| 40 | - - 材质 / 关键属性 / 功能特点等。 | |
| 41 | - | |
| 42 | -- **使用场景**: | |
| 43 | - - 主搜索:作为额外的全文字段参与 BM25 召回与打分(可在 `search/query_config.py` 中给一定权重); | |
| 44 | - - Suggestion:`suggestion/builder.py` 会从 `qanchors.{lang}` 中拆分词条作为候选(`source="qanchor"`,权重大于 `title`)。 | |
| 45 | - | |
| 46 | -#### 1.2 `enriched_attributes`:面向过滤/分面的通用语义属性 | |
| 47 | - | |
| 48 | -- **Mapping 位置**:`mappings/search_products.json`,追加的 nested 字段。 | |
| 49 | -- **结构**: | |
| 50 | - | |
| 51 | -```1392:1410:/home/tw/saas-search/mappings/search_products.json | |
| 52 | -"enriched_attributes": { | |
| 53 | - "type": "nested", | |
| 54 | - "properties": { | |
| 55 | - "lang": { "type": "keyword" }, // 语言:zh / en / de / ru / fr | |
| 56 | - "name": { "type": "keyword" }, // 维度名:usage_scene / target_audience / material / ... | |
| 57 | - "value": { "type": "keyword" } // 维度值:通勤 / office / Baumwolle ... | |
| 58 | - } | |
| 59 | -} | |
| 60 | -``` | |
| 61 | - | |
| 62 | -- **语义**: | |
| 63 | - - 将 LLM 输出的各维度信息统一规约到 `name/value/lang` 三元组; | |
| 64 | - - 维度名稳定、值内容可变,便于后续扩展新的语义维度而不需要修改 mapping。 | |
| 65 | - | |
| 66 | -- **当前支持的维度名**(在 `document_transformer.py` 中固定列表): | |
| 67 | - - `tags`:细分标签/风格标签; | |
| 68 | - - `target_audience`:适用人群; | |
| 69 | - - `usage_scene`:使用场景; | |
| 70 | - - `season`:适用季节; | |
| 71 | - - `key_attributes`:关键属性; | |
| 72 | - - `material`:材质说明; | |
| 73 | - - `features`:功能特点。 | |
| 74 | - | |
| 75 | -- **使用场景**: | |
| 76 | - - 按语义维度过滤: | |
| 77 | - - 例:只要“适用人群=年轻女性”的商品; | |
| 78 | - - 例:`usage_scene` 包含 “office” 或 “通勤”。 | |
| 79 | - - 按语义维度分面 / 展示筛选项: | |
| 80 | - - 例:展示当前结果中所有 `usage_scene` 的分布,供前端勾选; | |
| 81 | - - 例:展示所有 `material` 值 + 命中文档数。 | |
| 82 | - | |
| 83 | ---- | |
| 84 | - | |
| 85 | -### 2. LLM 分析服务:`indexer/product_annotator.py` | |
| 86 | - | |
| 87 | -#### 2.1 入口函数:`analyze_products` | |
| 88 | - | |
| 89 | -- **文件**:`indexer/product_annotator.py` | |
| 90 | -- **函数签名**: | |
| 91 | - | |
| 92 | -```365:392:/home/tw/saas-search/indexer/product_annotator.py | |
| 93 | -def analyze_products( | |
| 94 | - products: List[Dict[str, str]], | |
| 95 | - target_lang: str = "zh", | |
| 96 | - batch_size: Optional[int] = None, | |
| 97 | -) -> List[Dict[str, Any]]: | |
| 98 | - """ | |
| 99 | - 库调用入口:根据输入+语言,返回锚文本及各维度信息。 | |
| 100 | - | |
| 101 | - Args: | |
| 102 | - products: [{"id": "...", "title": "..."}] | |
| 103 | - target_lang: 输出语言,需在 SUPPORTED_LANGS 内 | |
| 104 | - batch_size: 批大小,默认使用全局 BATCH_SIZE | |
| 105 | - """ | |
| 106 | - ... | |
| 107 | -``` | |
| 108 | - | |
| 109 | -- **支持的输出语言**(在同文件中定义): | |
| 110 | - | |
| 111 | -```54:62:/home/tw/saas-search/indexer/product_annotator.py | |
| 112 | -LANG_LABELS: Dict[str, str] = { | |
| 113 | - "zh": "中文", | |
| 114 | - "en": "英文", | |
| 115 | - "de": "德文", | |
| 116 | - "ru": "俄文", | |
| 117 | - "fr": "法文", | |
| 118 | -} | |
| 119 | -SUPPORTED_LANGS = set(LANG_LABELS.keys()) | |
| 120 | -``` | |
| 121 | - | |
| 122 | -- **返回结构**(每个商品一条记录): | |
| 123 | - | |
| 124 | -```python | |
| 125 | -{ | |
| 126 | - "id": "<SPU_ID>", | |
| 127 | - "lang": "<zh|en|de|ru|fr>", | |
| 128 | - "title_input": "<原始输入标题>", | |
| 129 | - "title": "<目标语言的标题>", | |
| 130 | - "category_path": "<LLM 生成的品类路径>", | |
| 131 | - "tags": "<逗号分隔的细分标签>", | |
| 132 | - "target_audience": "<逗号分隔的适用人群>", | |
| 133 | - "usage_scene": "<逗号分隔的使用场景>", | |
| 134 | - "season": "<逗号分隔的适用季节>", | |
| 135 | - "key_attributes": "<逗号分隔的关键属性>", | |
| 136 | - "material": "<逗号分隔的材质说明>", | |
| 137 | - "features": "<逗号分隔的功能特点>", | |
| 138 | - "anchor_text": "<逗号分隔的锚文本短语>", | |
| 139 | - # 若发生错误,还会附带: | |
| 140 | - # "error": "<异常信息>" | |
| 141 | -} | |
| 142 | -``` | |
| 143 | - | |
| 144 | -> 注意:表格中的多值字段(标签/场景/人群/材质等)约定为**使用逗号分隔**,后续索引端会统一按正则 `[,;|/\\n\\t]+` 再拆分为短语。 | |
| 145 | - | |
| 146 | -#### 2.2 Prompt 设计与语言控制 | |
| 147 | - | |
| 148 | -- Prompt 中会明确要求“**所有输出内容使用目标语言**”,并给出中英文示例: | |
| 149 | - | |
| 150 | -```65:81:/home/tw/saas-search/indexer/product_annotator.py | |
| 151 | -def create_prompt(products: List[Dict[str, str]], target_lang: str = "zh") -> str: | |
| 152 | - """创建LLM提示词(根据目标语言输出)""" | |
| 153 | - lang_label = LANG_LABELS.get(target_lang, "对应语言") | |
| 154 | - prompt = f"""请对输入的每条商品标题,分析并提取以下信息,所有输出内容请使用{lang_label}: | |
| 155 | - | |
| 156 | -1. 商品标题:将输入商品名称翻译为{lang_label} | |
| 157 | -2. 品类路径:从大类到细分品类,用">"分隔(例如:服装>女装>裤子>工装裤) | |
| 158 | -3. 细分标签:商品的风格、特点、功能等(例如:碎花,收腰,法式) | |
| 159 | -4. 适用人群:性别/年龄段等(例如:年轻女性) | |
| 160 | -5. 使用场景 | |
| 161 | -6. 适用季节 | |
| 162 | -7. 关键属性 | |
| 163 | -8. 材质说明 | |
| 164 | -9. 功能特点 | |
| 165 | -10. 商品卖点:分析和提取一句话核心卖点,用于推荐理由 | |
| 166 | -11. 锚文本:生成一组能够代表该商品、并可能被用户用于搜索的词语或短语。这些词语应覆盖用户需求的各个维度,如品类、细分标签、功能特性、需求场景等等。 | |
| 167 | -""" | |
| 168 | -``` | |
| 169 | - | |
| 170 | -- 返回格式固定为 Markdown 表格,首行头为: | |
| 171 | - | |
| 172 | -```89:91:/home/tw/saas-search/indexer/product_annotator.py | |
| 173 | -| 序号 | 商品标题 | 品类路径 | 细分标签 | 适用人群 | 使用场景 | 适用季节 | 关键属性 | 材质说明 | 功能特点 | 商品卖点 | 锚文本 | | |
| 174 | -|----|----|----|----|----|----|----|----|----|----|----|----| | |
| 175 | -``` | |
| 176 | - | |
| 177 | -`parse_markdown_table` 会按表格列顺序解析成字段。 | |
| 178 | - | |
| 179 | ---- | |
| 180 | - | |
| 181 | -### 3. 索引阶段集成:`SPUDocumentTransformer._fill_llm_attributes` | |
| 182 | - | |
| 183 | -#### 3.1 调用时机 | |
| 184 | - | |
| 185 | -在 `SPUDocumentTransformer.transform_spu_to_doc(...)` 的末尾,在所有基础字段(多语言文本、类目、SKU/规格、价格、库存等)填充完成后,会调用: | |
| 186 | - | |
| 187 | -```96:101:/home/tw/saas-search/indexer/document_transformer.py | |
| 188 | - # 文本字段处理(翻译等) | |
| 189 | - self._fill_text_fields(doc, spu_row, primary_lang) | |
| 190 | - | |
| 191 | - # 标题向量化 | |
| 192 | - if self.enable_title_embedding and self.encoder: | |
| 193 | - self._fill_title_embedding(doc) | |
| 194 | - ... | |
| 195 | - # 时间字段 | |
| 196 | - ... | |
| 197 | - | |
| 198 | - # 基于 LLM 的锚文本与语义属性(默认开启,失败时仅记录日志) | |
| 199 | - self._fill_llm_attributes(doc, spu_row) | |
| 200 | -``` | |
| 201 | - | |
| 202 | -也就是说,**每个 SPU 文档默认会尝试补充 qanchors 与 enriched_attributes**。 | |
| 203 | - | |
| 204 | -#### 3.2 语言选择策略 | |
| 205 | - | |
| 206 | -在 `_fill_llm_attributes` 内部: | |
| 207 | - | |
| 208 | -```148:164:/home/tw/saas-search/indexer/document_transformer.py | |
| 209 | - try: | |
| 210 | - index_langs = self.tenant_config.get("index_languages") or ["en", "zh"] | |
| 211 | - except Exception: | |
| 212 | - index_langs = ["en", "zh"] | |
| 213 | - | |
| 214 | - # 只在支持的语言集合内调用 | |
| 215 | - llm_langs = [lang for lang in index_langs if lang in SUPPORTED_LANGS] | |
| 216 | - if not llm_langs: | |
| 217 | - return | |
| 218 | -``` | |
| 219 | - | |
| 220 | -- `tenant_config.index_languages` 决定该租户希望在索引中支持哪些语言; | |
| 221 | -- 实际调用 LLM 的语言集合 = `index_languages ∩ SUPPORTED_LANGS`; | |
| 222 | -- 当前 SUPPORTED_LANGS:`{"zh", "en", "de", "ru", "fr"}`。 | |
| 223 | - | |
| 224 | -这保证了: | |
| 225 | - | |
| 226 | -- 如果租户只索引 `zh`,就只跑中文; | |
| 227 | -- 如果租户同时索引 `en` + `de`,就为这两种语言各跑一次 LLM; | |
| 228 | -- 如果 `index_languages` 里包含暂不支持的语言(例如 `es`),会被自动忽略。 | |
| 229 | - | |
| 230 | -#### 3.3 调用 LLM 并写入字段 | |
| 231 | - | |
| 232 | -核心逻辑(简化描述): | |
| 233 | - | |
| 234 | -```164:210:/home/tw/saas-search/indexer/document_transformer.py | |
| 235 | - spu_id = str(spu_row.get("id") or "").strip() | |
| 236 | - title = str(spu_row.get("title") or "").strip() | |
| 237 | - if not spu_id or not title: | |
| 238 | - return | |
| 239 | - | |
| 240 | - semantic_list = doc.get("enriched_attributes") or [] | |
| 241 | - qanchors_obj = doc.get("qanchors") or {} | |
| 242 | - | |
| 243 | - dim_keys = [ | |
| 244 | - "tags", | |
| 245 | - "target_audience", | |
| 246 | - "usage_scene", | |
| 247 | - "season", | |
| 248 | - "key_attributes", | |
| 249 | - "material", | |
| 250 | - "features", | |
| 251 | - ] | |
| 252 | - | |
| 253 | - for lang in llm_langs: | |
| 254 | - try: | |
| 255 | - rows = analyze_products( | |
| 256 | - products=[{"id": spu_id, "title": title}], | |
| 257 | - target_lang=lang, | |
| 258 | - batch_size=1, | |
| 259 | - ) | |
| 260 | - except Exception as e: | |
| 261 | - logger.warning("LLM attribute fill failed for SPU %s, lang=%s: %s", spu_id, lang, e) | |
| 262 | - continue | |
| 263 | - | |
| 264 | - if not rows: | |
| 265 | - continue | |
| 266 | - row = rows[0] or {} | |
| 267 | - | |
| 268 | - # qanchors.{lang} | |
| 269 | - anchor_text = str(row.get("anchor_text") or "").strip() | |
| 270 | - if anchor_text: | |
| 271 | - qanchors_obj[lang] = anchor_text | |
| 272 | - | |
| 273 | - # 语义属性 | |
| 274 | - for name in dim_keys: | |
| 275 | - raw = row.get(name) | |
| 276 | - if not raw: | |
| 277 | - continue | |
| 278 | - parts = re.split(r"[,;|/\n\t]+", str(raw)) | |
| 279 | - for part in parts: | |
| 280 | - value = part.strip() | |
| 281 | - if not value: | |
| 282 | - continue | |
| 283 | - semantic_list.append( | |
| 284 | - { | |
| 285 | - "lang": lang, | |
| 286 | - "name": name, | |
| 287 | - "value": value, | |
| 288 | - } | |
| 289 | - ) | |
| 290 | - | |
| 291 | - if qanchors_obj: | |
| 292 | - doc["qanchors"] = qanchors_obj | |
| 293 | - if semantic_list: | |
| 294 | - doc["enriched_attributes"] = semantic_list | |
| 295 | -``` | |
| 296 | - | |
| 297 | -要点: | |
| 298 | - | |
| 299 | -- 每种语言**单独调用一次** `analyze_products`,传入同一 SPU 的原始标题; | |
| 300 | -- 将返回的 `anchor_text` 直接写入 `qanchors.{lang}`,其内部仍是逗号分隔短语,后续 suggestion builder 会再拆分; | |
| 301 | -- 对各维度字段(tags/usage_scene/...)用统一正则进行“松散拆词”,过滤空串后,以 `(lang,name,value)` 三元组追加到 nested 数组; | |
| 302 | -- 如果某个维度在该语言下为空,则跳过,不写入任何条目。 | |
| 303 | - | |
| 304 | -#### 3.4 容错 & 降级策略 | |
| 305 | - | |
| 306 | -- 如果: | |
| 307 | - - 没有 `title`; | |
| 308 | - - 或者 `tenant_config.index_languages` 与 `SUPPORTED_LANGS` 没有交集; | |
| 309 | - - 或 `DASHSCOPE_API_KEY` 未配置 / LLM 请求报错; | |
| 310 | -- 则 `_fill_llm_attributes` 会在日志中输出 `warning`,**不会抛异常**,索引流程继续,只是该 SPU 在这一轮不会得到 `qanchors` / `enriched_attributes`。 | |
| 311 | - | |
| 312 | -这保证了整个索引服务在 LLM 不可用时表现为一个普通的“传统索引”,而不会中断。 | |
| 313 | - | |
| 314 | ---- | |
| 315 | - | |
| 316 | -### 4. 查询与 Suggestion 中的使用建议 | |
| 317 | - | |
| 318 | -#### 4.1 主搜索(Search API) | |
| 319 | - | |
| 320 | -在 `search/query_config.py` 或构建 ES 查询时,可以: | |
| 321 | - | |
| 322 | -- 将 `qanchors.{lang}` 作为额外的 `should` 字段参与匹配,并给一个略高的权重,例如: | |
| 323 | - | |
| 324 | -```json | |
| 325 | -{ | |
| 326 | - "multi_match": { | |
| 327 | - "query": "<user_query>", | |
| 328 | - "fields": [ | |
| 329 | - "title.zh^3.0", | |
| 330 | - "brief.zh^1.5", | |
| 331 | - "description.zh^1.0", | |
| 332 | - "vendor.zh^1.5", | |
| 333 | - "category_path.zh^1.5", | |
| 334 | - "category_name_text.zh^1.5", | |
| 335 | - "tags^1.0", | |
| 336 | - "qanchors.zh^2.0" // 建议新增 | |
| 337 | - ] | |
| 338 | - } | |
| 339 | -} | |
| 340 | -``` | |
| 341 | - | |
| 342 | -- 当用户做维度过滤时(例如“只看通勤场景 + 夏季 + 棉质”),可以在 filter 中增加 nested 查询: | |
| 343 | - | |
| 344 | -```json | |
| 345 | -{ | |
| 346 | - "nested": { | |
| 347 | - "path": "enriched_attributes", | |
| 348 | - "query": { | |
| 349 | - "bool": { | |
| 350 | - "must": [ | |
| 351 | - { "term": { "enriched_attributes.lang": "zh" } }, | |
| 352 | - { "term": { "enriched_attributes.name": "usage_scene" } }, | |
| 353 | - { "term": { "enriched_attributes.value": "通勤" } } | |
| 354 | - ] | |
| 355 | - } | |
| 356 | - } | |
| 357 | - } | |
| 358 | -} | |
| 359 | -``` | |
| 360 | - | |
| 361 | -多个维度可以通过多个 nested 子句组合(AND/OR 逻辑与 `specifications` 的设计类似)。 | |
| 362 | - | |
| 363 | -#### 4.2 Suggestion(联想词) | |
| 364 | - | |
| 365 | -现有 `suggestion/builder.py` 已经支持从 `qanchors.{lang}` 中提取候选: | |
| 366 | - | |
| 367 | -```249:287:/home/tw/saas-search/suggestion/builder.py | |
| 368 | - # Step 1: product title/qanchors | |
| 369 | - hits = self._scan_products(tenant_id, batch_size=batch_size) | |
| 370 | - ... | |
| 371 | - title_obj = src.get("title") or {} | |
| 372 | - qanchor_obj = src.get("qanchors") or {} | |
| 373 | - ... | |
| 374 | - for lang in index_languages: | |
| 375 | - ... | |
| 376 | - q_raw = None | |
| 377 | - if isinstance(qanchor_obj, dict): | |
| 378 | - q_raw = qanchor_obj.get(lang) | |
| 379 | - for q_text in self._split_qanchors(q_raw): | |
| 380 | - text_norm = self._normalize_text(q_text) | |
| 381 | - if self._looks_noise(text_norm): | |
| 382 | - continue | |
| 383 | - key = (lang, text_norm) | |
| 384 | - c = key_to_candidate.get(key) | |
| 385 | - if c is None: | |
| 386 | - c = SuggestionCandidate(text=q_text, text_norm=text_norm, lang=lang) | |
| 387 | - key_to_candidate[key] = c | |
| 388 | - c.add_product("qanchor", spu_id=spu_id, score=product_score + 0.6) | |
| 389 | -``` | |
| 390 | - | |
| 391 | -- `_split_qanchors` 使用与索引端一致的分隔符集合,确保: | |
| 392 | - - 无论 LLM 用逗号、分号还是换行分隔,只要符合约定,都能被拆成单独候选词; | |
| 393 | -- `add_product("qanchor", ...)` 会: | |
| 394 | - - 将来源标记为 `qanchor`; | |
| 395 | - - 在排序打分时,`qanchor` 命中会比纯 `title` 更有权重。 | |
| 396 | - | |
| 397 | ---- | |
| 398 | - | |
| 399 | -### 5. 总结与扩展方向 | |
| 400 | - | |
| 401 | -1. **功能定位**: | |
| 402 | - - `qanchors.{lang}`:更好地贴近用户真实查询词,用于召回与 suggestion; | |
| 403 | - - `enriched_attributes`:以结构化形式承载 LLM 抽取的语义维度,用于 filter / facet。 | |
| 404 | -2. **多语言对齐**: | |
| 405 | - - 完全复用租户级 `index_languages` 配置; | |
| 406 | - - 对每种语言单独生成锚文本与语义属性,不互相混用。 | |
| 407 | -3. **默认开启 / 自动降级**: | |
| 408 | - - 索引流程始终可用; | |
| 409 | - - 当 LLM/配置异常时,只是“缺少增强特征”,不影响基础搜索能力。 | |
| 410 | -4. **未来扩展**: | |
| 411 | - - 可以在 `dim_keys` 中新增维度名(如 `style`, `benefit` 等),只要在 prompt 与解析逻辑中增加对应列即可; | |
| 412 | - - 可以为 `enriched_attributes` 增加额外字段(如 `confidence`、`source`),用于更精细的控制(当前 mapping 为简单版)。 | |
| 413 | - | |
| 414 | -如需在查询层面增加基于 `enriched_attributes` 的统一 DSL(类似 `specifications` 的过滤/分面规则),推荐在 `docs/搜索API对接指南-01-搜索接口.md` 或 `docs/搜索API对接指南-08-数据模型与字段速查.md` 中新增一节,并在 `search/es_query_builder.py` 里封装构造逻辑,避免前端直接拼 nested 查询。 | |
| 7 | +- `search_products` mapping 仍保留 `qanchors`、`enriched_attributes`、`enriched_tags`、`enriched_taxonomy_attributes` 字段,便于外部服务继续产出并写入。 | |
| 8 | +- `suggestion/builder.py` 等消费侧仍会读取 ES 中已有的 `qanchors`。 | |
| 9 | +- `/indexer/build-docs`、`/indexer/build-docs-from-db`、`/indexer/reindex`、`/indexer/index` 只负责基础文档构建,不再调用本地 LLM 富化。 | |
| 415 | 10 | |
| 11 | +如需这些字段,请在独立内容理解服务中生成,并由上游索引程序自行合并到最终 ES 文档。 | ... | ... |
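消费侧读取 `qanchors` 时,通常要把按语言存储的值统一拆成候选短语。下面是一个假设性的示意(分隔符集合与 `split_qanchors` 均为示例,实际以 `suggestion/builder.py` 的实现为准):

```python
import re
from typing import Any, Dict, List

# 示例分隔符集合:兼容半角/全角逗号、分号及换行等
_SPLIT_RE = re.compile(r"[,;|/\n\t,;、]+")


def split_qanchors(qanchors: Dict[str, Any], lang: str) -> List[str]:
    """把某语言下的 qanchors 拆成去空白后的候选短语列表。"""
    raw = qanchors.get(lang)
    if raw is None:
        return []
    # 兼容两种存储形态:逗号分隔字符串,或短语数组
    parts = raw if isinstance(raw, list) else _SPLIT_RE.split(str(raw))
    return [p.strip() for p in parts if isinstance(p, str) and p.strip()]
```

拆分结果可直接作为 suggestion 候选词,再走各自的归一化与噪声过滤。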
indexer/README.md
| ... | ... | @@ -67,7 +67,7 @@ |
| 67 | 67 | |
| 68 | 68 | - ES 文档结构 `ProductIndexDocument` 的字段细节(title/brief/description/vendor/category_xxx/tags/specifications/skus/embedding 等)。 |
| 69 | 69 | - 翻译、向量等具体算法逻辑。 |
| 70 | -- qanchors/keywords 等新特征的计算。 | |
| 70 | +- `qanchors` 等外部内容理解字段的生成。 | |
| 71 | 71 | |
| 72 | 72 | **新职责边界**: |
| 73 | 73 | Java 只负责“**选出要索引的 SPU + 从 MySQL 拉取原始数据 + 调用 Python 服务**(或交给 Python 做完整索引)”。 |
| ... | ... | @@ -81,7 +81,7 @@ Java 只负责“**选出要索引的 SPU + 从 MySQL 拉取原始数据 + 调 |
| 81 | 81 | - 输入:**MySQL 基础数据**(`shoplazza_product_spu/sku/option/category/image` 等)。 |
| 82 | 82 | - 输出:**符合 `mappings/search_products.json` 的 doc 列表**,包括: |
| 83 | 83 |   - 多语言文本字段:`title.*`, `brief.*`, `description.*`, `vendor.*`, `category_path.*`, `category_name_text.*`; |
| 84 | -  - 算法特征:`title_embedding`, `image_embedding`, `qanchors.*`, `keywords.*`(未来扩展); | |
| 84 | +  - 算法特征:`title_embedding`, `image_embedding`; | |
| 85 | 85 |   - 结构化字段:`tags`, `specifications`, `skus`, `min_price`, `max_price`, `compare_at_price`, `total_inventory`, `sales` 等。 |
| 86 | 86 | - 附加: |
| 87 | 87 |   - 翻译调用 & **Redis 缓存**(继承 Java 的 key 组织和 TTL 策略); |
| ... | ... | @@ -370,14 +370,7 @@ if spu.tags: |
| 370 | 370 | |
| 371 | 371 | ### 7.2 qanchors / keywords 扩展 |
| 372 | 372 | |
| 373 | -- 当前 Java 中 `qanchors` 字段结构已存在,但未赋值; | |
| 374 | -- 设计建议: | |
| 375 | -  - 在 Python 侧基于: | |
| 376 | -    - 标题 / brief / description / tags / 类目等,做**查询锚点**抽取; | |
| 377 | -  - 按与 `title/keywords` 类似的多语言结构写入 `qanchors.{lang}`; | |
| 378 | -  - 翻译策略可选: | |
| 379 | -    - 在生成锚点后再调用翻译; | |
| 380 | -    - 或使用原始文本的翻译结果组合。 | |
| 373 | +该能力已迁移到独立内容理解服务。本仓库仍保留字段模型与消费侧能力,但不再负责在 indexer 内部生成 `qanchors` / `enriched_*`。 | |
| 381 | 374 | |
| 382 | 375 | --- |
| 383 | 376 | |
| ... | ... | @@ -436,8 +429,6 @@ if spu.tags: |
| 436 | 429 | "spu_id": "1", |
| 437 | 430 | "tenant_id": "123", |
| 438 | 431 | "title": { "en": "...", "zh": "...", ... }, |
| 439 | - "qanchors": { ... }, | |
| 440 | - "keywords": { ... }, | |
| 441 | 432 | "brief": { ... }, |
| 442 | 433 | "description": { ... }, |
| 443 | 434 | "vendor": { ... }, |
| ... | ... | @@ -496,7 +487,7 @@ if spu.tags: |
| 496 | 487 | - **保留现有 Java 调度 & 数据同步能力**,不破坏已有全量/增量任务和 MQ 削峰; |
| 497 | 488 | - **把 ES 文档结构、多语言逻辑、翻译与向量等算法能力全部收拢到 Python 索引富化模块**,实现“单一 owner”; |
| 498 | 489 | - **完全继承 Java 现有的翻译缓存策略**(Redis key & TTL & 维度),保证行为与性能的一致性; |
| 499 | -- **为未来字段扩展(qanchors、更多 tags/特征)预留清晰路径**:仅需在 Python 侧新增逻辑和 mapping,不再拉 Java 入伙。 | |
| 490 | +- **为未来字段扩展(包括外部内容理解字段接入)预留清晰路径**:字段模型可继续保留,但生成职责可独立演进。 | |
| 500 | 491 | |
| 501 | 492 | --- |
| 502 | 493 | |
| ... | ... | @@ -514,6 +505,7 @@ if spu.tags: |
| 514 | 505 | - **构建文档(正式使用)**:`POST /indexer/build-docs` |
| 515 | 506 |   - 入参:`tenant_id + items[ { spu, skus, options } ]` |
| 516 | 507 |   - 输出:`docs` 数组,每个元素是完整 ES doc,不查库、不写 ES。 |
| 508 | +  - 注意:当前不再内置生成 `qanchors` / `enriched_*`;如需这些字段,请由独立内容理解服务生成后自行合并。 | |
| 517 | 509 | |
| 518 | 510 | - **构建文档(测试用,内部查库)**:`POST /indexer/build-docs-from-db` |
| 519 | 511 | - å…¥å‚:`{"tenant_id": "...", "spu_ids": ["..."]}` | ... | ... |
indexer/document_transformer.py
| ... | ... | @@ -12,7 +12,6 @@ import pandas as pd |
| 12 | 12 | import numpy as np |
| 13 | 13 | import logging |
| 14 | 14 | from typing import Dict, Any, Optional, List |
| 15 | -from indexer.product_enrich import build_index_content_fields | |
| 16 | 15 | |
| 17 | 16 | logger = logging.getLogger(__name__) |
| 18 | 17 | |
| ... | ... | @@ -113,7 +112,6 @@ class SPUDocumentTransformer: |
| 113 | 112 | spu_row: pd.Series, |
| 114 | 113 | skus: pd.DataFrame, |
| 115 | 114 | options: pd.DataFrame, |
| 116 | - fill_llm_attributes: bool = True, | |
| 117 | 115 | ) -> Optional[Dict[str, Any]]: |
| 118 | 116 | """ |
| 119 | 117 | Convert a single SPU row and its SKUs into an ES document. |
| ... | ... | @@ -228,85 +226,8 @@ class SPUDocumentTransformer: |
| 228 | 226 | else: |
| 229 | 227 | doc['update_time'] = str(update_time) |
| 230 | 228 | |
| 231 | - # LLM-based anchor text and semantic attributes (on by default; failures are only logged) | |
| 232 | - # Note: batch scenarios (build-docs / bulk / incremental) should accumulate batches at the caller | |
| 233 | - # and call fill_llm_attributes_batch() instead, to avoid per-item LLM calls. | |
| 234 | - if fill_llm_attributes: | |
| 235 | - self._fill_llm_attributes(doc, spu_row) | |
| 236 | - | |
| 237 | 229 | return doc |
| 238 | 230 | |
| 239 | - def fill_llm_attributes_batch(self, docs: List[Dict[str, Any]], spu_rows: List[pd.Series]) -> None: | |
| 240 | - """ | |
| 241 | - Batch-call the LLM to fill, for a batch of docs: | |
| 242 | - - qanchors.{lang} | |
| 243 | - - enriched_tags.{lang} | |
| 244 | - - enriched_attributes[].value.{lang} | |
| 245 | - - enriched_taxonomy_attributes[].value.{lang} | |
| 246 | - | |
| 247 | - Design goals: | |
| 248 | - - accumulate LLM calls into batches wherever possible; | |
| 249 | - - at most 20 items per LLM call (analyze_products enforces the cap and splits batches automatically). | |
| 250 | - """ | |
| 251 | - if not docs or not spu_rows or len(docs) != len(spu_rows): | |
| 252 | - return | |
| 253 | - | |
| 254 | - id_to_idx: Dict[str, int] = {} | |
| 255 | - items: List[Dict[str, str]] = [] | |
| 256 | - for i, row in enumerate(spu_rows): | |
| 257 | - raw_id = row.get("id") | |
| 258 | - spu_id = "" if raw_id is None else str(raw_id).strip() | |
| 259 | - title = str(row.get("title") or "").strip() | |
| 260 | - if not spu_id or not title: | |
| 261 | - continue | |
| 262 | - id_to_idx[spu_id] = i | |
| 263 | - items.append( | |
| 264 | - { | |
| 265 | - "id": spu_id, | |
| 266 | - "title": title, | |
| 267 | - "brief": str(row.get("brief") or "").strip(), | |
| 268 | - "description": str(row.get("description") or "").strip(), | |
| 269 | - "image_url": str(row.get("image_src") or "").strip(), | |
| 270 | - } | |
| 271 | - ) | |
| 272 | - if not items: | |
| 273 | - return | |
| 274 | - | |
| 275 | - tenant_id = str(docs[0].get("tenant_id") or "").strip() or None | |
| 276 | - try: | |
| 277 | - # TODO: read the tenant's real industry from the database and replace the current default apparel profile accordingly. | |
| 278 | - results = build_index_content_fields( | |
| 279 | - items=items, | |
| 280 | - tenant_id=tenant_id, | |
| 281 | - category_taxonomy_profile="apparel", | |
| 282 | - ) | |
| 283 | - except Exception as e: | |
| 284 | - logger.warning("LLM batch attribute fill failed: %s", e) | |
| 285 | - return | |
| 286 | - | |
| 287 | - for result in results: | |
| 288 | - spu_id = str(result.get("id") or "").strip() | |
| 289 | - if not spu_id: | |
| 290 | - continue | |
| 291 | - idx = id_to_idx.get(spu_id) | |
| 292 | - if idx is None: | |
| 293 | - continue | |
| 294 | - self._apply_content_enrichment(docs[idx], result) | |
| 295 | - | |
| 296 | - def _apply_content_enrichment(self, doc: Dict[str, Any], enrichment: Dict[str, Any]) -> None: | |
| 297 | - """Write the ES-ready content fields produced by product_enrich into the doc.""" | |
| 298 | - try: | |
| 299 | - if enrichment.get("qanchors"): | |
| 300 | - doc["qanchors"] = enrichment["qanchors"] | |
| 301 | - if enrichment.get("enriched_tags"): | |
| 302 | - doc["enriched_tags"] = enrichment["enriched_tags"] | |
| 303 | - if enrichment.get("enriched_attributes"): | |
| 304 | - doc["enriched_attributes"] = enrichment["enriched_attributes"] | |
| 305 | - if enrichment.get("enriched_taxonomy_attributes"): | |
| 306 | - doc["enriched_taxonomy_attributes"] = enrichment["enriched_taxonomy_attributes"] | |
| 307 | - except Exception as e: | |
| 308 | - logger.warning("Failed to apply enrichment to doc (spu_id=%s): %s", doc.get("spu_id"), e) | |
| 309 | - | |
| 310 | 231 | def _fill_text_fields( |
| 311 | 232 | self, |
| 312 | 233 | doc: Dict[str, Any], |
| ... | ... | @@ -660,41 +581,6 @@ class SPUDocumentTransformer: |
| 660 | 581 | else: |
| 661 | 582 | doc['option3_values'] = [] |
| 662 | 583 | |
| 663 | - def _fill_llm_attributes(self, doc: Dict[str, Any], spu_row: pd.Series) -> None: | |
| 664 | - """ | |
| 665 | - Call the high-level content-understanding entry point of indexer.product_enrich to fill, for the current SPU: | |
| 666 | - - qanchors.{lang} | |
| 667 | - - enriched_tags.{lang} | |
| 668 | - - enriched_attributes[].value.{lang} | |
| 669 | - """ | |
| 670 | - spu_id = str(spu_row.get("id") or "").strip() | |
| 671 | - title = str(spu_row.get("title") or "").strip() | |
| 672 | - if not spu_id or not title: | |
| 673 | - return | |
| 674 | - | |
| 675 | - tenant_id = doc.get("tenant_id") | |
| 676 | - try: | |
| 677 | - # TODO: read the tenant's real industry from the database and replace the current default apparel profile accordingly. | |
| 678 | - results = build_index_content_fields( | |
| 679 | - items=[ | |
| 680 | - { | |
| 681 | - "id": spu_id, | |
| 682 | - "title": title, | |
| 683 | - "brief": str(spu_row.get("brief") or "").strip(), | |
| 684 | - "description": str(spu_row.get("description") or "").strip(), | |
| 685 | - "image_url": str(spu_row.get("image_src") or "").strip(), | |
| 686 | - } | |
| 687 | - ], | |
| 688 | - tenant_id=str(tenant_id), | |
| 689 | - category_taxonomy_profile="apparel", | |
| 690 | - ) | |
| 691 | - except Exception as e: | |
| 692 | - logger.warning("LLM attribute fill failed for SPU %s: %s", spu_id, e) | |
| 693 | - return | |
| 694 | - | |
| 695 | - if results: | |
| 696 | - self._apply_content_enrichment(doc, results[0]) | |
| 697 | - | |
| 698 | 584 | def _transform_sku_row(self, sku_row: pd.Series, option_name_map: Dict[int, str] = None) -> Optional[Dict[str, Any]]: |
| 699 | 585 | """ |
| 700 | 586 | Convert a SKU row into a SKU object. | ... | ... |
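The `transform()` signature change above (dropping `fill_llm_attributes`) will break any call site that still passes the keyword. One tolerant migration pattern is to filter kwargs against the live signature during rollout; a sketch with a hypothetical stand-in transformer (`DummyTransformer` and `call_transform` are illustrations, not part of the repo):

```python
import inspect
from typing import Any, Dict, Optional

class DummyTransformer:
    # Stand-in for SPUDocumentTransformer after the removal:
    # transform() no longer accepts fill_llm_attributes.
    def transform(self, spu_row: Dict[str, Any], skus: Any = None,
                  options: Any = None) -> Optional[Dict[str, Any]]:
        return {"spu_id": str(spu_row.get("id"))}

def call_transform(transformer: Any, **kwargs: Any) -> Optional[Dict[str, Any]]:
    """Drop keyword args the target transform() no longer accepts,
    so stale call sites (still passing fill_llm_attributes=...) fail soft."""
    accepted = set(inspect.signature(transformer.transform).parameters)
    return transformer.transform(**{k: v for k, v in kwargs.items() if k in accepted})

doc = call_transform(DummyTransformer(), spu_row={"id": 1}, fill_llm_attributes=False)
print(doc)  # → {'spu_id': '1'}
```

In this repo the stale call site in incremental_service.py was updated directly instead, which is the cleaner fix once all callers are in one codebase.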
indexer/incremental_service.py
| ... | ... | @@ -584,7 +584,6 @@ class IncrementalIndexerService: |
| 584 | 584 | transformer, encoder, enable_embedding = self._get_transformer_bundle(tenant_id) |
| 585 | 585 | |
| 586 | 586 | # Process active SPUs in input order |
| 587 | - doc_spu_rows: List[pd.Series] = [] | |
| 588 | 587 | for spu_id in spu_ids: |
| 589 | 588 | try: |
| 590 | 589 | spu_id_int = int(spu_id) |
| ... | ... | @@ -603,7 +602,6 @@ class IncrementalIndexerService: |
| 603 | 602 | spu_row=spu_row, |
| 604 | 603 | skus=skus_for_spu, |
| 605 | 604 | options=opts_for_spu, |
| 606 | - fill_llm_attributes=False, | |
| 607 | 605 | ) |
| 608 | 606 | if doc is None: |
| 609 | 607 | error_msg = "SPU transform returned None" |
| ... | ... | @@ -612,14 +610,6 @@ class IncrementalIndexerService: |
| 612 | 610 | continue |
| 613 | 611 | |
| 614 | 612 | documents.append((spu_id, doc)) |
| 615 | - doc_spu_rows.append(spu_row) | |
| 616 | - | |
| 617 | - # Batch-fill LLM fields (accumulate batches where possible, at most 20 per call; failures only warn and do not block the main flow) | |
| 618 | - try: | |
| 619 | - if documents and doc_spu_rows: | |
| 620 | - transformer.fill_llm_attributes_batch([d for _, d in documents], doc_spu_rows) | |
| 621 | - except Exception as e: | |
| 622 | - logger.warning("[IncrementalIndexing] Batch LLM fill failed: %s", e) | |
| 623 | 613 | |
| 624 | 614 | # Generate embeddings in batch (translation logic unchanged; embeddings go through the cache) |
| 625 | 615 | if enable_embedding and encoder and documents: | ... | ... |
indexer/product_enrich.py deleted
| ... | ... | @@ -1,1421 +0,0 @@ |
| 1 | -#!/usr/bin/env python3 | |
| 2 | -""" | |
| 3 | -Product content understanding and attribute enrichment module (product_enrich) | |
| 4 | - | |
| 5 | -Provides LLM-based analysis of product anchor text / semantic attributes / tags, | |
| 6 | -for in-memory use by the indexer and the API (no longer responsible for CSV reads/writes). | |
| 7 | -""" | |
| 8 | - | |
| 9 | -import os | |
| 10 | -import json | |
| 11 | -import logging | |
| 12 | -import re | |
| 13 | -import time | |
| 14 | -import hashlib | |
| 15 | -import uuid | |
| 16 | -import threading | |
| 17 | -from dataclasses import dataclass, field | |
| 18 | -from collections import OrderedDict | |
| 19 | -from datetime import datetime | |
| 20 | -from concurrent.futures import ThreadPoolExecutor | |
| 21 | -from typing import List, Dict, Tuple, Any, Optional, FrozenSet | |
| 22 | - | |
| 23 | -import redis | |
| 24 | -import requests | |
| 25 | -from pathlib import Path | |
| 26 | - | |
| 27 | -from config.loader import get_app_config | |
| 28 | -from config.tenant_config_loader import SOURCE_LANG_CODE_MAP | |
| 29 | -from indexer.product_enrich_prompts import ( | |
| 30 | - SYSTEM_MESSAGE, | |
| 31 | - USER_INSTRUCTION_TEMPLATE, | |
| 32 | - LANGUAGE_MARKDOWN_TABLE_HEADERS, | |
| 33 | - SHARED_ANALYSIS_INSTRUCTION, | |
| 34 | - CATEGORY_TAXONOMY_PROFILES, | |
| 35 | -) | |
| 36 | - | |
| 37 | -# Configuration | |
| 38 | -BATCH_SIZE = 20 | |
| 39 | -# Max concurrent workers for enrich-content LLM batches (thread pool; only uncached batches run concurrently) | |
| 40 | -_APP_CONFIG = get_app_config() | |
| 41 | -CONTENT_UNDERSTANDING_MAX_WORKERS = int(_APP_CONFIG.product_enrich.max_workers) | |
| 42 | -# China North 2 (Beijing): https://dashscope.aliyuncs.com/compatible-mode/v1 | |
| 43 | -# Singapore: https://dashscope-intl.aliyuncs.com/compatible-mode/v1 | |
| 44 | -# US (Virginia): https://dashscope-us.aliyuncs.com/compatible-mode/v1 | |
| 45 | -API_BASE_URL = "https://dashscope-us.aliyuncs.com/compatible-mode/v1" | |
| 46 | -MODEL_NAME = "qwen-flash" | |
| 47 | -API_KEY = os.environ.get("DASHSCOPE_API_KEY") | |
| 48 | -MAX_RETRIES = 3 | |
| 49 | -RETRY_DELAY = 5 # seconds | |
| 50 | -REQUEST_TIMEOUT = 180 # seconds | |
| 51 | -LOGGED_SHARED_CONTEXT_CACHE_SIZE = 256 | |
| 52 | -PROMPT_INPUT_MIN_ZH_CHARS = 20 | |
| 53 | -PROMPT_INPUT_MAX_ZH_CHARS = 100 | |
| 54 | -PROMPT_INPUT_MIN_WORDS = 16 | |
| 55 | -PROMPT_INPUT_MAX_WORDS = 80 | |
| 56 | - | |
| 57 | -# Log paths | |
| 58 | -OUTPUT_DIR = Path("output_logs") | |
| 59 | -LOG_DIR = OUTPUT_DIR / "logs" | |
| 60 | - | |
| 61 | -# Set up standalone logging (does not affect the global indexer.log) | |
| 62 | -LOG_DIR.mkdir(parents=True, exist_ok=True) | |
| 63 | -timestamp = datetime.now().strftime("%Y%m%d_%H%M%S") | |
| 64 | -log_file = LOG_DIR / f"product_enrich_{timestamp}.log" | |
| 65 | -verbose_log_file = LOG_DIR / "product_enrich_verbose.log" | |
| 66 | -_logged_shared_context_keys: "OrderedDict[str, None]" = OrderedDict() | |
| 67 | -_logged_shared_context_lock = threading.Lock() | |
| 68 | - | |
| 69 | -_content_understanding_executor: Optional[ThreadPoolExecutor] = None | |
| 70 | -_content_understanding_executor_lock = threading.Lock() | |
| 71 | - | |
| 72 | - | |
| 73 | -def _get_content_understanding_executor() -> ThreadPoolExecutor: | |
| 74 | - """ | |
| 75 | - Use a module-level singleton thread pool, so repeated requests within one process do not create stacked pools and lose control of concurrency. | |
| 76 | - """ | |
| 77 | - global _content_understanding_executor | |
| 78 | - with _content_understanding_executor_lock: | |
| 79 | - if _content_understanding_executor is None: | |
| 80 | - _content_understanding_executor = ThreadPoolExecutor( | |
| 81 | - max_workers=CONTENT_UNDERSTANDING_MAX_WORKERS, | |
| 82 | - thread_name_prefix="product-enrich-llm", | |
| 83 | - ) | |
| 84 | - return _content_understanding_executor | |
| 85 | - | |
| 86 | -# Main logger: execution flow, batch info, etc. | |
| 87 | -logger = logging.getLogger("product_enrich") | |
| 88 | -logger.setLevel(logging.INFO) | |
| 89 | - | |
| 90 | -if not logger.handlers: | |
| 91 | - formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s") | |
| 92 | - | |
| 93 | - file_handler = logging.FileHandler(log_file, encoding="utf-8") | |
| 94 | - file_handler.setFormatter(formatter) | |
| 95 | - | |
| 96 | - stream_handler = logging.StreamHandler() | |
| 97 | - stream_handler.setFormatter(formatter) | |
| 98 | - | |
| 99 | - logger.addHandler(file_handler) | |
| 100 | - logger.addHandler(stream_handler) | |
| 101 | - | |
| 102 | - # Do not propagate to the root logger, to avoid writing into logs/indexer.log and other files | |
| 103 | - logger.propagate = False | |
| 104 | - | |
| 105 | -# Verbose logger: dedicated to recording LLM requests and responses | |
| 106 | -verbose_logger = logging.getLogger("product_enrich_verbose") | |
| 107 | -verbose_logger.setLevel(logging.INFO) | |
| 108 | - | |
| 109 | -if not verbose_logger.handlers: | |
| 110 | - verbose_formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s") | |
| 111 | - verbose_file_handler = logging.FileHandler(verbose_log_file, encoding="utf-8") | |
| 112 | - verbose_file_handler.setFormatter(verbose_formatter) | |
| 113 | - verbose_logger.addHandler(verbose_file_handler) | |
| 114 | - verbose_logger.propagate = False | |
| 115 | - | |
| 116 | -logger.info("Verbose LLM logs are written to: %s", verbose_log_file) | |
| 117 | - | |
| 118 | - | |
| 119 | -# Redis cache (for anchors / semantic attributes) | |
| 120 | -_REDIS_CONFIG = _APP_CONFIG.infrastructure.redis | |
| 121 | -ANCHOR_CACHE_PREFIX = _REDIS_CONFIG.anchor_cache_prefix | |
| 122 | -ANCHOR_CACHE_EXPIRE_DAYS = int(_REDIS_CONFIG.anchor_cache_expire_days) | |
| 123 | -_anchor_redis: Optional[redis.Redis] = None | |
| 124 | - | |
| 125 | -try: | |
| 126 | - _anchor_redis = redis.Redis( | |
| 127 | - host=_REDIS_CONFIG.host, | |
| 128 | - port=_REDIS_CONFIG.port, | |
| 129 | - password=_REDIS_CONFIG.password, | |
| 130 | - decode_responses=True, | |
| 131 | - socket_timeout=_REDIS_CONFIG.socket_timeout, | |
| 132 | - socket_connect_timeout=_REDIS_CONFIG.socket_connect_timeout, | |
| 133 | - retry_on_timeout=_REDIS_CONFIG.retry_on_timeout, | |
| 134 | - health_check_interval=10, | |
| 135 | - ) | |
| 136 | - _anchor_redis.ping() | |
| 137 | - logger.info("Redis cache initialized for product anchors and semantic attributes") | |
| 138 | -except Exception as e: | |
| 139 | - logger.warning(f"Failed to initialize Redis for anchors cache: {e}") | |
| 140 | - _anchor_redis = None | |
| 141 | - | |
| 142 | -_missing_prompt_langs = sorted(set(SOURCE_LANG_CODE_MAP) - set(LANGUAGE_MARKDOWN_TABLE_HEADERS)) | |
| 143 | -if _missing_prompt_langs: | |
| 144 | - raise RuntimeError( | |
| 145 | - f"Missing product_enrich prompt config for languages: {_missing_prompt_langs}" | |
| 146 | - ) | |
| 147 | - | |
| 148 | - | |
| 149 | -# Multi-value field separators | |
| 150 | -_MULTI_VALUE_FIELD_SPLIT_RE = re.compile(r"[,、,;|/\n\t]+") | |
| 151 | -# Placeholders treated as "no content" in table cells | |
| 152 | -_MARKDOWN_EMPTY_CELL_LITERALS: Tuple[str, ...] = ("-", "–", "—", "none", "null", "n/a", "无") | |
| 153 | -_MARKDOWN_EMPTY_CELL_TOKENS_CF: FrozenSet[str] = frozenset( | |
| 154 | - lit.casefold() for lit in _MARKDOWN_EMPTY_CELL_LITERALS | |
| 155 | -) | |
| 156 | - | |
| 157 | -def _normalize_markdown_table_cell(raw: Optional[str]) -> str: | |
| 158 | - """Strip; treat placeholder literals uniformly as empty strings.""" | |
| 159 | - s = str(raw or "").strip() | |
| 160 | - if not s: | |
| 161 | - return "" | |
| 162 | - if s.casefold() in _MARKDOWN_EMPTY_CELL_TOKENS_CF: | |
| 163 | - return "" | |
| 164 | - return s | |
| 165 | -_CORE_INDEX_LANGUAGES = ("zh", "en") | |
| 166 | -_DEFAULT_ENRICHMENT_SCOPES = ("generic", "category_taxonomy") | |
| 167 | -_DEFAULT_CATEGORY_TAXONOMY_PROFILE = "apparel" | |
| 168 | -_CONTENT_ANALYSIS_ATTRIBUTE_FIELD_MAP = ( | |
| 169 | - ("tags", "enriched_tags"), | |
| 170 | - ("target_audience", "target_audience"), | |
| 171 | - ("usage_scene", "usage_scene"), | |
| 172 | - ("season", "season"), | |
| 173 | - ("key_attributes", "key_attributes"), | |
| 174 | - ("material", "material"), | |
| 175 | - ("features", "features"), | |
| 176 | -) | |
| 177 | -_CONTENT_ANALYSIS_RESULT_FIELDS = ( | |
| 178 | - "title", | |
| 179 | - "category_path", | |
| 180 | - "tags", | |
| 181 | - "target_audience", | |
| 182 | - "usage_scene", | |
| 183 | - "season", | |
| 184 | - "key_attributes", | |
| 185 | - "material", | |
| 186 | - "features", | |
| 187 | - "anchor_text", | |
| 188 | -) | |
| 189 | -_CONTENT_ANALYSIS_MEANINGFUL_FIELDS = ( | |
| 190 | - "tags", | |
| 191 | - "target_audience", | |
| 192 | - "usage_scene", | |
| 193 | - "season", | |
| 194 | - "key_attributes", | |
| 195 | - "material", | |
| 196 | - "features", | |
| 197 | - "anchor_text", | |
| 198 | -) | |
| 199 | -_CONTENT_ANALYSIS_FIELD_ALIASES = { | |
| 200 | - "tags": ("tags", "enriched_tags"), | |
| 201 | -} | |
| 202 | -_CONTENT_ANALYSIS_QUALITY_FIELDS = ("title", "category_path", "anchor_text") | |
| 203 | - | |
| 204 | - | |
| 205 | -@dataclass(frozen=True) | |
| 206 | -class AnalysisSchema: | |
| 207 | - name: str | |
| 208 | - shared_instruction: str | |
| 209 | - markdown_table_headers: Dict[str, List[str]] | |
| 210 | - result_fields: Tuple[str, ...] | |
| 211 | - meaningful_fields: Tuple[str, ...] | |
| 212 | - cache_version: str = "v1" | |
| 213 | - field_aliases: Dict[str, Tuple[str, ...]] = field(default_factory=dict) | |
| 214 | - quality_fields: Tuple[str, ...] = () | |
| 215 | - | |
| 216 | - def get_headers(self, target_lang: str) -> Optional[List[str]]: | |
| 217 | - return self.markdown_table_headers.get(target_lang) | |
| 218 | - | |
| 219 | - | |
| 220 | -_ANALYSIS_SCHEMAS: Dict[str, AnalysisSchema] = { | |
| 221 | - "content": AnalysisSchema( | |
| 222 | - name="content", | |
| 223 | - shared_instruction=SHARED_ANALYSIS_INSTRUCTION, | |
| 224 | - markdown_table_headers=LANGUAGE_MARKDOWN_TABLE_HEADERS, | |
| 225 | - result_fields=_CONTENT_ANALYSIS_RESULT_FIELDS, | |
| 226 | - meaningful_fields=_CONTENT_ANALYSIS_MEANINGFUL_FIELDS, | |
| 227 | - cache_version="v2", | |
| 228 | - field_aliases=_CONTENT_ANALYSIS_FIELD_ALIASES, | |
| 229 | - quality_fields=_CONTENT_ANALYSIS_QUALITY_FIELDS, | |
| 230 | - ), | |
| 231 | -} | |
| 232 | - | |
| 233 | -def _build_taxonomy_profile_schema(profile: str, config: Dict[str, Any]) -> AnalysisSchema: | |
| 234 | - return AnalysisSchema( | |
| 235 | - name=f"taxonomy:{profile}", | |
| 236 | - shared_instruction=config["shared_instruction"], | |
| 237 | - markdown_table_headers=config["markdown_table_headers"], | |
| 238 | - result_fields=tuple(field["key"] for field in config["fields"]), | |
| 239 | - meaningful_fields=tuple(field["key"] for field in config["fields"]), | |
| 240 | - cache_version="v1", | |
| 241 | - ) | |
| 242 | - | |
| 243 | - | |
| 244 | -_CATEGORY_TAXONOMY_PROFILE_SCHEMAS: Dict[str, AnalysisSchema] = { | |
| 245 | - profile: _build_taxonomy_profile_schema(profile, config) | |
| 246 | - for profile, config in CATEGORY_TAXONOMY_PROFILES.items() | |
| 247 | -} | |
| 248 | - | |
| 249 | -_CATEGORY_TAXONOMY_PROFILE_ATTRIBUTE_FIELD_MAPS: Dict[str, Tuple[Tuple[str, str], ...]] = { | |
| 250 | - profile: tuple((field["key"], field["label"]) for field in config["fields"]) | |
| 251 | - for profile, config in CATEGORY_TAXONOMY_PROFILES.items() | |
| 252 | -} | |
| 253 | - | |
| 254 | - | |
| 255 | -def get_supported_category_taxonomy_profiles() -> Tuple[str, ...]: | |
| 256 | - return tuple(_CATEGORY_TAXONOMY_PROFILE_SCHEMAS.keys()) | |
| 257 | - | |
| 258 | - | |
| 259 | -def _normalize_category_taxonomy_profile(category_taxonomy_profile: Optional[str] = None) -> str: | |
| 260 | - profile = str(category_taxonomy_profile or _DEFAULT_CATEGORY_TAXONOMY_PROFILE).strip() | |
| 261 | - if profile not in _CATEGORY_TAXONOMY_PROFILE_SCHEMAS: | |
| 262 | - supported = ", ".join(get_supported_category_taxonomy_profiles()) | |
| 263 | - raise ValueError( | |
| 264 | - f"Unsupported category_taxonomy_profile: {profile}. Supported profiles: {supported}" | |
| 265 | - ) | |
| 266 | - return profile | |
| 267 | - | |
| 268 | - | |
| 269 | -def _get_analysis_schema( | |
| 270 | - analysis_kind: str, | |
| 271 | - *, | |
| 272 | - category_taxonomy_profile: Optional[str] = None, | |
| 273 | -) -> AnalysisSchema: | |
| 274 | - if analysis_kind == "content": | |
| 275 | - return _ANALYSIS_SCHEMAS["content"] | |
| 276 | - if analysis_kind == "taxonomy": | |
| 277 | - profile = _normalize_category_taxonomy_profile(category_taxonomy_profile) | |
| 278 | - return _CATEGORY_TAXONOMY_PROFILE_SCHEMAS[profile] | |
| 279 | - raise ValueError(f"Unsupported analysis_kind: {analysis_kind}") | |
| 280 | - | |
| 281 | - | |
| 282 | -def _get_taxonomy_attribute_field_map( | |
| 283 | - category_taxonomy_profile: Optional[str] = None, | |
| 284 | -) -> Tuple[Tuple[str, str], ...]: | |
| 285 | - profile = _normalize_category_taxonomy_profile(category_taxonomy_profile) | |
| 286 | - return _CATEGORY_TAXONOMY_PROFILE_ATTRIBUTE_FIELD_MAPS[profile] | |
| 287 | - | |
| 288 | - | |
| 289 | -def _normalize_enrichment_scopes( | |
| 290 | - enrichment_scopes: Optional[List[str]] = None, | |
| 291 | -) -> Tuple[str, ...]: | |
| 292 | - requested = _DEFAULT_ENRICHMENT_SCOPES if not enrichment_scopes else tuple(enrichment_scopes) | |
| 293 | - normalized: List[str] = [] | |
| 294 | - seen = set() | |
| 295 | - for enrichment_scope in requested: | |
| 296 | - scope = str(enrichment_scope).strip() | |
| 297 | - if scope not in {"generic", "category_taxonomy"}: | |
| 298 | - raise ValueError(f"Unsupported enrichment_scope: {scope}") | |
| 299 | - if scope in seen: | |
| 300 | - continue | |
| 301 | - seen.add(scope) | |
| 302 | - normalized.append(scope) | |
| 303 | - return tuple(normalized) | |
| 304 | - | |
| 305 | - | |
| 306 | -def split_multi_value_field(text: Optional[str]) -> List[str]: | |
| 307 | - """Split a multi-value string from LLM/business output into a list of phrases (stripped, empties removed).""" | |
| 308 | - if text is None: | |
| 309 | - return [] | |
| 310 | - s = str(text).strip() | |
| 311 | - if not s: | |
| 312 | - return [] | |
| 313 | - return [p.strip() for p in _MULTI_VALUE_FIELD_SPLIT_RE.split(s) if p.strip()] | |
| 314 | - | |
| 315 | - | |
| 316 | -def _append_lang_phrase_map(target: Dict[str, List[str]], lang: str, raw_value: Any) -> None: | |
| 317 | - parts = split_multi_value_field(raw_value) | |
| 318 | - if not parts: | |
| 319 | - return | |
| 320 | - existing = target.get(lang) or [] | |
| 321 | - merged = list(dict.fromkeys([str(x).strip() for x in existing if str(x).strip()] + parts)) | |
| 322 | - if merged: | |
| 323 | - target[lang] = merged | |
| 324 | - | |
| 325 | - | |
| 326 | -def _get_or_create_named_value_entry( | |
| 327 | - target: List[Dict[str, Any]], | |
| 328 | - name: str, | |
| 329 | - *, | |
| 330 | - default_value: Optional[Dict[str, Any]] = None, | |
| 331 | -) -> Dict[str, Any]: | |
| 332 | - for item in target: | |
| 333 | - if item.get("name") == name: | |
| 334 | - value = item.get("value") | |
| 335 | - if isinstance(value, dict): | |
| 336 | - return item | |
| 337 | - break | |
| 338 | - | |
| 339 | - entry = {"name": name, "value": default_value or {}} | |
| 340 | - target.append(entry) | |
| 341 | - return entry | |
| 342 | - | |
| 343 | - | |
| 344 | -def _append_named_lang_phrase_map( | |
| 345 | - target: List[Dict[str, Any]], | |
| 346 | - name: str, | |
| 347 | - lang: str, | |
| 348 | - raw_value: Any, | |
| 349 | -) -> None: | |
| 350 | - entry = _get_or_create_named_value_entry(target, name=name, default_value={}) | |
| 351 | - _append_lang_phrase_map(entry["value"], lang=lang, raw_value=raw_value) | |
| 352 | - | |
| 353 | - | |
| 354 | -def _get_product_id(product: Dict[str, Any]) -> str: | |
| 355 | - return str(product.get("id") or product.get("spu_id") or "").strip() | |
| 356 | - | |
| 357 | - | |
| 358 | -def _get_analysis_field_aliases(field_name: str, schema: AnalysisSchema) -> Tuple[str, ...]: | |
| 359 | - return schema.field_aliases.get(field_name, (field_name,)) | |
| 360 | - | |
| 361 | - | |
| 362 | -def _get_analysis_field_value(row: Dict[str, Any], field_name: str, schema: AnalysisSchema) -> Any: | |
| 363 | - for alias in _get_analysis_field_aliases(field_name, schema): | |
| 364 | - if alias in row: | |
| 365 | - return row.get(alias) | |
| 366 | - return None | |
| 367 | - | |
| 368 | - | |
| 369 | -def _has_meaningful_value(value: Any) -> bool: | |
| 370 | - if value is None: | |
| 371 | - return False | |
| 372 | - if isinstance(value, str): | |
| 373 | - return bool(value.strip()) | |
| 374 | - if isinstance(value, dict): | |
| 375 | - return any(_has_meaningful_value(v) for v in value.values()) | |
| 376 | - if isinstance(value, list): | |
| 377 | - return any(_has_meaningful_value(v) for v in value) | |
| 378 | - return bool(value) | |
| 379 | - | |
| 380 | - | |
| 381 | -def _make_empty_analysis_result( | |
| 382 | - product: Dict[str, Any], | |
| 383 | - target_lang: str, | |
| 384 | - schema: AnalysisSchema, | |
| 385 | - error: Optional[str] = None, | |
| 386 | -) -> Dict[str, Any]: | |
| 387 | - result = { | |
| 388 | - "id": _get_product_id(product), | |
| 389 | - "lang": target_lang, | |
| 390 | - "title_input": str(product.get("title") or "").strip(), | |
| 391 | - } | |
| 392 | - for field in schema.result_fields: | |
| 393 | - result[field] = "" | |
| 394 | - if error: | |
| 395 | - result["error"] = error | |
| 396 | - return result | |
| 397 | - | |
| 398 | - | |
| 399 | -def _normalize_analysis_result( | |
| 400 | - result: Dict[str, Any], | |
| 401 | - product: Dict[str, Any], | |
| 402 | - target_lang: str, | |
| 403 | - schema: AnalysisSchema, | |
| 404 | -) -> Dict[str, Any]: | |
| 405 | - normalized = _make_empty_analysis_result(product, target_lang, schema) | |
| 406 | - if not isinstance(result, dict): | |
| 407 | - return normalized | |
| 408 | - | |
| 409 | - normalized["lang"] = str(result.get("lang") or target_lang).strip() or target_lang | |
| 410 | - normalized["title_input"] = str( | |
| 411 | - product.get("title") or result.get("title_input") or "" | |
| 412 | - ).strip() | |
| 413 | - | |
| 414 | - for field in schema.result_fields: | |
| 415 | - normalized[field] = str(_get_analysis_field_value(result, field, schema) or "").strip() | |
| 416 | - | |
| 417 | - if result.get("error"): | |
| 418 | - normalized["error"] = str(result.get("error")) | |
| 419 | - return normalized | |
| 420 | - | |
| 421 | - | |
| 422 | -def _has_meaningful_analysis_content(result: Dict[str, Any], schema: AnalysisSchema) -> bool: | |
| 423 | - return any(_has_meaningful_value(result.get(field)) for field in schema.meaningful_fields) | |
| 424 | - | |
| 425 | - | |
| 426 | -def _append_analysis_attributes( | |
| 427 | - target: List[Dict[str, Any]], | |
| 428 | - row: Dict[str, Any], | |
| 429 | - lang: str, | |
| 430 | - schema: AnalysisSchema, | |
| 431 | - field_map: Tuple[Tuple[str, str], ...], | |
| 432 | -) -> None: | |
| 433 | - for source_name, output_name in field_map: | |
| 434 | - raw = _get_analysis_field_value(row, source_name, schema) | |
| 435 | - if not raw: | |
| 436 | - continue | |
| 437 | - _append_named_lang_phrase_map( | |
| 438 | - target, | |
| 439 | - name=output_name, | |
| 440 | - lang=lang, | |
| 441 | - raw_value=raw, | |
| 442 | - ) | |
| 443 | - | |
| 444 | - | |
| 445 | -def _apply_index_content_row(result: Dict[str, Any], row: Dict[str, Any], lang: str) -> None: | |
| 446 | - if not row or row.get("error"): | |
| 447 | - return | |
| 448 | - | |
| 449 | - content_schema = _get_analysis_schema("content") | |
| 450 | - anchor_text = str(_get_analysis_field_value(row, "anchor_text", content_schema) or "").strip() | |
| 451 | - if anchor_text: | |
| 452 | - _append_lang_phrase_map(result["qanchors"], lang=lang, raw_value=anchor_text) | |
| 453 | - | |
| 454 | - for source_name, output_name in _CONTENT_ANALYSIS_ATTRIBUTE_FIELD_MAP: | |
| 455 | - raw = _get_analysis_field_value(row, source_name, content_schema) | |
| 456 | - if not raw: | |
| 457 | - continue | |
| 458 | - _append_named_lang_phrase_map( | |
| 459 | - result["enriched_attributes"], | |
| 460 | - name=output_name, | |
| 461 | - lang=lang, | |
| 462 | - raw_value=raw, | |
| 463 | - ) | |
| 464 | - if output_name == "enriched_tags": | |
| 465 | - _append_lang_phrase_map(result["enriched_tags"], lang=lang, raw_value=raw) | |
| 466 | - | |
| 467 | - | |
| 468 | -def _apply_index_taxonomy_row( | |
| 469 | - result: Dict[str, Any], | |
| 470 | - row: Dict[str, Any], | |
| 471 | - lang: str, | |
| 472 | - *, | |
| 473 | - category_taxonomy_profile: Optional[str] = None, | |
| 474 | -) -> None: | |
| 475 | - if not row or row.get("error"): | |
| 476 | - return | |
| 477 | - | |
| 478 | - _append_analysis_attributes( | |
| 479 | - result["enriched_taxonomy_attributes"], | |
| 480 | - row=row, | |
| 481 | - lang=lang, | |
| 482 | - schema=_get_analysis_schema( | |
| 483 | - "taxonomy", | |
| 484 | - category_taxonomy_profile=category_taxonomy_profile, | |
| 485 | - ), | |
| 486 | - field_map=_get_taxonomy_attribute_field_map(category_taxonomy_profile), | |
| 487 | - ) | |
| 488 | - | |
| 489 | - | |
| 490 | -def _normalize_index_content_item(item: Dict[str, Any]) -> Dict[str, str]: | |
| 491 | - item_id = _get_product_id(item) | |
| 492 | - return { | |
| 493 | - "id": item_id, | |
| 494 | - "title": str(item.get("title") or "").strip(), | |
| 495 | - "brief": str(item.get("brief") or "").strip(), | |
| 496 | - "description": str(item.get("description") or "").strip(), | |
| 497 | - "image_url": str(item.get("image_url") or "").strip(), | |
| 498 | - } | |
| 499 | - | |
| 500 | - | |
| 501 | -def build_index_content_fields( | |
| 502 | - items: List[Dict[str, Any]], | |
| 503 | - tenant_id: Optional[str] = None, | |
| 504 | - enrichment_scopes: Optional[List[str]] = None, | |
| 505 | - category_taxonomy_profile: Optional[str] = None, | |
| 506 | -) -> List[Dict[str, Any]]: | |
| 507 | - """ | |
| 508 | - High-level entry point: generate content-understanding fields aligned with the ES mapping. | |
| 509 | - | |
| 510 | - Each input item must include: | |
| 511 | - - `id` or `spu_id` | |
| 512 | - - `title` | |
| 513 | - - optional `brief` / `description` / `image_url` | |
| 514 | - - optional `enrichment_scopes`; defaults to running both `generic` and `category_taxonomy` | |
| 515 | - - optional `category_taxonomy_profile`, defaulting to `apparel` | |
| 516 | - | |
| 517 | - Each returned item has the structure: | |
| 518 | - - `id` | |
| 519 | - - `qanchors` | |
| 520 | - - `enriched_tags` | |
| 521 | - - `enriched_attributes` | |
| 522 | - - `enriched_taxonomy_attributes` | |
| 523 | - - 可选 `error` | |
| 524 | - | |
| 525 | - Where: | |
| 526 | - - `qanchors.{lang}` is an array of phrases | |
| 527 | - - `enriched_tags.{lang}` is an array of tags | |
| 528 | - """ | |
| 529 | - requested_enrichment_scopes = _normalize_enrichment_scopes(enrichment_scopes) | |
| 530 | - normalized_taxonomy_profile = _normalize_category_taxonomy_profile(category_taxonomy_profile) | |
| 531 | - normalized_items = [_normalize_index_content_item(item) for item in items] | |
| 532 | - if not normalized_items: | |
| 533 | - return [] | |
| 534 | - | |
| 535 | - results_by_id: Dict[str, Dict[str, Any]] = { | |
| 536 | - item["id"]: { | |
| 537 | - "id": item["id"], | |
| 538 | - "qanchors": {}, | |
| 539 | - "enriched_tags": {}, | |
| 540 | - "enriched_attributes": [], | |
| 541 | - "enriched_taxonomy_attributes": [], | |
| 542 | - } | |
| 543 | - for item in normalized_items | |
| 544 | - } | |
| 545 | - | |
| 546 | - for lang in _CORE_INDEX_LANGUAGES: | |
| 547 | - if "generic" in requested_enrichment_scopes: | |
| 548 | - try: | |
| 549 | - rows = analyze_products( | |
| 550 | - products=normalized_items, | |
| 551 | - target_lang=lang, | |
| 552 | - batch_size=BATCH_SIZE, | |
| 553 | - tenant_id=tenant_id, | |
| 554 | - analysis_kind="content", | |
| 555 | - category_taxonomy_profile=normalized_taxonomy_profile, | |
| 556 | - ) | |
| 557 | - except Exception as e: | |
| 558 | - logger.warning("build_index_content_fields content enrichment failed for lang=%s: %s", lang, e) | |
| 559 | - for item in normalized_items: | |
| 560 | - results_by_id[item["id"]].setdefault("error", str(e)) | |
| 561 | - continue | |
| 562 | - | |
| 563 | - for row in rows or []: | |
| 564 | - item_id = str(row.get("id") or "").strip() | |
| 565 | - if not item_id or item_id not in results_by_id: | |
| 566 | - continue | |
| 567 | - if row.get("error"): | |
| 568 | - results_by_id[item_id].setdefault("error", row["error"]) | |
| 569 | - continue | |
| 570 | - _apply_index_content_row(results_by_id[item_id], row=row, lang=lang) | |
| 571 | - | |
| 572 | - if "category_taxonomy" in requested_enrichment_scopes: | |
| 573 | - for lang in _CORE_INDEX_LANGUAGES: | |
| 574 | - try: | |
| 575 | - taxonomy_rows = analyze_products( | |
| 576 | - products=normalized_items, | |
| 577 | - target_lang=lang, | |
| 578 | - batch_size=BATCH_SIZE, | |
| 579 | - tenant_id=tenant_id, | |
| 580 | - analysis_kind="taxonomy", | |
| 581 | - category_taxonomy_profile=normalized_taxonomy_profile, | |
| 582 | - ) | |
| 583 | - except Exception as e: | |
| 584 | - logger.warning( | |
| 585 | - "build_index_content_fields taxonomy enrichment failed for profile=%s lang=%s: %s", | |
| 586 | - normalized_taxonomy_profile, | |
| 587 | - lang, | |
| 588 | - e, | |
| 589 | - ) | |
| 590 | - for item in normalized_items: | |
| 591 | - results_by_id[item["id"]].setdefault("error", str(e)) | |
| 592 | - continue | |
| 593 | - | |
| 594 | - for row in taxonomy_rows or []: | |
| 595 | - item_id = str(row.get("id") or "").strip() | |
| 596 | - if not item_id or item_id not in results_by_id: | |
| 597 | - continue | |
| 598 | - if row.get("error"): | |
| 599 | - results_by_id[item_id].setdefault("error", row["error"]) | |
| 600 | - continue | |
| 601 | - _apply_index_taxonomy_row( | |
| 602 | - results_by_id[item_id], | |
| 603 | - row=row, | |
| 604 | - lang=lang, | |
| 605 | - category_taxonomy_profile=normalized_taxonomy_profile, | |
| 606 | - ) | |
| 607 | - | |
| 608 | - return [results_by_id[item["id"]] for item in normalized_items] | |
| 609 | - | |
| 610 | - | |
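The removed `build_index_content_fields` pre-built an empty result skeleton per item and recorded only the first error via `setdefault`, so a later scope or language could not overwrite an earlier failure. A reduced, hypothetical sketch of that accumulation pattern (`enrich_fn` stands in for `analyze_products`; field names are simplified):

```python
from typing import Any, Callable, Dict, List

def build_results(
    items: List[Dict[str, Any]],
    langs: List[str],
    enrich_fn: Callable[[List[Dict[str, Any]], str], List[Dict[str, Any]]],
) -> List[Dict[str, Any]]:
    # Pre-build one skeleton per item, keyed by id, preserving input order
    results = {item["id"]: {"id": item["id"], "qanchors": {}} for item in items}
    for lang in langs:
        try:
            rows = enrich_fn(items, lang)
        except Exception as e:
            # setdefault keeps the first error; later failures do not overwrite it
            for item in items:
                results[item["id"]].setdefault("error", str(e))
            continue
        for row in rows:
            rid = str(row.get("id") or "").strip()
            if not rid or rid not in results:
                continue
            if row.get("error"):
                results[rid].setdefault("error", row["error"])
                continue
            results[rid]["qanchors"][lang] = row.get("anchors", [])
    return [results[item["id"]] for item in items]
```
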
| 611 | -def _normalize_space(text: str) -> str: | |
| 612 | - return re.sub(r"\s+", " ", (text or "").strip()) | |
| 613 | - | |
| 614 | - | |
| 615 | -def _contains_cjk(text: str) -> bool: | |
| 616 | - return bool(re.search(r"[\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff]", text or "")) | |
| 617 | - | |
| 618 | - | |
| 619 | -def _truncate_by_chars(text: str, max_chars: int) -> str: | |
| 620 | - return text[:max_chars].strip() | |
| 621 | - | |
| 622 | - | |
| 623 | -def _truncate_by_words(text: str, max_words: int) -> str: | |
| 624 | - words = re.findall(r"\S+", text or "") | |
| 625 | - return " ".join(words[:max_words]).strip() | |
| 626 | - | |
| 627 | - | |
| 628 | -def _detect_prompt_input_lang(text: str) -> str: | |
| 629 | - # Simplified heuristic: treat text containing CJK as Chinese-like; otherwise treat it as a space-delimited language. | |
| 630 | - return "zh" if _contains_cjk(text) else "en" | |
| 631 | - | |
| 632 | - | |
| 633 | -def _build_prompt_input_text(product: Dict[str, Any]) -> str: | |
| 634 | - """ | |
| 635 | - Build the product text that is actually fed into the prompt. | |
| 636 | - | |
| 637 | - Rules: | |
| 638 | - - Use title by default | |
| 639 | - - If the text is too short, append brief / description in order | |
| 640 | - - If the text is too long, truncate coarsely by language | |
| 641 | - """ | |
| 642 | - fields = [ | |
| 643 | - _normalize_space(str(product.get("title") or "")), | |
| 644 | - _normalize_space(str(product.get("brief") or "")), | |
| 645 | - _normalize_space(str(product.get("description") or "")), | |
| 646 | - ] | |
| 647 | - parts: List[str] = [] | |
| 648 | - | |
| 649 | - def join_parts() -> str: | |
| 650 | - return " | ".join(part for part in parts if part).strip() | |
| 651 | - | |
| 652 | - for field in fields: | |
| 653 | - if not field: | |
| 654 | - continue | |
| 655 | - if field not in parts: | |
| 656 | - parts.append(field) | |
| 657 | - candidate = join_parts() | |
| 658 | - if _detect_prompt_input_lang(candidate) == "zh": | |
| 659 | - if len(candidate) >= PROMPT_INPUT_MIN_ZH_CHARS: | |
| 660 | - return _truncate_by_chars(candidate, PROMPT_INPUT_MAX_ZH_CHARS) | |
| 661 | - else: | |
| 662 | - if len(re.findall(r"\S+", candidate)) >= PROMPT_INPUT_MIN_WORDS: | |
| 663 | - return _truncate_by_words(candidate, PROMPT_INPUT_MAX_WORDS) | |
| 664 | - | |
| 665 | - candidate = join_parts() | |
| 666 | - if not candidate: | |
| 667 | - return "" | |
| 668 | - if _detect_prompt_input_lang(candidate) == "zh": | |
| 669 | - return _truncate_by_chars(candidate, PROMPT_INPUT_MAX_ZH_CHARS) | |
| 670 | - return _truncate_by_words(candidate, PROMPT_INPUT_MAX_WORDS) | |
| 671 | - | |
| 672 | - | |
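The helpers above split by script: CJK-containing text is measured and truncated in characters, everything else in whitespace-separated words. A minimal standalone sketch (the limits here are illustrative placeholders, not the removed module's `PROMPT_INPUT_*` constants):

```python
import re

def contains_cjk(text: str) -> bool:
    # Same CJK Unified Ideographs ranges the removed _contains_cjk matched
    return bool(re.search(r"[\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff]", text or ""))

def truncate_for_prompt(text: str, max_zh_chars: int = 4, max_words: int = 3) -> str:
    # CJK text truncates by characters; space-delimited text by words
    if contains_cjk(text):
        return text[:max_zh_chars].strip()
    words = re.findall(r"\S+", text or "")
    return " ".join(words[:max_words]).strip()
```
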
| 673 | -def _make_analysis_cache_key( | |
| 674 | - product: Dict[str, Any], | |
| 675 | - target_lang: str, | |
| 676 | - analysis_kind: str, | |
| 677 | - category_taxonomy_profile: Optional[str] = None, | |
| 678 | -) -> str: | |
| 679 | - """Build the cache key; it depends only on the analysis kind, the actual prompt input text, and the target language.""" | |
| 680 | - schema = _get_analysis_schema( | |
| 681 | - analysis_kind, | |
| 682 | - category_taxonomy_profile=category_taxonomy_profile, | |
| 683 | - ) | |
| 684 | - prompt_input = _build_prompt_input_text(product) | |
| 685 | - h = hashlib.md5(prompt_input.encode("utf-8")).hexdigest() | |
| 686 | - prompt_contract = { | |
| 687 | - "schema_name": schema.name, | |
| 688 | - "cache_version": schema.cache_version, | |
| 689 | - "system_message": SYSTEM_MESSAGE, | |
| 690 | - "user_instruction_template": USER_INSTRUCTION_TEMPLATE, | |
| 691 | - "shared_instruction": schema.shared_instruction, | |
| 692 | - "assistant_headers": schema.get_headers(target_lang), | |
| 693 | - "result_fields": schema.result_fields, | |
| 694 | - "meaningful_fields": schema.meaningful_fields, | |
| 695 | - "field_aliases": schema.field_aliases, | |
| 696 | - } | |
| 697 | - prompt_contract_hash = hashlib.md5( | |
| 698 | - json.dumps(prompt_contract, ensure_ascii=False, sort_keys=True).encode("utf-8") | |
| 699 | - ).hexdigest()[:12] | |
| 700 | - return ( | |
| 701 | - f"{ANCHOR_CACHE_PREFIX}:{analysis_kind}:{prompt_contract_hash}:" | |
| 702 | - f"{target_lang}:{prompt_input[:4]}{h}" | |
| 703 | - ) | |
| 704 | - | |
| 705 | - | |
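The key above mixes an md5 of the prompt input with a short hash of the full "prompt contract", so any change to the prompt template or schema rotates the key space and invalidates stale cache entries. A simplified sketch (prefix and contract fields here are placeholders):

```python
import hashlib
import json

def make_analysis_cache_key(prefix: str, kind: str, lang: str,
                            prompt_input: str, contract: dict) -> str:
    # Hash of the text that actually goes into the prompt
    input_hash = hashlib.md5(prompt_input.encode("utf-8")).hexdigest()
    # Short hash of the prompt contract: bumping any field changes all keys
    contract_hash = hashlib.md5(
        json.dumps(contract, ensure_ascii=False, sort_keys=True).encode("utf-8")
    ).hexdigest()[:12]
    return f"{prefix}:{kind}:{contract_hash}:{lang}:{prompt_input[:4]}{input_hash}"
```
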
| 706 | -def _make_anchor_cache_key( | |
| 707 | - product: Dict[str, Any], | |
| 708 | - target_lang: str, | |
| 709 | -) -> str: | |
| 710 | - return _make_analysis_cache_key(product, target_lang, analysis_kind="content") | |
| 711 | - | |
| 712 | - | |
| 713 | -def _get_cached_analysis_result( | |
| 714 | - product: Dict[str, Any], | |
| 715 | - target_lang: str, | |
| 716 | - analysis_kind: str, | |
| 717 | - category_taxonomy_profile: Optional[str] = None, | |
| 718 | -) -> Optional[Dict[str, Any]]: | |
| 719 | - if not _anchor_redis: | |
| 720 | - return None | |
| 721 | - schema = _get_analysis_schema( | |
| 722 | - analysis_kind, | |
| 723 | - category_taxonomy_profile=category_taxonomy_profile, | |
| 724 | - ) | |
| 725 | - try: | |
| 726 | - key = _make_analysis_cache_key( | |
| 727 | - product, | |
| 728 | - target_lang, | |
| 729 | - analysis_kind, | |
| 730 | - category_taxonomy_profile=category_taxonomy_profile, | |
| 731 | - ) | |
| 732 | - raw = _anchor_redis.get(key) | |
| 733 | - if not raw: | |
| 734 | - return None | |
| 735 | - result = _normalize_analysis_result( | |
| 736 | - json.loads(raw), | |
| 737 | - product=product, | |
| 738 | - target_lang=target_lang, | |
| 739 | - schema=schema, | |
| 740 | - ) | |
| 741 | - if not _has_meaningful_analysis_content(result, schema): | |
| 742 | - return None | |
| 743 | - return result | |
| 744 | - except Exception as e: | |
| 745 | - logger.warning("Failed to get %s analysis cache: %s", analysis_kind, e) | |
| 746 | - return None | |
| 747 | - | |
| 748 | - | |
| 749 | -def _get_cached_anchor_result( | |
| 750 | - product: Dict[str, Any], | |
| 751 | - target_lang: str, | |
| 752 | -) -> Optional[Dict[str, Any]]: | |
| 753 | - return _get_cached_analysis_result(product, target_lang, analysis_kind="content") | |
| 754 | - | |
| 755 | - | |
| 756 | -def _set_cached_analysis_result( | |
| 757 | - product: Dict[str, Any], | |
| 758 | - target_lang: str, | |
| 759 | - result: Dict[str, Any], | |
| 760 | - analysis_kind: str, | |
| 761 | - category_taxonomy_profile: Optional[str] = None, | |
| 762 | -) -> None: | |
| 763 | - if not _anchor_redis: | |
| 764 | - return | |
| 765 | - schema = _get_analysis_schema( | |
| 766 | - analysis_kind, | |
| 767 | - category_taxonomy_profile=category_taxonomy_profile, | |
| 768 | - ) | |
| 769 | - try: | |
| 770 | - normalized = _normalize_analysis_result( | |
| 771 | - result, | |
| 772 | - product=product, | |
| 773 | - target_lang=target_lang, | |
| 774 | - schema=schema, | |
| 775 | - ) | |
| 776 | - if not _has_meaningful_analysis_content(normalized, schema): | |
| 777 | - return | |
| 778 | - key = _make_analysis_cache_key( | |
| 779 | - product, | |
| 780 | - target_lang, | |
| 781 | - analysis_kind, | |
| 782 | - category_taxonomy_profile=category_taxonomy_profile, | |
| 783 | - ) | |
| 784 | - ttl = ANCHOR_CACHE_EXPIRE_DAYS * 24 * 3600 | |
| 785 | - _anchor_redis.setex(key, ttl, json.dumps(normalized, ensure_ascii=False)) | |
| 786 | - except Exception as e: | |
| 787 | - logger.warning("Failed to set %s analysis cache: %s", analysis_kind, e) | |
| 788 | - | |
| 789 | - | |
| 790 | -def _set_cached_anchor_result( | |
| 791 | - product: Dict[str, Any], | |
| 792 | - target_lang: str, | |
| 793 | - result: Dict[str, Any], | |
| 794 | -) -> None: | |
| 795 | - _set_cached_analysis_result(product, target_lang, result, analysis_kind="content") | |
| 796 | - | |
| 797 | - | |
| 798 | -def _build_assistant_prefix(headers: List[str]) -> str: | |
| 799 | - header_line = "| " + " | ".join(headers) + " |" | |
| 800 | - separator_line = "|" + "----|" * len(headers) | |
| 801 | - return f"{header_line}\n{separator_line}\n" | |
| 802 | - | |
| 803 | - | |
| 804 | -def _build_shared_context(products: List[Dict[str, str]], schema: AnalysisSchema) -> str: | |
| 805 | - shared_context = schema.shared_instruction | |
| 806 | - for idx, product in enumerate(products, 1): | |
| 807 | - prompt_input = _build_prompt_input_text(product) | |
| 808 | - shared_context += f"{idx}. {prompt_input}\n" | |
| 809 | - return shared_context | |
| 810 | - | |
| 811 | - | |
| 812 | -def _hash_text(text: str) -> str: | |
| 813 | - return hashlib.md5((text or "").encode("utf-8")).hexdigest()[:12] | |
| 814 | - | |
| 815 | - | |
| 816 | -def _mark_shared_context_logged_once(shared_context_key: str) -> bool: | |
| 817 | - with _logged_shared_context_lock: | |
| 818 | - if shared_context_key in _logged_shared_context_keys: | |
| 819 | - _logged_shared_context_keys.move_to_end(shared_context_key) | |
| 820 | - return False | |
| 821 | - | |
| 822 | - _logged_shared_context_keys[shared_context_key] = None | |
| 823 | - if len(_logged_shared_context_keys) > LOGGED_SHARED_CONTEXT_CACHE_SIZE: | |
| 824 | - _logged_shared_context_keys.popitem(last=False) | |
| 825 | - return True | |
| 826 | - | |
| 827 | - | |
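The log-once gate above is a small bounded LRU keyed by shared-context hash: repeats refresh recency and return False, and the oldest key is evicted past the size cap. Roughly equivalent as a standalone class:

```python
from collections import OrderedDict

class LogOnce:
    """Log-once gate with a bounded LRU of seen keys (mirrors the removed helper)."""

    def __init__(self, max_size: int = 4):
        self._seen = OrderedDict()
        self._max = max_size

    def mark(self, key: str) -> bool:
        # True only the first time a key is seen; repeats refresh recency
        if key in self._seen:
            self._seen.move_to_end(key)
            return False
        self._seen[key] = None
        if len(self._seen) > self._max:
            self._seen.popitem(last=False)  # evict the least-recently-used key
        return True
```
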
| 828 | -def reset_logged_shared_context_keys() -> None: | |
| 829 | - """Test helper: clear the recorded shared prompt keys.""" | |
| 830 | - with _logged_shared_context_lock: | |
| 831 | - _logged_shared_context_keys.clear() | |
| 832 | - | |
| 833 | - | |
| 834 | -def create_prompt( | |
| 835 | - products: List[Dict[str, str]], | |
| 836 | - target_lang: str = "zh", | |
| 837 | - analysis_kind: str = "content", | |
| 838 | - category_taxonomy_profile: Optional[str] = None, | |
| 839 | -) -> Tuple[Optional[str], Optional[str], Optional[str]]: | |
| 840 | - """Create the shared context, localized output requirement, and Partial Mode assistant prefix for the target language.""" | |
| 841 | - schema = _get_analysis_schema( | |
| 842 | - analysis_kind, | |
| 843 | - category_taxonomy_profile=category_taxonomy_profile, | |
| 844 | - ) | |
| 845 | - markdown_table_headers = schema.get_headers(target_lang) | |
| 846 | - if not markdown_table_headers: | |
| 847 | - logger.warning( | |
| 848 | - "Unsupported target_lang for markdown table headers: kind=%s lang=%s", | |
| 849 | - analysis_kind, | |
| 850 | - target_lang, | |
| 851 | - ) | |
| 852 | - return None, None, None | |
| 853 | - shared_context = _build_shared_context(products, schema) | |
| 854 | - language_label = SOURCE_LANG_CODE_MAP.get(target_lang, target_lang) | |
| 855 | - user_prompt = USER_INSTRUCTION_TEMPLATE.format(language=language_label).strip() | |
| 856 | - assistant_prefix = _build_assistant_prefix(markdown_table_headers) | |
| 857 | - return shared_context, user_prompt, assistant_prefix | |
| 858 | - | |
| 859 | - | |
| 860 | -def _merge_partial_response(assistant_prefix: str, generated_content: str) -> str: | |
| 861 | - """Join the Partial Mode assistant prefix with the completion text into a full markdown document.""" | |
| 862 | - generated = (generated_content or "").lstrip() | |
| 863 | - prefix_lines = [line.strip() for line in assistant_prefix.strip().splitlines()] | |
| 864 | - generated_lines = generated.splitlines() | |
| 865 | - | |
| 866 | - if generated_lines: | |
| 867 | - first_line = generated_lines[0].strip() | |
| 868 | - if prefix_lines and first_line == prefix_lines[0]: | |
| 869 | - generated_lines = generated_lines[1:] | |
| 870 | - if generated_lines and len(prefix_lines) > 1 and generated_lines[0].strip() == prefix_lines[1]: | |
| 871 | - generated_lines = generated_lines[1:] | |
| 872 | - elif len(prefix_lines) > 1 and first_line == prefix_lines[1]: | |
| 873 | - generated_lines = generated_lines[1:] | |
| 874 | - | |
| 875 | - suffix = "\n".join(generated_lines).lstrip("\n") | |
| 876 | - if suffix: | |
| 877 | - return f"{assistant_prefix}{suffix}" | |
| 878 | - return assistant_prefix | |
| 879 | - | |
| 880 | - | |
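Together with `_build_assistant_prefix`, `_merge_partial_response` implements the Partial Mode round-trip: the table header is pre-seeded as an assistant prefix, and any header or separator lines the model echoes back are stripped before concatenation. A condensed sketch of both:

```python
def build_assistant_prefix(headers):
    header_line = "| " + " | ".join(headers) + " |"
    separator_line = "|" + "----|" * len(headers)
    return f"{header_line}\n{separator_line}\n"

def merge_partial_response(prefix: str, generated: str) -> str:
    prefix_lines = [l.strip() for l in prefix.strip().splitlines()]
    lines = (generated or "").lstrip().splitlines()
    # Drop header/separator lines the model echoed back from the prefix
    while lines and prefix_lines and lines[0].strip() in prefix_lines:
        prefix_lines = prefix_lines[prefix_lines.index(lines[0].strip()) + 1:]
        lines = lines[1:]
    suffix = "\n".join(lines).lstrip("\n")
    return f"{prefix}{suffix}" if suffix else prefix
```
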
| 881 | -def call_llm( | |
| 882 | - shared_context: str, | |
| 883 | - user_prompt: str, | |
| 884 | - assistant_prefix: str, | |
| 885 | - target_lang: str = "zh", | |
| 886 | - analysis_kind: str = "content", | |
| 887 | -) -> Tuple[str, str]: | |
| 888 | - """Call the LLM API (with retries), using Partial Mode to force the markdown table prefix.""" | |
| 889 | - headers = { | |
| 890 | - "Authorization": f"Bearer {API_KEY}", | |
| 891 | - "Content-Type": "application/json", | |
| 892 | - } | |
| 893 | - shared_context_key = _hash_text(shared_context) | |
| 894 | - localized_tail_key = _hash_text(f"{target_lang}\n{user_prompt}\n{assistant_prefix}") | |
| 895 | - combined_user_prompt = f"{shared_context.rstrip()}\n\n{user_prompt.strip()}" | |
| 896 | - | |
| 897 | - payload = { | |
| 898 | - "model": MODEL_NAME, | |
| 899 | - "messages": [ | |
| 900 | - { | |
| 901 | - "role": "system", | |
| 902 | - "content": SYSTEM_MESSAGE, | |
| 903 | - }, | |
| 904 | - { | |
| 905 | - "role": "user", | |
| 906 | - "content": combined_user_prompt, | |
| 907 | - }, | |
| 908 | - { | |
| 909 | - "role": "assistant", | |
| 910 | - "content": assistant_prefix, | |
| 911 | - "partial": True, | |
| 912 | - }, | |
| 913 | - ], | |
| 914 | - "temperature": 0.3, | |
| 915 | - "top_p": 0.8, | |
| 916 | - } | |
| 917 | - | |
| 918 | - request_data = { | |
| 919 | - "headers": {k: v for k, v in headers.items() if k != "Authorization"}, | |
| 920 | - "payload": payload, | |
| 921 | - } | |
| 922 | - | |
| 923 | - if _mark_shared_context_logged_once(shared_context_key): | |
| 924 | - logger.info(f"\n{'=' * 80}") | |
| 925 | - logger.info( | |
| 926 | - "LLM Shared Context [model=%s, kind=%s, shared_key=%s, chars=%s] (logged once per process key)", | |
| 927 | - MODEL_NAME, | |
| 928 | - analysis_kind, | |
| 929 | - shared_context_key, | |
| 930 | - len(shared_context), | |
| 931 | - ) | |
| 932 | - logger.info("\nSystem Message:\n%s", SYSTEM_MESSAGE) | |
| 933 | - logger.info("\nShared Context:\n%s", shared_context) | |
| 934 | - | |
| 935 | - verbose_logger.info(f"\n{'=' * 80}") | |
| 936 | - verbose_logger.info( | |
| 937 | - "LLM Request [model=%s, kind=%s, lang=%s, shared_key=%s, tail_key=%s]:", | |
| 938 | - MODEL_NAME, | |
| 939 | - analysis_kind, | |
| 940 | - target_lang, | |
| 941 | - shared_context_key, | |
| 942 | - localized_tail_key, | |
| 943 | - ) | |
| 944 | - verbose_logger.info(json.dumps(request_data, ensure_ascii=False, indent=2)) | |
| 945 | - verbose_logger.info(f"\nCombined User Prompt:\n{combined_user_prompt}") | |
| 946 | - verbose_logger.info(f"\nShared Context:\n{shared_context}") | |
| 947 | - verbose_logger.info(f"\nLocalized Requirement:\n{user_prompt}") | |
| 948 | - verbose_logger.info(f"\nAssistant Prefix:\n{assistant_prefix}") | |
| 949 | - | |
| 950 | - logger.info( | |
| 951 | - "\nLLM Request Variant [kind=%s, lang=%s, shared_key=%s, tail_key=%s, prompt_chars=%s, prefix_chars=%s]", | |
| 952 | - analysis_kind, | |
| 953 | - target_lang, | |
| 954 | - shared_context_key, | |
| 955 | - localized_tail_key, | |
| 956 | - len(user_prompt), | |
| 957 | - len(assistant_prefix), | |
| 958 | - ) | |
| 959 | - logger.info("\nLocalized Requirement:\n%s", user_prompt) | |
| 960 | - logger.info("\nAssistant Prefix:\n%s", assistant_prefix) | |
| 961 | - | |
| 962 | - # Create a session with proxies disabled | |
| 963 | - session = requests.Session() | |
| 964 | - session.trust_env = False # ignore the system proxy settings | |
| 965 | - | |
| 966 | - try: | |
| 967 | - # Retry loop | |
| 968 | - for attempt in range(MAX_RETRIES): | |
| 969 | - try: | |
| 970 | - response = session.post( | |
| 971 | - f"{API_BASE_URL}/chat/completions", | |
| 972 | - headers=headers, | |
| 973 | - json=payload, | |
| 974 | - timeout=REQUEST_TIMEOUT, | |
| 975 | - proxies={"http": None, "https": None}, # explicitly disable proxies | |
| 976 | - ) | |
| 977 | - | |
| 978 | - response.raise_for_status() | |
| 979 | - result = response.json() | |
| 980 | - usage = result.get("usage") or {} | |
| 981 | - | |
| 982 | - verbose_logger.info( | |
| 983 | - "\nLLM Response [model=%s, kind=%s, lang=%s, shared_key=%s, tail_key=%s]:", | |
| 984 | - MODEL_NAME, | |
| 985 | - analysis_kind, | |
| 986 | - target_lang, | |
| 987 | - shared_context_key, | |
| 988 | - localized_tail_key, | |
| 989 | - ) | |
| 990 | - verbose_logger.info(json.dumps(result, ensure_ascii=False, indent=2)) | |
| 991 | - | |
| 992 | - generated_content = result["choices"][0]["message"]["content"] | |
| 993 | - full_markdown = _merge_partial_response(assistant_prefix, generated_content) | |
| 994 | - | |
| 995 | - logger.info( | |
| 996 | - "\nLLM Response Summary [kind=%s, lang=%s, shared_key=%s, tail_key=%s, generated_chars=%s, completion_tokens=%s, prompt_tokens=%s, total_tokens=%s]", | |
| 997 | - analysis_kind, | |
| 998 | - target_lang, | |
| 999 | - shared_context_key, | |
| 1000 | - localized_tail_key, | |
| 1001 | - len(generated_content or ""), | |
| 1002 | - usage.get("completion_tokens"), | |
| 1003 | - usage.get("prompt_tokens"), | |
| 1004 | - usage.get("total_tokens"), | |
| 1005 | - ) | |
| 1006 | - logger.info("\nGenerated Content:\n%s", generated_content) | |
| 1007 | - logger.info("\nMerged Markdown:\n%s", full_markdown) | |
| 1008 | - | |
| 1009 | - verbose_logger.info(f"\nGenerated Content:\n{generated_content}") | |
| 1010 | - verbose_logger.info(f"\nMerged Markdown:\n{full_markdown}") | |
| 1011 | - | |
| 1012 | - return full_markdown, json.dumps(result, ensure_ascii=False) | |
| 1013 | - | |
| 1014 | - except requests.exceptions.ProxyError as e: | |
| 1015 | - logger.warning(f"Attempt {attempt + 1}/{MAX_RETRIES}: Proxy error - {str(e)}") | |
| 1016 | - if attempt < MAX_RETRIES - 1: | |
| 1017 | - logger.info(f"Retrying in {RETRY_DELAY} seconds...") | |
| 1018 | - time.sleep(RETRY_DELAY) | |
| 1019 | - else: | |
| 1020 | - raise | |
| 1021 | - | |
| 1022 | - except requests.exceptions.RequestException as e: | |
| 1023 | - logger.warning(f"Attempt {attempt + 1}/{MAX_RETRIES}: Request error - {str(e)}") | |
| 1024 | - if attempt < MAX_RETRIES - 1: | |
| 1025 | - logger.info(f"Retrying in {RETRY_DELAY} seconds...") | |
| 1026 | - time.sleep(RETRY_DELAY) | |
| 1027 | - else: | |
| 1028 | - raise | |
| 1029 | - | |
| 1030 | - except Exception as e: | |
| 1031 | - logger.error(f"Unexpected error on attempt {attempt + 1}/{MAX_RETRIES}: {str(e)}") | |
| 1032 | - if attempt < MAX_RETRIES - 1: | |
| 1033 | - logger.info(f"Retrying in {RETRY_DELAY} seconds...") | |
| 1034 | - time.sleep(RETRY_DELAY) | |
| 1035 | - else: | |
| 1036 | - raise | |
| 1037 | - | |
| 1038 | - finally: | |
| 1039 | - session.close() | |
| 1040 | - | |
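The retry structure inside `call_llm` is the standard bounded-retry loop: catch, sleep, retry, and re-raise on the final attempt. Factored out as a generic helper (a sketch only; the removed code additionally distinguished proxy, request, and unexpected errors):

```python
import time

def call_with_retries(fn, max_retries: int = 3, retry_delay: float = 0.0):
    """Run fn(); on failure sleep and retry, re-raising after the last attempt."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(retry_delay)

# Usage: a callable that fails twice before succeeding
attempts = []

def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("transient")
    return "ok"
```
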
| 1041 | - | |
| 1042 | -def parse_markdown_table( | |
| 1043 | - markdown_content: str, | |
| 1044 | - analysis_kind: str = "content", | |
| 1045 | - category_taxonomy_profile: Optional[str] = None, | |
| 1046 | -) -> List[Dict[str, str]]: | |
| 1047 | - """Parse markdown table content""" | |
| 1048 | - schema = _get_analysis_schema( | |
| 1049 | - analysis_kind, | |
| 1050 | - category_taxonomy_profile=category_taxonomy_profile, | |
| 1051 | - ) | |
| 1052 | - lines = markdown_content.strip().split("\n") | |
| 1053 | - data = [] | |
| 1054 | - data_started = False | |
| 1055 | - | |
| 1056 | - for line in lines: | |
| 1057 | - line = line.strip() | |
| 1058 | - if not line: | |
| 1059 | - continue | |
| 1060 | - | |
| 1061 | - # Handle table rows | |
| 1062 | - if line.startswith("|"): | |
| 1063 | - # Separator row (---- or :---: etc.; spaces allowed, e.g. "| ---- | ---- |") | |
| 1064 | - sep_chars = line.replace("|", "").strip().replace(" ", "") | |
| 1065 | - if sep_chars and set(sep_chars) <= {"-", ":"}: | |
| 1066 | - data_started = True | |
| 1067 | - continue | |
| 1068 | - | |
| 1069 | - # First header row: always skip it, regardless of language | |
| 1070 | - if not data_started: | |
| 1071 | - # Wait for the next data row | |
| 1072 | - continue | |
| 1073 | - | |
| 1074 | - # Parse data rows | |
| 1075 | - parts = [p.strip() for p in line.split("|")] | |
| 1076 | - if parts and parts[0] == "": | |
| 1077 | - parts = parts[1:] | |
| 1078 | - if parts and parts[-1] == "": | |
| 1079 | - parts = parts[:-1] | |
| 1080 | - | |
| 1081 | - if len(parts) >= 2: | |
| 1082 | - row = {"seq_no": parts[0]} | |
| 1083 | - for field_index, field_name in enumerate(schema.result_fields, start=1): | |
| 1084 | - cell = parts[field_index] if len(parts) > field_index else "" | |
| 1085 | - row[field_name] = _normalize_markdown_table_cell(cell) | |
| 1086 | - data.append(row) | |
| 1087 | - | |
| 1088 | - return data | |
| 1089 | - | |
| 1090 | - | |
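`parse_markdown_table` keys off the separator row (`|----|`) rather than the header text, so it works for any header language. A simplified version that assumes each row is pipe-wrapped on both sides:

```python
def parse_markdown_table(markdown: str, result_fields):
    """Parse '| seq | field... |' rows, skipping everything before the separator row."""
    rows = []
    data_started = False
    for line in markdown.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        sep = line.replace("|", "").replace(" ", "")
        if sep and set(sep) <= {"-", ":"}:
            data_started = True  # separator row: data rows follow
            continue
        if not data_started:
            continue  # still in the header row
        parts = [p.strip() for p in line.split("|")][1:-1]
        if len(parts) >= 2:
            row = {"seq_no": parts[0]}
            for i, field in enumerate(result_fields, start=1):
                row[field] = parts[i] if len(parts) > i else ""
            rows.append(row)
    return rows
```
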
| 1091 | -def _log_parsed_result_quality( | |
| 1092 | - batch_data: List[Dict[str, str]], | |
| 1093 | - parsed_results: List[Dict[str, str]], | |
| 1094 | - target_lang: str, | |
| 1095 | - batch_num: int, | |
| 1096 | - analysis_kind: str, | |
| 1097 | - category_taxonomy_profile: Optional[str] = None, | |
| 1098 | -) -> None: | |
| 1099 | - schema = _get_analysis_schema( | |
| 1100 | - analysis_kind, | |
| 1101 | - category_taxonomy_profile=category_taxonomy_profile, | |
| 1102 | - ) | |
| 1103 | - expected = len(batch_data) | |
| 1104 | - actual = len(parsed_results) | |
| 1105 | - if actual != expected: | |
| 1106 | - logger.warning( | |
| 1107 | - "Parsed row count mismatch for kind=%s batch=%s lang=%s: expected=%s actual=%s", | |
| 1108 | - analysis_kind, | |
| 1109 | - batch_num, | |
| 1110 | - target_lang, | |
| 1111 | - expected, | |
| 1112 | - actual, | |
| 1113 | - ) | |
| 1114 | - | |
| 1115 | - if not schema.quality_fields: | |
| 1116 | - logger.info( | |
| 1117 | - "Parsed Quality Summary [kind=%s, batch=%s, lang=%s]: rows=%s/%s", | |
| 1118 | - analysis_kind, | |
| 1119 | - batch_num, | |
| 1120 | - target_lang, | |
| 1121 | - actual, | |
| 1122 | - expected, | |
| 1123 | - ) | |
| 1124 | - return | |
| 1125 | - | |
| 1126 | - missing_summary = ", ".join( | |
| 1127 | - f"missing_{field}=" | |
| 1128 | - f"{sum(1 for item in parsed_results if not str(item.get(field) or '').strip())}" | |
| 1129 | - for field in schema.quality_fields | |
| 1130 | - ) | |
| 1131 | - logger.info( | |
| 1132 | - "Parsed Quality Summary [kind=%s, batch=%s, lang=%s]: rows=%s/%s, %s", | |
| 1133 | - analysis_kind, | |
| 1134 | - batch_num, | |
| 1135 | - target_lang, | |
| 1136 | - actual, | |
| 1137 | - expected, | |
| 1138 | - missing_summary, | |
| 1139 | - ) | |
| 1140 | - | |
| 1141 | - | |
| 1142 | -def process_batch( | |
| 1143 | - batch_data: List[Dict[str, str]], | |
| 1144 | - batch_num: int, | |
| 1145 | - target_lang: str = "zh", | |
| 1146 | - analysis_kind: str = "content", | |
| 1147 | - category_taxonomy_profile: Optional[str] = None, | |
| 1148 | -) -> List[Dict[str, Any]]: | |
| 1149 | - """Process one batch of data""" | |
| 1150 | - schema = _get_analysis_schema( | |
| 1151 | - analysis_kind, | |
| 1152 | - category_taxonomy_profile=category_taxonomy_profile, | |
| 1153 | - ) | |
| 1154 | - logger.info(f"\n{'#' * 80}") | |
| 1155 | - logger.info( | |
| 1156 | - "Processing Batch %s (%s items, kind=%s)", | |
| 1157 | - batch_num, | |
| 1158 | - len(batch_data), | |
| 1159 | - analysis_kind, | |
| 1160 | - ) | |
| 1161 | - | |
| 1162 | - # Build the prompt | |
| 1163 | - shared_context, user_prompt, assistant_prefix = create_prompt( | |
| 1164 | - batch_data, | |
| 1165 | - target_lang=target_lang, | |
| 1166 | - analysis_kind=analysis_kind, | |
| 1167 | - category_taxonomy_profile=category_taxonomy_profile, | |
| 1168 | - ) | |
| 1169 | - | |
| 1170 | - # If prompt creation failed (e.g. unsupported target_lang), fail the whole batch without calling the LLM | |
| 1171 | - if shared_context is None or user_prompt is None or assistant_prefix is None: | |
| 1172 | - logger.error( | |
| 1173 | - "Failed to create prompt for batch %s, kind=%s, target_lang=%s; " | |
| 1174 | - "marking entire batch as failed without calling LLM", | |
| 1175 | - batch_num, | |
| 1176 | - analysis_kind, | |
| 1177 | - target_lang, | |
| 1178 | - ) | |
| 1179 | - return [ | |
| 1180 | - _make_empty_analysis_result( | |
| 1181 | - item, | |
| 1182 | - target_lang, | |
| 1183 | - schema, | |
| 1184 | - error=f"prompt_creation_failed: unsupported target_lang={target_lang}", | |
| 1185 | - ) | |
| 1186 | - for item in batch_data | |
| 1187 | - ] | |
| 1188 | - | |
| 1189 | - # Call the LLM | |
| 1190 | - try: | |
| 1191 | - raw_response, full_response_json = call_llm( | |
| 1192 | - shared_context, | |
| 1193 | - user_prompt, | |
| 1194 | - assistant_prefix, | |
| 1195 | - target_lang=target_lang, | |
| 1196 | - analysis_kind=analysis_kind, | |
| 1197 | - ) | |
| 1198 | - | |
| 1199 | - # Parse the results | |
| 1200 | - parsed_results = parse_markdown_table( | |
| 1201 | - raw_response, | |
| 1202 | - analysis_kind=analysis_kind, | |
| 1203 | - category_taxonomy_profile=category_taxonomy_profile, | |
| 1204 | - ) | |
| 1205 | - _log_parsed_result_quality( | |
| 1206 | - batch_data, | |
| 1207 | - parsed_results, | |
| 1208 | - target_lang, | |
| 1209 | - batch_num, | |
| 1210 | - analysis_kind, | |
| 1211 | - category_taxonomy_profile, | |
| 1212 | - ) | |
| 1213 | - | |
| 1214 | - logger.info(f"\nParsed Results ({len(parsed_results)} items):") | |
| 1215 | - logger.info(json.dumps(parsed_results, ensure_ascii=False, indent=2)) | |
| 1216 | - | |
| 1217 | - # Map back to the original IDs | |
| 1218 | - results_with_ids = [] | |
| 1219 | - for i, parsed_item in enumerate(parsed_results): | |
| 1220 | - if i < len(batch_data): | |
| 1221 | - source_product = batch_data[i] | |
| 1222 | - result = _normalize_analysis_result( | |
| 1223 | - parsed_item, | |
| 1224 | - product=source_product, | |
| 1225 | - target_lang=target_lang, | |
| 1226 | - schema=schema, | |
| 1227 | - ) | |
| 1228 | - results_with_ids.append(result) | |
| 1229 | - logger.info( | |
| 1230 | - "Mapped: kind=%s seq=%s -> original_id=%s", | |
| 1231 | - analysis_kind, | |
| 1232 | - parsed_item.get("seq_no"), | |
| 1233 | - source_product.get("id"), | |
| 1234 | - ) | |
| 1235 | - | |
| 1236 | - # Save the batch JSON log to a separate file | |
| 1237 | - batch_log = { | |
| 1238 | - "batch_num": batch_num, | |
| 1239 | - "analysis_kind": analysis_kind, | |
| 1240 | - "timestamp": datetime.now().isoformat(), | |
| 1241 | - "input_products": batch_data, | |
| 1242 | - "raw_response": raw_response, | |
| 1243 | - "full_response_json": full_response_json, | |
| 1244 | - "parsed_results": parsed_results, | |
| 1245 | - "final_results": results_with_ids, | |
| 1246 | - } | |
| 1247 | - | |
| 1248 | - # Keep batch json log filenames unique so concurrent writes cannot overwrite each other | |
| 1249 | - batch_call_id = uuid.uuid4().hex[:12] | |
| 1250 | - batch_log_file = ( | |
| 1251 | - LOG_DIR | |
| 1252 | - / f"batch_{analysis_kind}_{batch_num:04d}_{timestamp}_{batch_call_id}.json" | |
| 1253 | - ) | |
| 1254 | - with open(batch_log_file, "w", encoding="utf-8") as f: | |
| 1255 | - json.dump(batch_log, f, ensure_ascii=False, indent=2) | |
| 1256 | - | |
| 1257 | - logger.info(f"Batch log saved to: {batch_log_file}") | |
| 1258 | - | |
| 1259 | - return results_with_ids | |
| 1260 | - | |
| 1261 | - except Exception as e: | |
| 1262 | - logger.error(f"Error processing batch {batch_num}: {str(e)}", exc_info=True) | |
| 1263 | - # Return empty results, preserving the ID mapping | |
| 1264 | - return [ | |
| 1265 | - _make_empty_analysis_result(item, target_lang, schema, error=str(e)) | |
| 1266 | - for item in batch_data | |
| 1267 | - ] | |
| 1268 | - | |
| 1269 | - | |
| 1270 | -def analyze_products( | |
| 1271 | - products: List[Dict[str, str]], | |
| 1272 | - target_lang: str = "zh", | |
| 1273 | - batch_size: Optional[int] = None, | |
| 1274 | - tenant_id: Optional[str] = None, | |
| 1275 | - analysis_kind: str = "content", | |
| 1276 | - category_taxonomy_profile: Optional[str] = None, | |
| 1277 | -) -> List[Dict[str, Any]]: | |
| 1278 | - """ | |
| 1279 | - Library entry point: given inputs + language, return anchor text and per-dimension info. | |
| 1280 | - | |
| 1281 | - Args: | |
| 1282 | - products: [{"id": "...", "title": "..."}] | |
| 1283 | - target_lang: output language | |
| 1284 | - batch_size: batch size, defaults to the global BATCH_SIZE | |
| 1285 | - """ | |
| 1286 | - if not API_KEY: | |
| 1287 | - raise RuntimeError("DASHSCOPE_API_KEY is not set, cannot call LLM") | |
| 1288 | - | |
| 1289 | - if not products: | |
| 1290 | - return [] | |
| 1291 | - | |
| 1292 | - _get_analysis_schema( | |
| 1293 | - analysis_kind, | |
| 1294 | - category_taxonomy_profile=category_taxonomy_profile, | |
| 1295 | - ) | |
| 1296 | - results_by_index: List[Optional[Dict[str, Any]]] = [None] * len(products) | |
| 1297 | - uncached_items: List[Tuple[int, Dict[str, str]]] = [] | |
| 1298 | - | |
| 1299 | - for idx, product in enumerate(products): | |
| 1300 | - title = str(product.get("title") or "").strip() | |
| 1301 | - if not title: | |
| 1302 | - uncached_items.append((idx, product)) | |
| 1303 | - continue | |
| 1304 | - | |
| 1305 | - cached = _get_cached_analysis_result( | |
| 1306 | - product, | |
| 1307 | - target_lang, | |
| 1308 | - analysis_kind, | |
| 1309 | - category_taxonomy_profile=category_taxonomy_profile, | |
| 1310 | - ) | |
| 1311 | - if cached: | |
| 1312 | - logger.info( | |
| 1313 | - f"[analyze_products] Cache hit for title='{title[:50]}...', " | |
| 1314 | - f"kind={analysis_kind}, lang={target_lang}" | |
| 1315 | - ) | |
| 1316 | - results_by_index[idx] = cached | |
| 1317 | - continue | |
| 1318 | - | |
| 1319 | - uncached_items.append((idx, product)) | |
| 1320 | - | |
| 1321 | - if not uncached_items: | |
| 1322 | - return [item for item in results_by_index if item is not None] | |
| 1323 | - | |
| 1324 | - # call_llm handles at most BATCH_SIZE items per call (default 20): | |
| 1325 | - # - batch up as much as possible; | |
| 1326 | - # - even if the caller passes a larger batch_size, it is split at this cap automatically. | |
| 1327 | - req_bs = BATCH_SIZE if batch_size is None else int(batch_size) | |
| 1328 | - bs = max(1, min(req_bs, BATCH_SIZE)) | |
| 1329 | - total_batches = (len(uncached_items) + bs - 1) // bs | |
| 1330 | - | |
| 1331 | - batch_jobs: List[Tuple[int, List[Tuple[int, Dict[str, str]]], List[Dict[str, str]]]] = [] | |
| 1332 | - for i in range(0, len(uncached_items), bs): | |
| 1333 | - batch_num = i // bs + 1 | |
| 1334 | - batch_slice = uncached_items[i : i + bs] | |
| 1335 | - batch = [item for _, item in batch_slice] | |
| 1336 | - batch_jobs.append((batch_num, batch_slice, batch)) | |
| 1337 | - | |
| 1338 | - # With a single batch, run serially to avoid thread-pool setup overhead and uncontrolled interleaving of logs/log files | |
| 1339 | - if total_batches <= 1 or CONTENT_UNDERSTANDING_MAX_WORKERS <= 1: | |
| 1340 | - for batch_num, batch_slice, batch in batch_jobs: | |
| 1341 | - logger.info( | |
| 1342 | - f"[analyze_products] Processing batch {batch_num}/{total_batches}, " | |
| 1343 | - f"size={len(batch)}, kind={analysis_kind}, target_lang={target_lang}" | |
| 1344 | - ) | |
| 1345 | - batch_results = process_batch( | |
| 1346 | - batch, | |
| 1347 | - batch_num=batch_num, | |
| 1348 | - target_lang=target_lang, | |
| 1349 | - analysis_kind=analysis_kind, | |
| 1350 | - category_taxonomy_profile=category_taxonomy_profile, | |
| 1351 | - ) | |
| 1352 | - | |
| 1353 | - for (original_idx, product), item in zip(batch_slice, batch_results): | |
| 1354 | - results_by_index[original_idx] = item | |
| 1355 | - title_input = str(item.get("title_input") or "").strip() | |
| 1356 | - if not title_input: | |
| 1357 | - continue | |
| 1358 | - if item.get("error"): | |
| 1359 | - # Do not cache error results, to avoid amplifying transient failures | |
| 1360 | - continue | |
| 1361 | - try: | |
| 1362 | - _set_cached_analysis_result( | |
| 1363 | - product, | |
| 1364 | - target_lang, | |
| 1365 | - item, | |
| 1366 | - analysis_kind, | |
| 1367 | - category_taxonomy_profile=category_taxonomy_profile, | |
| 1368 | - ) | |
| 1369 | - except Exception: | |
| 1370 | - # A warning was already logged internally | |
| 1371 | - pass | |
| 1372 | - else: | |
| 1373 | - max_workers = min(CONTENT_UNDERSTANDING_MAX_WORKERS, len(batch_jobs)) | |
| 1374 | - logger.info( | |
| 1375 | - "[analyze_products] Using ThreadPoolExecutor for uncached batches: " | |
| 1376 | - "max_workers=%s, total_batches=%s, bs=%s, kind=%s, target_lang=%s", | |
| 1377 | - max_workers, | |
| 1378 | - total_batches, | |
| 1379 | - bs, | |
| 1380 | - analysis_kind, | |
| 1381 | - target_lang, | |
| 1382 | - ) | |
| 1383 | - | |
| 1384 | - # Only the "LLM call + markdown parsing" work runs in threads; Redis get/set stays on the main thread to avoid the extra risk of concurrent writes. | |
| 1385 | - # Note: the thread pool is a module-level singleton, so max_workers here mainly carries log semantics (actual concurrency is bounded by the singleton pool's cap). | |
| 1386 | - executor = _get_content_understanding_executor() | |
| 1387 | - future_by_batch_num: Dict[int, Any] = {} | |
| 1388 | - for batch_num, _batch_slice, batch in batch_jobs: | |
| 1389 | - future_by_batch_num[batch_num] = executor.submit( | |
| 1390 | - process_batch, | |
| 1391 | - batch, | |
| 1392 | - batch_num=batch_num, | |
| 1393 | - target_lang=target_lang, | |
| 1394 | - analysis_kind=analysis_kind, | |
| 1395 | - category_taxonomy_profile=category_taxonomy_profile, | |
| 1396 | - ) | |
| 1397 | - | |
| 1398 | - # Backfill by batch_num to keep output stable (results_by_index is keyed by original input index) | |
| 1399 | - for batch_num, batch_slice, _batch in batch_jobs: | |
| 1400 | - batch_results = future_by_batch_num[batch_num].result() | |
| 1401 | - for (original_idx, product), item in zip(batch_slice, batch_results): | |
| 1402 | - results_by_index[original_idx] = item | |
| 1403 | - title_input = str(item.get("title_input") or "").strip() | |
| 1404 | - if not title_input: | |
| 1405 | - continue | |
| 1406 | - if item.get("error"): | |
| 1407 | - # Do not cache error results, to avoid amplifying transient failures | |
| 1408 | - continue | |
| 1409 | - try: | |
| 1410 | - _set_cached_analysis_result( | |
| 1411 | - product, | |
| 1412 | - target_lang, | |
| 1413 | - item, | |
| 1414 | - analysis_kind, | |
| 1415 | - category_taxonomy_profile=category_taxonomy_profile, | |
| 1416 | - ) | |
| 1417 | - except Exception: | |
| 1418 | - # A warning was already logged internally | |
| 1419 | - pass | |
| 1420 | - | |
| 1421 | - return [item for item in results_by_index if item is not None] |
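The dispatch path deleted above boils down to a reusable pattern: clamp the requested batch size to a hard cap, split pending items into numbered batches, run a single batch serially, and fan larger workloads out to a thread pool while backfilling results by batch number so the flattened output keeps the input order. A minimal standalone sketch of that pattern follows; `process_batch`, `analyze_in_batches`, and the string payloads are illustrative stand-ins, not the deleted module's actual API (the real `process_batch` wrapped an LLM call plus markdown parsing):

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Optional, Tuple

BATCH_SIZE = 20  # assumed hard cap, mirroring the removed module's default

def process_batch(batch: List[str], batch_num: int) -> List[str]:
    # Stand-in for the real per-batch LLM call: tag each item with its batch number.
    return [f"b{batch_num}:{item}" for item in batch]

def analyze_in_batches(items: List[str], batch_size: Optional[int] = None,
                       max_workers: int = 4) -> List[str]:
    # Clamp the requested batch size to the hard cap, as the removed code did.
    req_bs = BATCH_SIZE if batch_size is None else int(batch_size)
    bs = max(1, min(req_bs, BATCH_SIZE))
    # Split into (batch_num, slice) jobs; batch numbers start at 1.
    jobs: List[Tuple[int, List[str]]] = [
        (i // bs + 1, items[i:i + bs]) for i in range(0, len(items), bs)
    ]
    if len(jobs) <= 1 or max_workers <= 1:
        # Single batch (or forced serial): skip the thread pool entirely.
        return [r for num, batch in jobs for r in process_batch(batch, num)]
    results: Dict[int, List[str]] = {}
    with ThreadPoolExecutor(max_workers=min(max_workers, len(jobs))) as pool:
        futures = {num: pool.submit(process_batch, batch, num) for num, batch in jobs}
        # Collect by batch_num so the flattened output follows input order.
        for num, _batch in jobs:
            results[num] = futures[num].result()
    return [r for num, _batch in jobs for r in results[num]]
```

Backfilling by batch number rather than consuming futures with `as_completed` is what makes the output deterministic regardless of which batch finishes first, at the cost of head-of-line waiting on the slowest earlier batch.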
indexer/product_enrich_prompts.py deleted
| ... | ... | @@ -1,849 +0,0 @@ |
| 1 | -#!/usr/bin/env python3 | |
| 2 | - | |
| 3 | -from typing import Any, Dict, Tuple | |
| 4 | - | |
| 5 | -SYSTEM_MESSAGE = ( | |
| 6 | - "You are an e-commerce product annotator. " | |
| 7 | - "Continue the provided assistant Markdown table prefix. " | |
| 8 | - "Do not repeat or modify the prefix, and do not add explanations outside the table." | |
| 9 | -) | |
| 10 | - | |
| 11 | -SHARED_ANALYSIS_INSTRUCTION = """Analyze each input product text and fill these columns: | |
| 12 | - | |
| 13 | -1. Product title: a natural, localized product name based on the input text | |
| 14 | -2. Category path: a concise category hierarchy from broad to specific, separated by ">" | |
| 15 | -3. Fine-grained tags: concise tags for style, features, design details, function, or standout selling points | |
| 16 | -4. Target audience: gender, age group, body type, or suitable users when clearly implied | |
| 17 | -5. Usage scene: likely occasions, settings, or use cases | |
| 18 | -6. Applicable season: relevant season(s) based on the product text | |
| 19 | -7. Key attributes: core product attributes and specifications. Depending on the item type, this may include fit, silhouette, length, sleeve type, neckline, waistline, closure, pattern, design details, structure, or other relevant attribute dimensions | |
| 20 | -8. Material description: material, fabric, texture, or construction description | |
| 21 | -9. Functional features: practical or performance-related functions such as stretch, breathability, warmth, support, storage, protection, or ease of wear | |
| 22 | -10. Anchor text: a search-oriented keyword string covering product type, category intent, attributes, design cues, usage scenarios, and strong shopping phrases | |
| 23 | - | |
| 24 | -Rules: | |
| 25 | -- Keep the input order and row count exactly the same. | |
| 26 | -- Infer only from the provided input product text; if uncertain, prefer concise and broadly correct ecommerce wording. | |
| 27 | -- Keep category paths concise and use ">" as the separator. | |
| 28 | -- For columns with multiple values, the localized output requirement will define the delimiter. | |
| 29 | - | |
| 30 | -Input product list: | |
| 31 | -""" | |
| 32 | - | |
| 33 | -USER_INSTRUCTION_TEMPLATE = """Please strictly return a Markdown table following the given columns in the specified language. For any column containing multiple values, separate them with commas. Do not add any other explanation. | |
| 34 | -Language: {language}""" | |
| 35 | - | |
| 36 | -def _taxonomy_field( | |
| 37 | - key: str, | |
| 38 | - label: str, | |
| 39 | - description: str, | |
| 40 | - zh_label: str | None = None, | |
| 41 | -) -> Dict[str, str]: | |
| 42 | - return { | |
| 43 | - "key": key, | |
| 44 | - "label": label, | |
| 45 | - "description": description, | |
| 46 | - "zh_label": zh_label or label, | |
| 47 | - } | |
| 48 | - | |
| 49 | - | |
| 50 | -def _build_taxonomy_shared_instruction(profile_label: str, fields: Tuple[Dict[str, str], ...]) -> str: | |
| 51 | - lines = [ | |
| 52 | - f"Analyze each input product text and fill the columns below using a {profile_label} attribute taxonomy.", | |
| 53 | - "", | |
| 54 | - "Output columns:", | |
| 55 | - ] | |
| 56 | - for idx, field in enumerate(fields, start=1): | |
| 57 | - lines.append(f"{idx}. {field['label']}: {field['description']}") | |
| 58 | - lines.extend( | |
| 59 | - [ | |
| 60 | - "", | |
| 61 | - "Rules:", | |
| 62 | - "- Keep the same row order and row count as input.", | |
| 63 | - "- Leave blank if not applicable, unmentioned, or unsupported.", | |
| 64 | - "- Use concise, standardized ecommerce wording.", | |
| 65 | - "- If multiple values, separate with commas.", | |
| 66 | - "", | |
| 67 | - "Input product list:", | |
| 68 | - ] | |
| 69 | - ) | |
| 70 | - return "\n".join(lines) | |
| 71 | - | |
| 72 | - | |
| 73 | -def _make_taxonomy_profile( | |
| 74 | - profile_label: str, | |
| 75 | - fields: Tuple[Dict[str, str], ...], | |
| 76 | -) -> Dict[str, Any]: | |
| 77 | - headers = { | |
| 78 | - "en": ["No.", *[field["label"] for field in fields]], | |
| 79 | - "zh": ["序号", *[field["zh_label"] for field in fields]], | |
| 80 | - } | |
| 81 | - return { | |
| 82 | - "profile_label": profile_label, | |
| 83 | - "fields": fields, | |
| 84 | - "shared_instruction": _build_taxonomy_shared_instruction(profile_label, fields), | |
| 85 | - "markdown_table_headers": headers, | |
| 86 | - } | |
| 87 | - | |
| 88 | - | |
| 89 | -APPAREL_TAXONOMY_FIELDS = ( | |
| 90 | - _taxonomy_field("product_type", "Product Type", "concise ecommerce apparel category label, not a full marketing title", "品类"), | |
| 91 | - _taxonomy_field("target_gender", "Target Gender", "intended gender only if clearly implied", "目标性别"), | |
| 92 | - _taxonomy_field("age_group", "Age Group", "only if clearly implied, e.g. adults, kids, teens, toddlers, babies", "年龄段"), | |
| 93 | - _taxonomy_field("season", "Season", "season(s) or all-season suitability only if supported", "适用季节"), | |
| 94 | - _taxonomy_field("fit", "Fit", "body closeness, e.g. slim, regular, relaxed, oversized, fitted", "版型"), | |
| 95 | - _taxonomy_field("silhouette", "Silhouette", "overall garment shape, e.g. straight, A-line, boxy, tapered, bodycon, wide-leg", "廓形"), | |
| 96 | - _taxonomy_field("neckline", "Neckline", "neckline type when applicable, e.g. crew neck, V-neck, hooded, collared, square neck", "领型"), | |
| 97 | - _taxonomy_field("sleeve_length_type", "Sleeve Length Type", "sleeve length only, e.g. sleeveless, short sleeve, long sleeve, three-quarter sleeve", "袖长类型"), | |
| 98 | - _taxonomy_field("sleeve_style", "Sleeve Style", "sleeve design only, e.g. puff sleeve, raglan sleeve, batwing sleeve, bell sleeve", "袖型"), | |
| 99 | - _taxonomy_field("strap_type", "Strap Type", "strap design when applicable, e.g. spaghetti strap, wide strap, halter strap, adjustable strap", "肩带设计"), | |
| 100 | - _taxonomy_field("rise_waistline", "Rise / Waistline", "waist placement when applicable, e.g. high rise, mid rise, low rise, empire waist", "腰型"), | |
| 101 | - _taxonomy_field("leg_shape", "Leg Shape", "for bottoms only, e.g. straight leg, wide leg, flare leg, tapered leg, skinny leg", "裤型"), | |
| 102 | - _taxonomy_field("skirt_shape", "Skirt Shape", "for skirts only, e.g. A-line, pleated, pencil, mermaid", "裙型"), | |
| 103 | - _taxonomy_field("length_type", "Length Type", "design length only, not size, e.g. cropped, regular, longline, mini, midi, maxi, ankle length, full length", "长度类型"), | |
| 104 | - _taxonomy_field("closure_type", "Closure Type", "fastening method when applicable, e.g. zipper, button, drawstring, elastic waist, hook-and-loop", "闭合方式"), | |
| 105 | - _taxonomy_field("design_details", "Design Details", "construction or visual details, e.g. ruched, ruffled, pleated, cut-out, layered, distressed, split hem", "设计细节"), | |
| 106 | - _taxonomy_field("fabric", "Fabric", "fabric type only, e.g. denim, knit, chiffon, jersey, fleece, cotton twill", "面料"), | |
| 107 | - _taxonomy_field("material_composition", "Material Composition", "fiber content or blend only if stated, e.g. cotton, polyester, spandex, linen blend, 95% cotton 5% elastane", "成分"), | |
| 108 | - _taxonomy_field("fabric_properties", "Fabric Properties", "inherent fabric traits, e.g. stretch, breathable, lightweight, soft-touch, water-resistant", "面料特性"), | |
| 109 | - _taxonomy_field("clothing_features", "Clothing Features", "product features, e.g. lined, reversible, hooded, packable, padded, pocketed", "服装特征"), | |
| 110 | - _taxonomy_field("functional_benefits", "Functional Benefits", "wearer benefits, e.g. moisture-wicking, thermal insulation, UV protection, easy care, supportive compression", "功能"), | |
| 111 | - _taxonomy_field("color", "Color", "specific color name when available", "主颜色"), | |
| 112 | - _taxonomy_field("color_family", "Color Family", "normalized broad retail color group, e.g. black, white, blue, green, red, pink, beige, brown, gray", "色系"), | |
| 113 | - _taxonomy_field("print_pattern", "Print / Pattern", "surface pattern when applicable, e.g. solid, striped, plaid, floral, graphic, animal print", "印花 / 图案"), | |
| 114 | - _taxonomy_field("occasion_end_use", "Occasion / End Use", "likely use occasion only if supported, e.g. office, casual wear, streetwear, lounge, workout, outdoor", "适用场景"), | |
| 115 | - _taxonomy_field("style_aesthetic", "Style Aesthetic", "overall style only if supported, e.g. minimalist, streetwear, athleisure, smart casual, romantic, playful", "风格"), | |
| 116 | -) | |
| 117 | - | |
| 118 | -THREE_C_TAXONOMY_FIELDS = ( | |
| 119 | - _taxonomy_field("product_type", "Product Type", "concise 3C accessory or peripheral category label", "品类"), | |
| 120 | - _taxonomy_field("compatible_device", "Compatible Device / Model", "supported device family, series, model, or form factor when clearly stated", "适配设备 / 型号"), | |
| 121 | - _taxonomy_field("connectivity", "Connectivity", "connection method such as wired, wireless, Bluetooth, Wi-Fi, NFC, or 2.4G", "连接方式"), | |
| 122 | - _taxonomy_field("interface_port_type", "Interface / Port Type", "relevant connector or port, e.g. USB-C, Lightning, HDMI, AUX, RJ45", "接口 / 端口类型"), | |
| 123 | - _taxonomy_field("power_charging", "Power Source / Charging", "charging or power mode, e.g. battery powered, fast charging, rechargeable, plug-in", "供电 / 充电方式"), | |
| 124 | - _taxonomy_field("key_features", "Key Features", "primary hardware features such as noise cancelling, foldable, magnetic, backlit, waterproof", "关键特征"), | |
| 125 | - _taxonomy_field("material_finish", "Material / Finish", "main material or exterior finish when supported", "材质 / 表面处理"), | |
| 126 | - _taxonomy_field("color", "Color", "specific color name when available", "主颜色"), | |
| 127 | - _taxonomy_field("pack_size", "Pack Size", "unit count or bundle size when stated", "包装规格"), | |
| 128 | - _taxonomy_field("use_case", "Use Case", "intended usage such as travel, office, gaming, car, charging, streaming", "使用场景"), | |
| 129 | -) | |
| 130 | - | |
| 131 | -BAGS_TAXONOMY_FIELDS = ( | |
| 132 | - _taxonomy_field("product_type", "Product Type", "concise bag category such as backpack, tote bag, crossbody bag, luggage, or wallet", "品类"), | |
| 133 | - _taxonomy_field("target_gender", "Target Gender", "intended gender only if clearly implied", "目标性别"), | |
| 134 | - _taxonomy_field("carry_style", "Carry Style", "how the bag is worn or carried, e.g. handheld, shoulder, crossbody, backpack", "携带方式"), | |
| 135 | - _taxonomy_field("size_capacity", "Size / Capacity", "size tier or capacity when supported, e.g. mini, large capacity, 20L", "尺寸 / 容量"), | |
| 136 | - _taxonomy_field("material", "Material", "main bag material such as leather, nylon, canvas, PU, straw", "材质"), | |
| 137 | - _taxonomy_field("closure_type", "Closure Type", "bag closure such as zipper, flap, buckle, drawstring, magnetic snap", "闭合方式"), | |
| 138 | - _taxonomy_field("structure_compartments", "Structure / Compartments", "organizational structure such as multi-pocket, laptop sleeve, card slots, expandable", "结构 / 分层"), | |
| 139 | - _taxonomy_field("strap_handle_type", "Strap / Handle Type", "strap or handle design such as chain strap, top handle, adjustable strap", "肩带 / 提手类型"), | |
| 140 | - _taxonomy_field("color", "Color", "specific color name when available", "主颜色"), | |
| 141 | - _taxonomy_field("occasion_end_use", "Occasion / End Use", "likely use such as commute, travel, evening, school, casual", "适用场景"), | |
| 142 | -) | |
| 143 | - | |
| 144 | -PET_SUPPLIES_TAXONOMY_FIELDS = ( | |
| 145 | - _taxonomy_field("product_type", "Product Type", "concise pet supplies category label", "品类"), | |
| 146 | - _taxonomy_field("pet_type", "Pet Type", "target pet such as dog, cat, bird, fish, hamster", "宠物类型"), | |
| 147 | - _taxonomy_field("breed_size", "Breed Size", "pet size or breed size when stated, e.g. small breed, large dogs", "体型 / 品种大小"), | |
| 148 | - _taxonomy_field("life_stage", "Life Stage", "pet age stage when supported, e.g. puppy, kitten, adult, senior", "成长阶段"), | |
| 149 | - _taxonomy_field("material_ingredients", "Material / Ingredients", "main material or ingredient composition when supported", "材质 / 成分"), | |
| 150 | - _taxonomy_field("flavor_scent", "Flavor / Scent", "flavor or scent when applicable", "口味 / 气味"), | |
| 151 | - _taxonomy_field("key_features", "Key Features", "primary attributes such as interactive, leak-proof, orthopedic, washable, elevated", "关键特征"), | |
| 152 | - _taxonomy_field("functional_benefits", "Functional Benefits", "benefits such as dental care, calming, digestion support, joint support", "功能"), | |
| 153 | - _taxonomy_field("size_capacity", "Size / Capacity", "size, count, or net content when stated", "尺寸 / 容量"), | |
| 154 | - _taxonomy_field("use_scenario", "Use Scenario", "usage such as feeding, training, grooming, travel, indoor play", "使用场景"), | |
| 155 | -) | |
| 156 | - | |
| 157 | -ELECTRONICS_TAXONOMY_FIELDS = ( | |
| 158 | - _taxonomy_field("product_type", "Product Type", "concise electronics device or component category label", "品类"), | |
| 159 | - _taxonomy_field("device_category", "Device Category / Compatibility", "supported platform, component class, or compatible device family when stated", "设备类别 / 兼容性"), | |
| 160 | - _taxonomy_field("power_voltage", "Power / Voltage", "power, voltage, wattage, or battery spec when supported", "功率 / 电压"), | |
| 161 | - _taxonomy_field("connectivity", "Connectivity", "connection method such as wired, Bluetooth, Wi-Fi, RF, or smart app control", "连接方式"), | |
| 162 | - _taxonomy_field("interface_port_type", "Interface / Port Type", "relevant port or interface such as USB-C, AC plug type, HDMI, SATA", "接口 / 端口类型"), | |
| 163 | - _taxonomy_field("capacity_storage", "Capacity / Storage", "capacity or storage spec such as 256GB, 2TB, 5000mAh", "容量 / 存储"), | |
| 164 | - _taxonomy_field("key_features", "Key Features", "main product features such as touch control, HD display, noise reduction, smart control", "关键特征"), | |
| 165 | - _taxonomy_field("material_finish", "Material / Finish", "main housing material or finish when supported", "材质 / 表面处理"), | |
| 166 | - _taxonomy_field("color", "Color", "specific color name when available", "主颜色"), | |
| 167 | - _taxonomy_field("use_case", "Use Case", "intended use such as home entertainment, office, charging, security, repair", "使用场景"), | |
| 168 | -) | |
| 169 | - | |
| 170 | -OUTDOOR_TAXONOMY_FIELDS = ( | |
| 171 | - _taxonomy_field("product_type", "Product Type", "concise outdoor gear category label", "品类"), | |
| 172 | - _taxonomy_field("activity_type", "Activity Type", "primary outdoor activity such as camping, hiking, fishing, climbing, travel", "活动类型"), | |
| 173 | - _taxonomy_field("season_weather", "Season / Weather", "season or weather suitability when supported", "适用季节 / 天气"), | |
| 174 | - _taxonomy_field("material", "Material", "main material such as aluminum, ripstop nylon, stainless steel, EVA", "材质"), | |
| 175 | - _taxonomy_field("capacity_size", "Capacity / Size", "size, length, or capacity when stated", "容量 / 尺寸"), | |
| 176 | - _taxonomy_field("protection_resistance", "Protection / Resistance", "resistance or protection such as waterproof, UV resistant, windproof", "防护 / 耐受性"), | |
| 177 | - _taxonomy_field("key_features", "Key Features", "primary gear attributes such as foldable, lightweight, insulated, non-slip", "关键特征"), | |
| 178 | - _taxonomy_field("portability_packability", "Portability / Packability", "carry or storage trait such as collapsible, compact, ultralight, packable", "便携 / 收纳性"), | |
| 179 | - _taxonomy_field("color", "Color", "specific color name when available", "主颜色"), | |
| 180 | - _taxonomy_field("use_scenario", "Use Scenario", "likely use setting such as campsite, trail, survival kit, beach, picnic", "使用场景"), | |
| 181 | -) | |
| 182 | - | |
| 183 | -HOME_APPLIANCES_TAXONOMY_FIELDS = ( | |
| 184 | - _taxonomy_field("product_type", "Product Type", "concise home appliance category label", "品类"), | |
| 185 | - _taxonomy_field("appliance_category", "Appliance Category", "functional class such as kitchen appliance, cleaning appliance, personal care appliance", "家电类别"), | |
| 186 | - _taxonomy_field("power_voltage", "Power / Voltage", "wattage, voltage, plug type, or power supply when supported", "功率 / 电压"), | |
| 187 | - _taxonomy_field("capacity_coverage", "Capacity / Coverage", "capacity or coverage metric such as 1.5L, 20L, 40sqm", "容量 / 覆盖范围"), | |
| 188 | - _taxonomy_field("control_method", "Control Method", "operation method such as touch, knob, remote, app control", "控制方式"), | |
| 189 | - _taxonomy_field("installation_type", "Installation Type", "setup style such as countertop, handheld, portable, wall-mounted, built-in", "安装方式"), | |
| 190 | - _taxonomy_field("key_features", "Key Features", "main product features such as timer, steam, HEPA filter, self-cleaning", "关键特征"), | |
| 191 | - _taxonomy_field("material_finish", "Material / Finish", "main material or exterior finish when supported", "材质 / 表面处理"), | |
| 192 | - _taxonomy_field("color", "Color", "specific color name when available", "主颜色"), | |
| 193 | - _taxonomy_field("use_scenario", "Use Scenario", "intended use such as cooking, cleaning, grooming, cooling, air treatment", "使用场景"), | |
| 194 | -) | |
| 195 | - | |
| 196 | -HOME_LIVING_TAXONOMY_FIELDS = ( | |
| 197 | - _taxonomy_field("product_type", "Product Type", "concise home and living category label", "品类"), | |
| 198 | - _taxonomy_field("room_placement", "Room / Placement", "intended room or placement such as bedroom, kitchen, bathroom, desktop", "适用空间 / 摆放位置"), | |
| 199 | - _taxonomy_field("material", "Material", "main material such as wood, ceramic, cotton, glass, metal", "材质"), | |
| 200 | - _taxonomy_field("style", "Style", "home style such as modern, farmhouse, minimalist, boho, Nordic", "风格"), | |
| 201 | - _taxonomy_field("size_dimensions", "Size / Dimensions", "size or dimensions when stated", "尺寸 / 规格"), | |
| 202 | - _taxonomy_field("color", "Color", "specific color name when available", "主颜色"), | |
| 203 | - _taxonomy_field("pattern_finish", "Pattern / Finish", "surface pattern or finish such as solid, marble, matte, ribbed", "图案 / 表面处理"), | |
| 204 | - _taxonomy_field("key_features", "Key Features", "main product features such as stackable, washable, blackout, space-saving", "关键特征"), | |
| 205 | - _taxonomy_field("assembly_installation", "Assembly / Installation", "assembly or installation trait when supported", "组装 / 安装"), | |
| 206 | - _taxonomy_field("use_scenario", "Use Scenario", "intended use such as storage, dining, decor, sleep, organization", "使用场景"), | |
| 207 | -) | |
| 208 | - | |
| 209 | -WIGS_TAXONOMY_FIELDS = ( | |
| 210 | - _taxonomy_field("product_type", "Product Type", "concise wig or hairpiece category label", "品类"), | |
| 211 | - _taxonomy_field("hair_material", "Hair Material", "hair material such as human hair, synthetic fiber, heat-resistant fiber", "发丝材质"), | |
| 212 | - _taxonomy_field("hair_texture", "Hair Texture", "texture or curl pattern such as straight, body wave, curly, kinky", "发质纹理"), | |
| 213 | - _taxonomy_field("hair_length", "Hair Length", "hair length when stated", "发长"), | |
| 214 | - _taxonomy_field("hair_color", "Hair Color", "specific hair color or blend when available", "发色"), | |
| 215 | - _taxonomy_field("cap_construction", "Cap Construction", "cap type such as full lace, lace front, glueless, U part", "帽网结构"), | |
| 216 | - _taxonomy_field("lace_area_part_type", "Lace Area / Part Type", "lace size or part style such as 13x4 lace, middle part, T part", "蕾丝面积 / 分缝类型"), | |
| 217 | - _taxonomy_field("density_volume", "Density / Volume", "hair density or fullness when supported", "密度 / 发量"), | |
| 218 | - _taxonomy_field("style_bang_type", "Style / Bang Type", "style cue such as bob, pixie, layered, with bangs", "款式 / 刘海类型"), | |
| 219 | - _taxonomy_field("occasion_end_use", "Occasion / End Use", "intended use such as daily wear, cosplay, protective style, party", "适用场景"), | |
| 220 | -) | |
| 221 | - | |
| 222 | -BEAUTY_TAXONOMY_FIELDS = ( | |
| 223 | - _taxonomy_field("product_type", "Product Type", "concise beauty or cosmetics category label", "品类"), | |
| 224 | - _taxonomy_field("target_area", "Target Area", "target area such as face, lips, eyes, nails, hair, body", "适用部位"), | |
| 225 | - _taxonomy_field("skin_hair_type", "Skin Type / Hair Type", "suitable skin or hair type when supported", "肤质 / 发质"), | |
| 226 | - _taxonomy_field("finish_effect", "Finish / Effect", "cosmetic finish or effect such as matte, dewy, volumizing, brightening", "妆效 / 效果"), | |
| 227 | - _taxonomy_field("key_ingredients", "Key Ingredients", "notable ingredients when stated", "关键成分"), | |
| 228 | - _taxonomy_field("shade_color", "Shade / Color", "specific shade or color when available", "色号 / 颜色"), | |
| 229 | - _taxonomy_field("scent", "Scent", "fragrance or scent only when supported", "香味"), | |
| 230 | - _taxonomy_field("formulation", "Formulation", "product form such as cream, serum, powder, gel, stick", "剂型 / 形态"), | |
| 231 | - _taxonomy_field("functional_benefits", "Functional Benefits", "benefits such as hydration, anti-aging, long-wear, repair, sun protection", "功能"), | |
| 232 | - _taxonomy_field("use_scenario", "Use Scenario", "intended use such as daily routine, salon, travel, evening makeup", "使用场景"), | |
| 233 | -) | |
| 234 | - | |
| 235 | -ACCESSORIES_TAXONOMY_FIELDS = ( | |
| 236 | - _taxonomy_field("product_type", "Product Type", "concise accessory category label such as necklace, watch, belt, hat, or sunglasses", "品类"), | |
| 237 | - _taxonomy_field("target_gender", "Target Gender", "intended gender only if clearly implied", "目标性别"), | |
| 238 | - _taxonomy_field("material", "Material", "main material such as alloy, leather, stainless steel, acetate, fabric", "材质"), | |
| 239 | - _taxonomy_field("color", "Color", "specific color name when available", "主颜色"), | |
| 240 | - _taxonomy_field("pattern_finish", "Pattern / Finish", "surface treatment or style finish such as polished, textured, braided, rhinestone", "图案 / 表面处理"), | |
| 241 | - _taxonomy_field("closure_fastening", "Closure / Fastening", "fastening method when applicable", "闭合 / 固定方式"), | |
| 242 | - _taxonomy_field("size_fit", "Size / Fit", "size or fit information such as adjustable, one size, 42mm", "尺寸 / 适配"), | |
| 243 | - _taxonomy_field("style", "Style", "style cue such as minimalist, vintage, statement, sporty", "风格"), | |
| 244 | - _taxonomy_field("occasion_end_use", "Occasion / End Use", "likely use such as daily wear, formal, party, travel, sun protection", "适用场景"), | |
| 245 | - _taxonomy_field("set_pack_size", "Set / Pack Size", "set count or pack size when stated", "套装 / 规格"), | |
| 246 | -) | |
| 247 | - | |
| 248 | -TOYS_TAXONOMY_FIELDS = ( | |
| 249 | - _taxonomy_field("product_type", "Product Type", "concise toy category label", "品类"), | |
| 250 | - _taxonomy_field("age_group", "Age Group", "intended age group when clearly implied", "年龄段"), | |
| 251 | - _taxonomy_field("character_theme", "Character / Theme", "licensed character, theme, or play theme when supported", "角色 / 主题"), | |
| 252 | - _taxonomy_field("material", "Material", "main toy material such as plush, plastic, wood, silicone", "材质"), | |
| 253 | - _taxonomy_field("power_source", "Power Source", "battery, rechargeable, wind-up, or non-powered when supported", "供电方式"), | |
| 254 | - _taxonomy_field("interactive_features", "Interactive Features", "interactive functions such as sound, lights, remote control, motion", "互动功能"), | |
| 255 | - _taxonomy_field("educational_play_value", "Educational / Play Value", "play value such as STEM, pretend play, sensory, puzzle solving", "教育 / 可玩性"), | |
| 256 | - _taxonomy_field("piece_count_size", "Piece Count / Size", "piece count or size when stated", "件数 / 尺寸"), | |
| 257 | - _taxonomy_field("color", "Color", "specific color name when available", "主颜色"), | |
| 258 | - _taxonomy_field("use_scenario", "Use Scenario", "intended use such as indoor play, bath time, party favor, outdoor play", "使用场景"), | |
| 259 | -) | |
| 260 | - | |
| 261 | -SHOES_TAXONOMY_FIELDS = ( | |
| 262 | - _taxonomy_field("product_type", "Product Type", "concise footwear category label", "品类"), | |
| 263 | - _taxonomy_field("target_gender", "Target Gender", "intended gender only if clearly implied", "目标性别"), | |
| 264 | - _taxonomy_field("age_group", "Age Group", "only if clearly implied", "年龄段"), | |
| 265 | - _taxonomy_field("closure_type", "Closure Type", "fastening method such as lace-up, slip-on, buckle, hook-and-loop", "闭合方式"), | |
| 266 | - _taxonomy_field("toe_shape", "Toe Shape", "toe shape when applicable, e.g. round toe, pointed toe, open toe", "鞋头形状"), | |
| 267 | - _taxonomy_field("heel_sole_type", "Heel Height / Sole Type", "heel or sole profile such as flat, block heel, wedge, platform, thick sole", "跟高 / 鞋底类型"), | |
| 268 | - _taxonomy_field("upper_material", "Upper Material", "main upper material such as leather, knit, canvas, mesh", "鞋面材质"), | |
| 269 | - _taxonomy_field("lining_insole_material", "Lining / Insole Material", "lining or insole material when supported", "里料 / 鞋垫材质"), | |
| 270 | - _taxonomy_field("color", "Color", "specific color name when available", "主颜色"), | |
| 271 | - _taxonomy_field("occasion_end_use", "Occasion / End Use", "likely use such as running, casual, office, hiking, formal", "适用场景"), | |
| 272 | -) | |
| 273 | - | |
| 274 | -SPORTS_TAXONOMY_FIELDS = ( | |
| 275 | - _taxonomy_field("product_type", "Product Type", "concise sports product category label", "品类"), | |
| 276 | - _taxonomy_field("sport_activity", "Sport / Activity", "primary sport or activity such as fitness, yoga, basketball, cycling, swimming", "运动 / 活动"), | |
| 277 | - _taxonomy_field("skill_level", "Skill Level", "target user level when supported, e.g. beginner, training, professional", "适用水平"), | |
| 278 | - _taxonomy_field("material", "Material", "main material such as EVA, carbon fiber, neoprene, latex", "材质"), | |
| 279 | - _taxonomy_field("size_capacity", "Size / Capacity", "size, weight, resistance level, or capacity when stated", "尺寸 / 容量"), | |
| 280 | - _taxonomy_field("protection_support", "Protection / Support", "support or protection function such as ankle support, shock absorption, impact protection", "防护 / 支撑"), | |
| 281 | - _taxonomy_field("key_features", "Key Features", "main features such as anti-slip, adjustable, foldable, quick-dry", "关键特征"), | |
| 282 | - _taxonomy_field("power_source", "Power Source", "battery, electric, or non-powered when applicable", "供电方式"), | |
| 283 | - _taxonomy_field("color", "Color", "specific color name when available", "主颜色"), | |
| 284 | - _taxonomy_field("use_scenario", "Use Scenario", "intended use such as gym, home workout, field training, competition", "使用场景"), | |
| 285 | -) | |
| 286 | - | |
| 287 | -OTHERS_TAXONOMY_FIELDS = ( | |
| 288 | - _taxonomy_field("product_type", "Product Type", "concise product category label, not a full marketing title", "品类"), | |
| 289 | - _taxonomy_field("product_category", "Product Category", "broader retail grouping when the specific product type is narrow", "商品类别"), | |
| 290 | - _taxonomy_field("target_user", "Target User", "intended user, audience, or recipient when clearly implied", "适用人群"), | |
| 291 | - _taxonomy_field("material_ingredients", "Material / Ingredients", "main material or ingredients when supported", "材质 / 成分"), | |
| 292 | - _taxonomy_field("key_features", "Key Features", "primary product attributes or standout features", "关键特征"), | |
| 293 | - _taxonomy_field("functional_benefits", "Functional Benefits", "practical benefits or performance advantages when supported", "功能"), | |
| 294 | - _taxonomy_field("size_capacity", "Size / Capacity", "size, count, weight, or capacity when stated", "尺寸 / 容量"), | |
| 295 | - _taxonomy_field("color", "Color", "specific color name when available", "主颜色"), | |
| 296 | - _taxonomy_field("style_theme", "Style / Theme", "overall style, design theme, or visual direction when supported", "风格 / 主题"), | |
| 297 | - _taxonomy_field("use_scenario", "Use Scenario", "likely use occasion or application setting when supported", "使用场景"), | |
| 298 | -) | |
| 299 | - | |
| 300 | -CATEGORY_TAXONOMY_PROFILES: Dict[str, Dict[str, Any]] = { | |
| 301 | - "apparel": _make_taxonomy_profile( | |
| 302 | - "apparel", | |
| 303 | - APPAREL_TAXONOMY_FIELDS, | |
| 304 | - ), | |
| 305 | - "3c": _make_taxonomy_profile( | |
| 306 | - "3C", | |
| 307 | - THREE_C_TAXONOMY_FIELDS, | |
| 308 | - ), | |
| 309 | - "bags": _make_taxonomy_profile( | |
| 310 | - "bags", | |
| 311 | - BAGS_TAXONOMY_FIELDS, | |
| 312 | - ), | |
| 313 | - "pet_supplies": _make_taxonomy_profile( | |
| 314 | - "pet supplies", | |
| 315 | - PET_SUPPLIES_TAXONOMY_FIELDS, | |
| 316 | - ), | |
| 317 | - "electronics": _make_taxonomy_profile( | |
| 318 | - "electronics", | |
| 319 | - ELECTRONICS_TAXONOMY_FIELDS, | |
| 320 | - ), | |
| 321 | - "outdoor": _make_taxonomy_profile( | |
| 322 | - "outdoor products", | |
| 323 | - OUTDOOR_TAXONOMY_FIELDS, | |
| 324 | - ), | |
| 325 | - "home_appliances": _make_taxonomy_profile( | |
| 326 | - "home appliances", | |
| 327 | - HOME_APPLIANCES_TAXONOMY_FIELDS, | |
| 328 | - ), | |
| 329 | - "home_living": _make_taxonomy_profile( | |
| 330 | - "home and living", | |
| 331 | - HOME_LIVING_TAXONOMY_FIELDS, | |
| 332 | - ), | |
| 333 | - "wigs": _make_taxonomy_profile( | |
| 334 | - "wigs", | |
| 335 | - WIGS_TAXONOMY_FIELDS, | |
| 336 | - ), | |
| 337 | - "beauty": _make_taxonomy_profile( | |
| 338 | - "beauty and cosmetics", | |
| 339 | - BEAUTY_TAXONOMY_FIELDS, | |
| 340 | - ), | |
| 341 | - "accessories": _make_taxonomy_profile( | |
| 342 | - "accessories", | |
| 343 | - ACCESSORIES_TAXONOMY_FIELDS, | |
| 344 | - ), | |
| 345 | - "toys": _make_taxonomy_profile( | |
| 346 | - "toys", | |
| 347 | - TOYS_TAXONOMY_FIELDS, | |
| 348 | - ), | |
| 349 | - "shoes": _make_taxonomy_profile( | |
| 350 | - "shoes", | |
| 351 | - SHOES_TAXONOMY_FIELDS, | |
| 352 | - ), | |
| 353 | - "sports": _make_taxonomy_profile( | |
| 354 | - "sports products", | |
| 355 | - SPORTS_TAXONOMY_FIELDS, | |
| 356 | - ), | |
| 357 | - "others": _make_taxonomy_profile( | |
| 358 | - "general merchandise", | |
| 359 | - OTHERS_TAXONOMY_FIELDS, | |
| 360 | - ), | |
| 361 | -} | |
| 362 | - | |
| 363 | -TAXONOMY_SHARED_ANALYSIS_INSTRUCTION = CATEGORY_TAXONOMY_PROFILES["apparel"]["shared_instruction"] | |
| 364 | -TAXONOMY_MARKDOWN_TABLE_HEADERS_EN = CATEGORY_TAXONOMY_PROFILES["apparel"]["markdown_table_headers"]["en"] | |
| 365 | -TAXONOMY_LANGUAGE_MARKDOWN_TABLE_HEADERS: Dict[str, Dict[str, Any]] = CATEGORY_TAXONOMY_PROFILES["apparel"]["markdown_table_headers"] | |
| 366 | - | |
| 367 | -LANGUAGE_MARKDOWN_TABLE_HEADERS: Dict[str, Dict[str, Any]] = { | |
| 368 | - "en": [ | |
| 369 | - "No.", | |
| 370 | - "Product title", | |
| 371 | - "Category path", | |
| 372 | - "Fine-grained tags", | |
| 373 | - "Target audience", | |
| 374 | - "Usage scene", | |
| 375 | - "Season", | |
| 376 | - "Key attributes", | |
| 377 | - "Material", | |
| 378 | - "Features", | |
| 379 | - "Anchor text" | |
| 380 | - ], | |
| 381 | - "zh": [ | |
| 382 | - "序号", | |
| 383 | - "商品标题", | |
| 384 | - "品类路径", | |
| 385 | - "细分标签", | |
| 386 | - "适用人群", | |
| 387 | - "使用场景", | |
| 388 | - "适用季节", | |
| 389 | - "关键属性", | |
| 390 | - "材质说明", | |
| 391 | - "功能特点", | |
| 392 | - "锚文本" | |
| 393 | - ], | |
| 394 | - "zh_tw": [ | |
| 395 | - "序號", | |
| 396 | - "商品標題", | |
| 397 | - "品類路徑", | |
| 398 | - "細分標籤", | |
| 399 | - "適用人群", | |
| 400 | - "使用場景", | |
| 401 | - "適用季節", | |
| 402 | - "關鍵屬性", | |
| 403 | - "材質說明", | |
| 404 | - "功能特點", | |
| 405 | - "錨文本" | |
| 406 | - ], | |
| 407 | - "ru": [ | |
| 408 | - "№", | |
| 409 | - "Название товара", | |
| 410 | - "Путь категории", | |
| 411 | - "Детализированные теги", | |
| 412 | - "Целевая аудитория", | |
| 413 | - "Сценарий использования", | |
| 414 | - "Сезон", | |
| 415 | - "Ключевые атрибуты", | |
| 416 | - "Материал", | |
| 417 | - "Особенности", | |
| 418 | - "Анкорный текст" | |
| 419 | - ], | |
| 420 | - "ja": [ | |
| 421 | - "番号", | |
| 422 | - "商品タイトル", | |
| 423 | - "カテゴリパス", | |
| 424 | - "詳細タグ", | |
| 425 | - "対象ユーザー", | |
| 426 | - "利用シーン", | |
| 427 | - "季節", | |
| 428 | - "主要属性", | |
| 429 | - "素材", | |
| 430 | - "機能特徴", | |
| 431 | - "アンカーテキスト" | |
| 432 | - ], | |
| 433 | - "ko": [ | |
| 434 | - "번호", | |
| 435 | - "상품 제목", | |
| 436 | - "카테고리 경로", | |
| 437 | - "세부 태그", | |
| 438 | - "대상 고객", | |
| 439 | - "사용 장면", | |
| 440 | - "계절", | |
| 441 | - "핵심 속성", | |
| 442 | - "소재", | |
| 443 | - "기능 특징", | |
| 444 | - "앵커 텍스트" | |
| 445 | - ], | |
| 446 | - "es": [ | |
| 447 | - "N.º", | |
| 448 | - "Titulo del producto", | |
| 449 | - "Ruta de categoria", | |
| 450 | - "Etiquetas detalladas", | |
| 451 | - "Publico objetivo", | |
| 452 | - "Escenario de uso", | |
| 453 | - "Temporada", | |
| 454 | - "Atributos clave", | |
| 455 | - "Material", | |
| 456 | - "Caracteristicas", | |
| 457 | - "Texto ancla" | |
| 458 | - ], | |
| 459 | - "fr": [ | |
| 460 | - "N°", | |
| 461 | - "Titre du produit", | |
| 462 | - "Chemin de categorie", | |
| 463 | - "Etiquettes detaillees", | |
| 464 | - "Public cible", | |
| 465 | - "Scenario d'utilisation", | |
| 466 | - "Saison", | |
| 467 | - "Attributs cles", | |
| 468 | - "Matiere", | |
| 469 | - "Caracteristiques", | |
| 470 | - "Texte d'ancrage" | |
| 471 | - ], | |
| 472 | - "pt": [ | |
| 473 | - "Nº", | |
| 474 | - "Titulo do produto", | |
| 475 | - "Caminho da categoria", | |
| 476 | - "Tags detalhadas", | |
| 477 | - "Publico-alvo", | |
| 478 | - "Cenario de uso", | |
| 479 | - "Estacao", | |
| 480 | - "Atributos principais", | |
| 481 | - "Material", | |
| 482 | - "Caracteristicas", | |
| 483 | - "Texto ancora" | |
| 484 | - ], | |
| 485 | - "de": [ | |
| 486 | - "Nr.", | |
| 487 | - "Produkttitel", | |
| 488 | - "Kategoriepfad", | |
| 489 | - "Detaillierte Tags", | |
| 490 | - "Zielgruppe", | |
| 491 | - "Nutzungsszenario", | |
| 492 | - "Saison", | |
| 493 | - "Wichtige Attribute", | |
| 494 | - "Material", | |
| 495 | - "Funktionen", | |
| 496 | - "Ankertext" | |
| 497 | - ], | |
| 498 | - "it": [ | |
| 499 | - "N.", | |
| 500 | - "Titolo del prodotto", | |
| 501 | - "Percorso categoria", | |
| 502 | - "Tag dettagliati", | |
| 503 | - "Pubblico target", | |
| 504 | - "Scenario d'uso", | |
| 505 | - "Stagione", | |
| 506 | - "Attributi chiave", | |
| 507 | - "Materiale", | |
| 508 | - "Caratteristiche", | |
| 509 | - "Testo ancora" | |
| 510 | - ], | |
| 511 | - "th": [ | |
| 512 | - "ลำดับ", | |
| 513 | - "ชื่อสินค้า", | |
| 514 | - "เส้นทางหมวดหมู่", | |
| 515 | - "แท็กย่อย", | |
| 516 | - "กลุ่มเป้าหมาย", | |
| 517 | - "สถานการณ์การใช้งาน", | |
| 518 | - "ฤดูกาล", | |
| 519 | - "คุณสมบัติสำคัญ", | |
| 520 | - "วัสดุ", | |
| 521 | - "คุณสมบัติการใช้งาน", | |
| 522 | - "แองเคอร์เท็กซ์" | |
| 523 | - ], | |
| 524 | - "vi": [ | |
| 525 | - "STT", | |
| 526 | - "Tieu de san pham", | |
| 527 | - "Duong dan danh muc", | |
| 528 | - "The chi tiet", | |
| 529 | - "Doi tuong phu hop", | |
| 530 | - "Boi canh su dung", | |
| 531 | - "Mua phu hop", | |
| 532 | - "Thuoc tinh chinh", | |
| 533 | - "Chat lieu", | |
| 534 | - "Tinh nang", | |
| 535 | - "Van ban neo" | |
| 536 | - ], | |
| 537 | - "id": [ | |
| 538 | - "No.", | |
| 539 | - "Judul produk", | |
| 540 | - "Jalur kategori", | |
| 541 | - "Tag terperinci", | |
| 542 | - "Target pengguna", | |
| 543 | - "Skenario penggunaan", | |
| 544 | - "Musim", | |
| 545 | - "Atribut utama", | |
| 546 | - "Bahan", | |
| 547 | - "Fitur", | |
| 548 | - "Teks jangkar" | |
| 549 | - ], | |
| 550 | - "ms": [ | |
| 551 | - "No.", | |
| 552 | - "Tajuk produk", | |
| 553 | - "Laluan kategori", | |
| 554 | - "Tag terperinci", | |
| 555 | - "Sasaran pengguna", | |
| 556 | - "Senario penggunaan", | |
| 557 | - "Musim", | |
| 558 | - "Atribut utama", | |
| 559 | - "Bahan", | |
| 560 | - "Ciri-ciri", | |
| 561 | - "Teks sauh" | |
| 562 | - ], | |
| 563 | - "ar": [ | |
| 564 | - "الرقم", | |
| 565 | - "عنوان المنتج", | |
| 566 | - "مسار الفئة", | |
| 567 | - "الوسوم التفصيلية", | |
| 568 | - "الفئة المستهدفة", | |
| 569 | - "سيناريو الاستخدام", | |
| 570 | - "الموسم", | |
| 571 | - "السمات الرئيسية", | |
| 572 | - "المادة", | |
| 573 | - "الميزات", | |
| 574 | - "نص الربط" | |
| 575 | - ], | |
| 576 | - "hi": [ | |
| 577 | - "क्रमांक", | |
| 578 | - "उत्पाद शीर्षक", | |
| 579 | - "श्रेणी पथ", | |
| 580 | - "विस्तृत टैग", | |
| 581 | - "लक्षित उपभोक्ता", | |
| 582 | - "उपयोग परिदृश्य", | |
| 583 | - "मौसम", | |
| 584 | - "मुख्य गुण", | |
| 585 | - "सामग्री", | |
| 586 | - "विशेषताएं", | |
| 587 | - "एंकर टेक्स्ट" | |
| 588 | - ], | |
| 589 | - "he": [ | |
| 590 | - "מס׳", | |
| 591 | - "כותרת המוצר", | |
| 592 | - "נתיב קטגוריה", | |
| 593 | - "תגיות מפורטות", | |
| 594 | - "קהל יעד", | |
| 595 | - "תרחיש שימוש", | |
| 596 | - "עונה", | |
| 597 | - "מאפיינים מרכזיים", | |
| 598 | - "חומר", | |
| 599 | - "תכונות", | |
| 600 | - "טקסט עוגן" | |
| 601 | - ], | |
| 602 | - "my": [ | |
| 603 | - "အမှတ်စဉ်", | |
| 604 | - "ကုန်ပစ္စည်းခေါင်းစဉ်", | |
| 605 | - "အမျိုးအစားလမ်းကြောင်း", | |
| 606 | - "အသေးစိတ်တဂ်များ", | |
| 607 | - "ပစ်မှတ်အသုံးပြုသူ", | |
| 608 | - "အသုံးပြုမှုအခြေအနေ", | |
| 609 | - "ရာသီ", | |
| 610 | - "အဓိကဂုဏ်သတ္တိများ", | |
| 611 | - "ပစ္စည်း", | |
| 612 | - "လုပ်ဆောင်ချက်များ", | |
| 613 | - "အန်ကာစာသား" | |
| 614 | - ], | |
| 615 | - "ta": [ | |
| 616 | - "எண்", | |
| 617 | - "தயாரிப்பு தலைப்பு", | |
| 618 | - "வகை பாதை", | |
| 619 | - "விரிவான குறிச்சொற்கள்", | |
| 620 | - "இலக்கு பயனர்கள்", | |
| 621 | - "பயன்பாட்டு நிலை", | |
| 622 | - "பருவம்", | |
| 623 | - "முக்கிய பண்புகள்", | |
| 624 | - "பொருள்", | |
| 625 | - "அம்சங்கள்", | |
| 626 | - "ஆங்கர் உரை" | |
| 627 | - ], | |
| 628 | - "ur": [ | |
| 629 | - "نمبر", | |
| 630 | - "پروڈکٹ عنوان", | |
| 631 | - "زمرہ راستہ", | |
| 632 | - "تفصیلی ٹیگز", | |
| 633 | - "ہدف صارفین", | |
| 634 | - "استعمال کا منظر", | |
| 635 | - "موسم", | |
| 636 | - "کلیدی خصوصیات", | |
| 637 | - "مواد", | |
| 638 | - "فیچرز", | |
| 639 | - "اینکر ٹیکسٹ" | |
| 640 | - ], | |
| 641 | - "bn": [ | |
| 642 | - "ক্রম", | |
| 643 | - "পণ্যের শিরোনাম", | |
| 644 | - "শ্রেণি পথ", | |
| 645 | - "বিস্তারিত ট্যাগ", | |
| 646 | - "লক্ষ্য ব্যবহারকারী", | |
| 647 | - "ব্যবহারের দৃশ্য", | |
| 648 | - "মৌসুম", | |
| 649 | - "মূল বৈশিষ্ট্য", | |
| 650 | - "উপাদান", | |
| 651 | - "ফিচার", | |
| 652 | - "অ্যাঙ্কর টেক্সট" | |
| 653 | - ], | |
| 654 | - "pl": [ | |
| 655 | - "Nr", | |
| 656 | - "Tytul produktu", | |
| 657 | - "Sciezka kategorii", | |
| 658 | - "Szczegolowe tagi", | |
| 659 | - "Grupa docelowa", | |
| 660 | - "Scenariusz uzycia", | |
| 661 | - "Sezon", | |
| 662 | - "Kluczowe atrybuty", | |
| 663 | - "Material", | |
| 664 | - "Cechy", | |
| 665 | - "Tekst kotwicy" | |
| 666 | - ], | |
| 667 | - "nl": [ | |
| 668 | - "Nr.", | |
| 669 | - "Producttitel", | |
| 670 | - "Categoriepad", | |
| 671 | - "Gedetailleerde tags", | |
| 672 | - "Doelgroep", | |
| 673 | - "Gebruikscontext", | |
| 674 | - "Seizoen", | |
| 675 | - "Belangrijke kenmerken", | |
| 676 | - "Materiaal", | |
| 677 | - "Functies", | |
| 678 | - "Ankertekst" | |
| 679 | - ], | |
| 680 | - "ro": [ | |
| 681 | - "Nr.", | |
| 682 | - "Titlul produsului", | |
| 683 | - "Calea categoriei", | |
| 684 | - "Etichete detaliate", | |
| 685 | - "Public tinta", | |
| 686 | - "Scenariu de utilizare", | |
| 687 | - "Sezon", | |
| 688 | - "Atribute cheie", | |
| 689 | - "Material", | |
| 690 | - "Caracteristici", | |
| 691 | - "Text ancora" | |
| 692 | - ], | |
| 693 | - "tr": [ | |
| 694 | - "No.", | |
| 695 | - "Urun basligi", | |
| 696 | - "Kategori yolu", | |
| 697 | - "Ayrintili etiketler", | |
| 698 | - "Hedef kitle", | |
| 699 | - "Kullanim senaryosu", | |
| 700 | - "Sezon", | |
| 701 | - "Temel ozellikler", | |
| 702 | - "Malzeme", | |
| 703 | - "Ozellikler", | |
| 704 | - "Capa metni" | |
| 705 | - ], | |
| 706 | - "km": [ | |
| 707 | - "ល.រ", | |
| 708 | - "ចំណងជើងផលិតផល", | |
| 709 | - "ផ្លូវប្រភេទ", | |
| 710 | - "ស្លាកលម្អិត", | |
| 711 | - "ក្រុមអ្នកប្រើគោលដៅ", | |
| 712 | - "សេណារីយ៉ូប្រើប្រាស់", | |
| 713 | - "រដូវកាល", | |
| 714 | - "លក្ខណៈសម្បត្តិសំខាន់", | |
| 715 | - "សម្ភារៈ", | |
| 716 | - "មុខងារ", | |
| 717 | - "អត្ថបទអង់ក័រ" | |
| 718 | - ], | |
| 719 | - "lo": [ | |
| 720 | - "ລຳດັບ", | |
| 721 | - "ຊື່ສິນຄ້າ", | |
| 722 | - "ເສັ້ນທາງໝວດໝູ່", | |
| 723 | - "ແທັກລະອຽດ", | |
| 724 | - "ກຸ່ມເປົ້າໝາຍ", | |
| 725 | - "ສະຖານະການໃຊ້ງານ", | |
| 726 | - "ລະດູການ", | |
| 727 | - "ຄຸນລັກສະນະສຳຄັນ", | |
| 728 | - "ວັດສະດຸ", | |
| 729 | - "ຄຸນສົມບັດ", | |
| 730 | - "ຂໍ້ຄວາມອັງເຄີ" | |
| 731 | - ], | |
| 732 | - "yue": [ | |
| 733 | - "序號", | |
| 734 | - "商品標題", | |
| 735 | - "品類路徑", | |
| 736 | - "細分類標籤", | |
| 737 | - "適用人群", | |
| 738 | - "使用場景", | |
| 739 | - "適用季節", | |
| 740 | - "關鍵屬性", | |
| 741 | - "材質說明", | |
| 742 | - "功能特點", | |
| 743 | - "錨文本" | |
| 744 | - ], | |
| 745 | - "cs": [ | |
| 746 | - "C.", | |
| 747 | - "Nazev produktu", | |
| 748 | - "Cesta kategorie", | |
| 749 | - "Podrobne stitky", | |
| 750 | - "Cilova skupina", | |
| 751 | - "Scenar pouziti", | |
| 752 | - "Sezona", | |
| 753 | - "Klicove atributy", | |
| 754 | - "Material", | |
| 755 | - "Vlastnosti", | |
| 756 | - "Kotvici text" | |
| 757 | - ], | |
| 758 | - "el": [ | |
| 759 | - "Α/Α", | |
| 760 | - "Τίτλος προϊόντος", | |
| 761 | - "Διαδρομή κατηγορίας", | |
| 762 | - "Αναλυτικές ετικέτες", | |
| 763 | - "Κοινό-στόχος", | |
| 764 | - "Σενάριο χρήσης", | |
| 765 | - "Εποχή", | |
| 766 | - "Βασικά χαρακτηριστικά", | |
| 767 | - "Υλικό", | |
| 768 | - "Λειτουργίες", | |
| 769 | - "Κείμενο άγκυρας" | |
| 770 | - ], | |
| 771 | - "sv": [ | |
| 772 | - "Nr", | |
| 773 | - "Produkttitel", | |
| 774 | - "Kategorisokvag", | |
| 775 | - "Detaljerade taggar", | |
| 776 | - "Malgrupp", | |
| 777 | - "Anvandningsscenario", | |
| 778 | - "Sasong", | |
| 779 | - "Viktiga attribut", | |
| 780 | - "Material", | |
| 781 | - "Funktioner", | |
| 782 | - "Ankartext" | |
| 783 | - ], | |
| 784 | - "hu": [ | |
| 785 | - "Sorszam", | |
| 786 | - "Termekcim", | |
| 787 | - "Kategoriavonal", | |
| 788 | - "Reszletes cimkek", | |
| 789 | - "Celcsoport", | |
| 790 | - "Hasznalati helyzet", | |
| 791 | - "Evszak", | |
| 792 | - "Fo jellemzok", | |
| 793 | - "Anyag", | |
| 794 | - "Funkciok", | |
| 795 | - "Horgonyszoveg" | |
| 796 | - ], | |
| 797 | - "da": [ | |
| 798 | - "Nr.", | |
| 799 | - "Produkttitel", | |
| 800 | - "Kategoristi", | |
| 801 | - "Detaljerede tags", | |
| 802 | - "Malgruppe", | |
| 803 | - "Brugsscenarie", | |
| 804 | - "Saeson", | |
| 805 | - "Nogleattributter", | |
| 806 | - "Materiale", | |
| 807 | - "Funktioner", | |
| 808 | - "Ankertekst" | |
| 809 | - ], | |
| 810 | - "fi": [ | |
| 811 | - "Nro", | |
| 812 | - "Tuotteen nimi", | |
| 813 | - "Kategoriapolku", | |
| 814 | - "Yksityiskohtaiset tunnisteet", | |
| 815 | - "Kohdeyleiso", | |
| 816 | - "Kayttotilanne", | |
| 817 | - "Kausi", | |
| 818 | - "Keskeiset ominaisuudet", | |
| 819 | - "Materiaali", | |
| 820 | - "Ominaisuudet", | |
| 821 | - "Ankkuriteksti" | |
| 822 | - ], | |
| 823 | - "uk": [ | |
| 824 | - "№", | |
| 825 | - "Назва товару", | |
| 826 | - "Шлях категорії", | |
| 827 | - "Детальні теги", | |
| 828 | - "Цільова аудиторія", | |
| 829 | - "Сценарій використання", | |
| 830 | - "Сезон", | |
| 831 | - "Ключові атрибути", | |
| 832 | - "Матеріал", | |
| 833 | - "Особливості", | |
| 834 | - "Анкорний текст" | |
| 835 | - ], | |
| 836 | - "bg": [ | |
| 837 | - "№", | |
| 838 | - "Заглавие на продукта", | |
| 839 | - "Път на категорията", | |
| 840 | - "Подробни тагове", | |
| 841 | - "Целева аудитория", | |
| 842 | - "Сценарий на употреба", | |
| 843 | - "Сезон", | |
| 844 | - "Ключови атрибути", | |
| 845 | - "Материал", | |
| 846 | - "Характеристики", | |
| 847 | - "Анкор текст" | |
| 848 | - ] | |
| 849 | -} |
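The deleted table above mapped language codes to per-language markdown header labels. For context on how such a table is typically consumed, here is a minimal sketch of a lookup with an English fallback — the two-language subset and the helper name are illustrative, not the removed implementation:

```python
from typing import Dict, List

# Hypothetical two-language subset of the removed header table
LANGUAGE_MARKDOWN_TABLE_HEADERS: Dict[str, List[str]] = {
    "en": ["No.", "Product title", "Category path"],
    "zh": ["序号", "商品标题", "品类路径"],
}


def get_table_headers(lang: str) -> List[str]:
    # Fall back to the English headers when a language code is not configured
    return LANGUAGE_MARKDOWN_TABLE_HEADERS.get(lang, LANGUAGE_MARKDOWN_TABLE_HEADERS["en"])


print(get_table_headers("zh"))  # ['序号', '商品标题', '品类路径']
print(get_table_headers("xx"))  # unknown code falls back to the English headers
```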
indexer/product_enrich模块说明.md deleted
| ... | ... | @@ -1,173 +0,0 @@ |
| 1 | -# 内容富化模块说明 | |
| 2 | - | |
| 3 | -本文说明商品内容富化模块的职责、入口、输出结构,以及当前 taxonomy profile 的设计约束。 | |
| 4 | - | |
| 5 | -## 1. 模块目标 | |
| 6 | - | |
| 7 | -内容富化模块负责基于商品文本调用 LLM,生成以下索引字段: | |
| 8 | - | |
| 9 | -- `qanchors` | |
| 10 | -- `enriched_tags` | |
| 11 | -- `enriched_attributes` | |
| 12 | -- `enriched_taxonomy_attributes` | |
| 13 | - | |
| 14 | -模块追求的设计原则: | |
| 15 | - | |
| 16 | -- 单一职责:只负责内容理解与结构化输出,不负责 CSV 读写 | |
| 17 | -- 输出对齐 ES mapping:返回结构可直接写入 `search_products` | |
| 18 | -- 配置化扩展:taxonomy profile 通过数据配置扩展,而不是散落条件分支 | |
| 19 | -- 代码精简:只面向正常使用方式,避免为了不合理调用堆叠补丁逻辑 | |
| 20 | - | |
| 21 | -## 2. 主要文件 | |
| 22 | - | |
| 23 | -- [product_enrich.py](/data/saas-search/indexer/product_enrich.py) | |
| 24 | - 运行时主逻辑,负责批处理、缓存、prompt 组装、LLM 调用、markdown 解析、输出整理 | |
| 25 | -- [product_enrich_prompts.py](/data/saas-search/indexer/product_enrich_prompts.py) | |
| 26 | - prompt 模板与 taxonomy profile 配置 | |
| 27 | -- [document_transformer.py](/data/saas-search/indexer/document_transformer.py) | |
| 28 | - 在内部索引构建链路中调用内容富化模块,把结果回填到 ES doc | |
| 29 | -- [taxonomy.md](/data/saas-search/indexer/taxonomy.md) | |
| 30 | - taxonomy 设计说明与字段清单 | |
| 31 | - | |
| 32 | -## 3. 对外入口 | |
| 33 | - | |
| 34 | -### 3.1 Python 入口 | |
| 35 | - | |
| 36 | -核心入口: | |
| 37 | - | |
| 38 | -```python | |
| 39 | -build_index_content_fields( | |
| 40 | - items, | |
| 41 | - tenant_id=None, | |
| 42 | - enrichment_scopes=None, | |
| 43 | - category_taxonomy_profile=None, | |
| 44 | -) | |
| 45 | -``` | |
| 46 | - | |
| 47 | -输入最小要求: | |
| 48 | - | |
| 49 | -- `id` 或 `spu_id` | |
| 50 | -- `title` | |
| 51 | - | |
| 52 | -可选输入: | |
| 53 | - | |
| 54 | -- `brief` | |
| 55 | -- `description` | |
| 56 | -- `image_url` | |
| 57 | - | |
| 58 | -关键参数: | |
| 59 | - | |
| 60 | -- `enrichment_scopes` | |
| 61 | - 可选 `generic`、`category_taxonomy` | |
| 62 | -- `category_taxonomy_profile` | |
| 63 | - taxonomy profile;默认 `apparel` | |
| 64 | - | |
| 65 | -### 3.2 HTTP 入口 | |
| 66 | - | |
| 67 | -API 路由: | |
| 68 | - | |
| 69 | -- `POST /indexer/enrich-content` | |
| 70 | - | |
| 71 | -对应文档: | |
| 72 | - | |
| 73 | -- [搜索API对接指南-05-索引接口(Indexer)](/data/saas-search/docs/搜索API对接指南-05-索引接口(Indexer).md) | |
| 74 | -- [搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation)](/data/saas-search/docs/搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md) | |
| 75 | - | |
| 76 | -## 4. 输出结构 | |
| 77 | - | |
| 78 | -返回结果与 ES mapping 对齐: | |
| 79 | - | |
| 80 | -```json | |
| 81 | -{ | |
| 82 | - "id": "223167", | |
| 83 | - "qanchors": { | |
| 84 | - "zh": ["短袖T恤", "纯棉"], | |
| 85 | - "en": ["t-shirt", "cotton"] | |
| 86 | - }, | |
| 87 | - "enriched_tags": { | |
| 88 | - "zh": ["短袖", "纯棉"], | |
| 89 | - "en": ["short sleeve", "cotton"] | |
| 90 | - }, | |
| 91 | - "enriched_attributes": [ | |
| 92 | - { | |
| 93 | - "name": "enriched_tags", | |
| 94 | - "value": { | |
| 95 | - "zh": ["短袖", "纯棉"], | |
| 96 | - "en": ["short sleeve", "cotton"] | |
| 97 | - } | |
| 98 | - } | |
| 99 | - ], | |
| 100 | - "enriched_taxonomy_attributes": [ | |
| 101 | - { | |
| 102 | - "name": "Product Type", | |
| 103 | - "value": { | |
| 104 | - "zh": ["T恤"], | |
| 105 | - "en": ["t-shirt"] | |
| 106 | - } | |
| 107 | - } | |
| 108 | - ] | |
| 109 | -} | |
| 110 | -``` | |
| 111 | - | |
| 112 | -说明: | |
| 113 | - | |
| 114 | -- `generic` 部分固定输出核心索引语言 `zh`、`en` | |
| 115 | -- `taxonomy` 部分同样统一输出 `zh`、`en` | |
| 116 | - | |
| 117 | -## 5. Taxonomy profile | |
| 118 | - | |
| 119 | -当前支持: | |
| 120 | - | |
| 121 | -- `apparel` | |
| 122 | -- `3c` | |
| 123 | -- `bags` | |
| 124 | -- `pet_supplies` | |
| 125 | -- `electronics` | |
| 126 | -- `outdoor` | |
| 127 | -- `home_appliances` | |
| 128 | -- `home_living` | |
| 129 | -- `wigs` | |
| 130 | -- `beauty` | |
| 131 | -- `accessories` | |
| 132 | -- `toys` | |
| 133 | -- `shoes` | |
| 134 | -- `sports` | |
| 135 | -- `others` | |
| 136 | - | |
| 137 | -统一约束: | |
| 138 | - | |
| 139 | -- 所有 profile 都返回 `zh` + `en` | |
| 140 | -- profile 只决定 taxonomy 字段集合,不再决定输出语言 | |
| 141 | -- 所有 profile 都配置中英文字段名,prompt/header 结构保持一致 | |
| 142 | - | |
| 143 | -## 6. 内部索引链路的当前约束 | |
| 144 | - | |
| 145 | -在内部 ES 文档构建链路里,`document_transformer` 当前调用内容富化时,taxonomy profile 暂时固定传: | |
| 146 | - | |
| 147 | -```python | |
| 148 | -category_taxonomy_profile="apparel" | |
| 149 | -``` | |
| 150 | - | |
| 151 | -这是一种显式、可控、代码更干净的临时策略。 | |
| 152 | - | |
| 153 | -当前代码里已保留 TODO: | |
| 154 | - | |
| 155 | -- 后续从数据库读取租户真实所属行业 | |
| 156 | -- 再用该行业替换固定的 `apparel` | |
| 157 | - | |
| 158 | -当前不做“根据商品类目文本自动猜 profile”的隐式逻辑,避免增加冗余代码与不必要的不确定性。 | |
| 159 | - | |
| 160 | -## 7. 缓存与批处理 | |
| 161 | - | |
| 162 | -缓存键由以下信息共同决定: | |
| 163 | - | |
| 164 | -- `analysis_kind` | |
| 165 | -- `target_lang` | |
| 166 | -- prompt/schema 版本指纹 | |
| 167 | -- prompt 实际输入文本 | |
| 168 | - | |
| 169 | -批处理规则: | |
| 170 | - | |
| 171 | -- 单次 LLM 调用最多 20 条 | |
| 172 | -- 上层允许传更大批次,模块内部自动拆批 | |
| 173 | -- uncached batch 可并发执行 |
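The deleted module doc above states the batching rule: at most 20 items per LLM call, with oversized input from the caller split automatically. A generic sketch of that splitting step (function name and setup are assumptions, not the removed code):

```python
from typing import Iterator, List, TypeVar

T = TypeVar("T")


def split_batches(items: List[T], max_batch: int = 20) -> Iterator[List[T]]:
    # Yield chunks of at most `max_batch` items, preserving input order;
    # the removed module applied this before each LLM call.
    for start in range(0, len(items), max_batch):
        yield items[start:start + max_batch]


batches = list(split_batches(list(range(45))))
print([len(b) for b in batches])  # [20, 20, 5]
```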
indexer/spu_transformer.py
| ... | ... | @@ -220,7 +220,6 @@ class SPUTransformer: |
| 220 | 220 | logger.info(f"Grouped options into {len(option_groups)} SPU groups") |
| 221 | 221 | |
| 222 | 222 | documents: List[Dict[str, Any]] = [] |
| 223 | - doc_spu_rows: List[pd.Series] = [] | |
| 224 | 223 | skipped_count = 0 |
| 225 | 224 | error_count = 0 |
| 226 | 225 | |
| ... | ... | @@ -244,11 +243,9 @@ class SPUTransformer: |
| 244 | 243 | spu_row=spu_row, |
| 245 | 244 | skus=skus, |
| 246 | 245 | options=options, |
| 247 | - fill_llm_attributes=False, | |
| 248 | 246 | ) |
| 249 | 247 | if doc: |
| 250 | 248 | documents.append(doc) |
| 251 | - doc_spu_rows.append(spu_row) | |
| 252 | 249 | else: |
| 253 | 250 | skipped_count += 1 |
| 254 | 251 | logger.warning(f"SPU {spu_id} transformation returned None, skipped") |
| ... | ... | @@ -256,13 +253,6 @@ class SPUTransformer: |
| 256 | 253 | error_count += 1 |
| 257 | 254 | logger.error(f"Error transforming SPU {spu_id}: {e}", exc_info=True) |
| 258 | 255 | |
| 259 | - # 批量填充 LLM 字段(尽量攒批,每次最多 20 条;失败仅 warning,不影响主流程) | |
| 260 | - try: | |
| 261 | - if documents and doc_spu_rows: | |
| 262 | - self.document_transformer.fill_llm_attributes_batch(documents, doc_spu_rows) | |
| 263 | - except Exception as e: | |
| 264 | - logger.warning("Batch LLM fill failed in transform_batch: %s", e) | |
| 265 | - | |
| 266 | 256 | logger.info(f"Transformation complete:") |
| 267 | 257 | logger.info(f" - Total SPUs: {len(spu_df)}") |
| 268 | 258 | logger.info(f" - Successfully transformed: {len(documents)}") |
| ... | ... | @@ -270,5 +260,3 @@ class SPUTransformer: |
| 270 | 260 | logger.info(f" - Errors: {error_count}") |
| 271 | 261 | |
| 272 | 262 | return documents |
| 273 | - | |
| 274 | - | ... | ... |
scripts/debug/trace_indexer_calls.sh
| ... | ... | @@ -66,7 +66,7 @@ echo "" |
| 66 | 66 | echo " - Indexer 内部会调用:" |
| 67 | 67 | echo " - Text Embedding 服务 (${EMBEDDING_TEXT_PORT}): POST /embed/text" |
| 68 | 68 | echo " - Image Embedding 服务 (${EMBEDDING_IMAGE_PORT}): POST /embed/image" |
| 69 | -echo " - Qwen API: dashscope.aliyuncs.com (翻译、LLM 分析)" | |
| 69 | +echo " - Translation 服务 / 翻译后端(按当前配置)" | |
| 70 | 70 | echo " - MySQL: 商品数据" |
| 71 | 71 | echo " - Elasticsearch: 写入索引" |
| 72 | 72 | echo "" | ... | ... |
scripts/redis/redis_cache_health_check.py
| ... | ... | @@ -2,7 +2,7 @@ |
| 2 | 2 | """ |
| 3 | 3 | 缓存状态巡检脚本 |
| 4 | 4 | |
| 5 | -按「缓存类型」维度(embedding / translation / anchors)查看: | |
| 5 | +按「缓存类型」维度(embedding / translation)查看: | |
| 6 | 6 | - 估算 key 数量 |
| 7 | 7 | - TTL 分布(采样) |
| 8 | 8 | - 近期活跃 key(按 IDLETIME 近似) |
| ... | ... | @@ -10,12 +10,12 @@ |
| 10 | 10 | |
| 11 | 11 | 使用示例: |
| 12 | 12 | |
| 13 | - # 默认:检查已知三类缓存,使用 env_config 中的 Redis 配置 | |
| 13 | + # 默认:检查已知两类缓存,使用 env_config 中的 Redis 配置 | |
| 14 | 14 | python scripts/redis/redis_cache_health_check.py |
| 15 | 15 | |
| 16 | 16 | # 只看某一类缓存 |
| 17 | 17 | python scripts/redis/redis_cache_health_check.py --type embedding |
| 18 | - python scripts/redis/redis_cache_health_check.py --type translation anchors | |
| 18 | + python scripts/redis/redis_cache_health_check.py --type translation | |
| 19 | 19 | |
| 20 | 20 | # 自定义前缀(pattern),不限定缓存类型 |
| 21 | 21 | python scripts/redis/redis_cache_health_check.py --pattern "mycache:*" |
| ... | ... | @@ -27,7 +27,6 @@ |
| 27 | 27 | from __future__ import annotations |
| 28 | 28 | |
| 29 | 29 | import argparse |
| 30 | -import json | |
| 31 | 30 | import sys |
| 32 | 31 | from collections import defaultdict |
| 33 | 32 | from dataclasses import dataclass |
| ... | ... | @@ -54,7 +53,7 @@ class CacheTypeConfig: |
| 54 | 53 | |
| 55 | 54 | |
| 56 | 55 | def _load_known_cache_types() -> Dict[str, CacheTypeConfig]: |
| 57 | - """根据当前配置装配三种已知缓存类型及其前缀 pattern。""" | |
| 56 | + """根据当前配置装配仓库内仍在使用的缓存类型及其前缀 pattern。""" | |
| 58 | 57 | cache_types: Dict[str, CacheTypeConfig] = {} |
| 59 | 58 | |
| 60 | 59 | # embedding 缓存:prefix 来自 REDIS_CONFIG['embedding_cache_prefix'](默认 embedding) |
| ... | ... | @@ -72,14 +71,6 @@ def _load_known_cache_types() -> Dict[str, CacheTypeConfig]: |
| 72 | 71 | description="翻译结果缓存(translation/service.py)", |
| 73 | 72 | ) |
| 74 | 73 | |
| 75 | - # anchors 缓存:prefix 来自 REDIS_CONFIG['anchor_cache_prefix'](若存在),否则 product_anchors | |
| 76 | - anchor_prefix = REDIS_CONFIG.get("anchor_cache_prefix", "product_anchors") | |
| 77 | - cache_types["anchors"] = CacheTypeConfig( | |
| 78 | - name="anchors", | |
| 79 | - pattern=f"{anchor_prefix}:*", | |
| 80 | - description="商品内容理解缓存(indexer/product_enrich.py,anchors/语义属性/tags)", | |
| 81 | - ) | |
| 82 | - | |
| 83 | 74 | return cache_types |
| 84 | 75 | |
| 85 | 76 | |
| ... | ... | @@ -162,23 +153,6 @@ def decode_value_preview( |
| 162 | 153 | except Exception: |
| 163 | 154 | return f"<binary {len(raw_value)} bytes>" |
| 164 | 155 | |
| 165 | - # anchors: JSON dict | |
| 166 | - if cache_type == "anchors": | |
| 167 | - try: | |
| 168 | - text = raw_value.decode("utf-8", errors="replace") | |
| 169 | - obj = json.loads(text) | |
| 170 | - if isinstance(obj, dict): | |
| 171 | - brief = { | |
| 172 | - k: obj.get(k) | |
| 173 | - for k in ["id", "lang", "title_input", "title", "category_path", "anchor_text"] | |
| 174 | - if k in obj | |
| 175 | - } | |
| 176 | - return "json " + json.dumps(brief, ensure_ascii=False)[:200] | |
| 177 | - # 其他情况简单截断 | |
| 178 | - return "json " + text[:200] | |
| 179 | - except Exception: | |
| 180 | - return raw_value.decode("utf-8", errors="replace")[:200] | |
| 181 | - | |
| 182 | 156 | # translation: 纯字符串 |
| 183 | 157 | if cache_type == "translation": |
| 184 | 158 | try: |
| ... | ... | @@ -308,8 +282,8 @@ def main() -> None: |
| 308 | 282 | "--type", |
| 309 | 283 | dest="types", |
| 310 | 284 | nargs="+", |
| 311 | - choices=["embedding", "translation", "anchors"], | |
| 312 | - help="指定要检查的缓存类型(默认:三种全部)", | |
| 285 | + choices=["embedding", "translation"], | |
| 286 | + help="指定要检查的缓存类型(默认:两种全部)", | |
| 313 | 287 | ) |
| 314 | 288 | parser.add_argument( |
| 315 | 289 | "--pattern", | ... | ... |
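With the `anchors` cache type removed, the health-check script classifies keys against only the two remaining glob patterns. The sketch below emulates that classification with `fnmatch` (Redis `SCAN MATCH` uses the same glob style); the key samples and the `trans`/`embedding` prefixes here are assumptions based on the examples in these scripts, not the live config values:

```python
from fnmatch import fnmatchcase
from typing import List


def count_by_pattern(keys: List[str], pattern: str) -> int:
    # Glob-style match, mirroring Redis SCAN MATCH semantics
    return sum(1 for k in keys if fnmatchcase(k, pattern))


# Hypothetical key sample: only embedding/translation keys remain in scope
keys = ["embedding:abc", "trans:xyz", "product_anchors:1", "other:1"]

print(count_by_pattern(keys, "embedding:*"))        # 1
print(count_by_pattern(keys, "trans:*"))            # 1
print(count_by_pattern(keys, "product_anchors:*"))  # stale anchors keys, no longer inspected
```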
scripts/redis/redis_cache_prefix_stats.py
| ... | ... | @@ -15,7 +15,7 @@ python scripts/redis/redis_cache_prefix_stats.py --all-db |
| 15 | 15 | 统计指定数据库: |
| 16 | 16 | python scripts/redis/redis_cache_prefix_stats.py --db 1 |
| 17 | 17 | |
| 18 | -只统计以下三种前缀: | |
| 18 | +只统计若干常见前缀: | |
| 19 | 19 | python scripts/redis/redis_cache_prefix_stats.py --prefix trans embedding product |
| 20 | 20 | |
| 21 | 21 | 统计所有数据库的指定前缀: | ... | ... |
tests/ci/test_service_api_contracts.py
| ... | ... | @@ -342,162 +342,15 @@ def test_indexer_build_docs_from_db_contract(indexer_client: TestClient): |
| 342 | 342 | assert data["docs"][0]["spu_id"] == "1001" |
| 343 | 343 | |
| 344 | 344 | |
| 345 | -def test_indexer_enrich_content_contract(indexer_client: TestClient, monkeypatch): | |
| 346 | - import indexer.product_enrich as process_products | |
| 347 | - | |
| 348 | - def _fake_build_index_content_fields( | |
| 349 | - items: List[Dict[str, str]], | |
| 350 | - tenant_id: str | None = None, | |
| 351 | - enrichment_scopes: List[str] | None = None, | |
| 352 | - category_taxonomy_profile: str = "apparel", | |
| 353 | - ): | |
| 354 | - assert tenant_id == "162" | |
| 355 | - assert enrichment_scopes == ["generic", "category_taxonomy"] | |
| 356 | - assert category_taxonomy_profile == "apparel" | |
| 357 | - return [ | |
| 358 | - { | |
| 359 | - "id": p["spu_id"], | |
| 360 | - "qanchors": { | |
| 361 | - "zh": [f"zh-anchor-{p['spu_id']}"], | |
| 362 | - "en": [f"en-anchor-{p['spu_id']}"], | |
| 363 | - }, | |
| 364 | - "enriched_tags": {"zh": ["tag1", "tag2"], "en": ["tag1", "tag2"]}, | |
| 365 | - "enriched_attributes": [ | |
| 366 | - {"name": "enriched_tags", "value": {"zh": ["tag1"], "en": ["tag1"]}}, | |
| 367 | - ], | |
| 368 | - "enriched_taxonomy_attributes": [ | |
| 369 | - {"name": "Product Type", "value": {"zh": ["T恤"], "en": ["t-shirt"]}}, | |
| 370 | - ], | |
| 371 | - } | |
| 372 | - for p in items | |
| 373 | - ] | |
| 374 | - | |
| 375 | - monkeypatch.setattr(process_products, "build_index_content_fields", _fake_build_index_content_fields) | |
| 376 | - | |
| 377 | - response = indexer_client.post( | |
| 378 | - "/indexer/enrich-content", | |
| 379 | - json={ | |
| 380 | - "tenant_id": "162", | |
| 381 | - "enrichment_scopes": ["generic", "category_taxonomy"], | |
| 382 | - "category_taxonomy_profile": "apparel", | |
| 383 | - "items": [ | |
| 384 | - {"spu_id": "1001", "title": "T-shirt"}, | |
| 385 | - {"spu_id": "1002", "title": "Toy"}, | |
| 386 | - ], | |
| 387 | - }, | |
| 388 | - ) | |
| 389 | - assert response.status_code == 200 | |
| 390 | - data = response.json() | |
| 391 | - assert data["tenant_id"] == "162" | |
| 392 | - assert data["enrichment_scopes"] == ["generic", "category_taxonomy"] | |
| 393 | - assert data["category_taxonomy_profile"] == "apparel" | |
| 394 | - assert data["total"] == 2 | |
| 395 | - assert len(data["results"]) == 2 | |
| 396 | - assert data["results"][0]["spu_id"] == "1001" | |
| 397 | - assert data["results"][0]["qanchors"]["zh"] == ["zh-anchor-1001"] | |
| 398 | - assert data["results"][0]["qanchors"]["en"] == ["en-anchor-1001"] | |
| 399 | - assert data["results"][0]["enriched_tags"]["zh"] == ["tag1", "tag2"] | |
| 400 | - assert data["results"][0]["enriched_tags"]["en"] == ["tag1", "tag2"] | |
| 401 | - assert data["results"][0]["enriched_attributes"][0] == { | |
| 402 | - "name": "enriched_tags", | |
| 403 | - "value": {"zh": ["tag1"], "en": ["tag1"]}, | |
| 404 | - } | |
| 405 | - assert data["results"][0]["enriched_taxonomy_attributes"][0] == { | |
| 406 | - "name": "Product Type", | |
| 407 | - "value": {"zh": ["T恤"], "en": ["t-shirt"]}, | |
| 408 | - } | |
| 409 | - | |
| 410 | - | |
| 411 | -def test_indexer_enrich_content_contract_accepts_deprecated_analysis_kinds(indexer_client: TestClient, monkeypatch): | |
| 412 | - import indexer.product_enrich as process_products | |
| 413 | - | |
| 414 | - seen: Dict[str, Any] = {} | |
| 415 | - | |
| 416 | - def _fake_build_index_content_fields( | |
| 417 | - items: List[Dict[str, str]], | |
| 418 | - tenant_id: str | None = None, | |
| 419 | - enrichment_scopes: List[str] | None = None, | |
| 420 | - category_taxonomy_profile: str = "apparel", | |
| 421 | - ): | |
| 422 | - seen["tenant_id"] = tenant_id | |
| 423 | - seen["enrichment_scopes"] = enrichment_scopes | |
| 424 | - seen["category_taxonomy_profile"] = category_taxonomy_profile | |
| 425 | - return [ | |
| 426 | - { | |
| 427 | - "id": items[0]["spu_id"], | |
| 428 | - "qanchors": {}, | |
| 429 | - "enriched_tags": {}, | |
| 430 | - "enriched_attributes": [], | |
| 431 | - "enriched_taxonomy_attributes": [], | |
| 432 | - } | |
| 433 | - ] | |
| 434 | - | |
| 435 | - monkeypatch.setattr(process_products, "build_index_content_fields", _fake_build_index_content_fields) | |
| 436 | - | |
| 345 | +def test_indexer_enrich_content_route_removed(indexer_client: TestClient): | |
| 437 | 346 | response = indexer_client.post( |
| 438 | 347 | "/indexer/enrich-content", |
| 439 | 348 | json={ |
| 440 | 349 | "tenant_id": "162", |
| 441 | - "analysis_kinds": ["taxonomy"], | |
| 442 | 350 | "items": [{"spu_id": "1001", "title": "T-shirt"}], |
| 443 | 351 | }, |
| 444 | 352 | ) |
| 445 | - | |
| 446 | - assert response.status_code == 200 | |
| 447 | - data = response.json() | |
| 448 | - assert seen == { | |
| 449 | - "tenant_id": "162", | |
| 450 | - "enrichment_scopes": ["category_taxonomy"], | |
| 451 | - "category_taxonomy_profile": "apparel", | |
| 452 | - } | |
| 453 | - assert data["enrichment_scopes"] == ["category_taxonomy"] | |
| 454 | - assert data["category_taxonomy_profile"] == "apparel" | |
| 455 | - | |
| 456 | - | |
| 457 | -def test_indexer_enrich_content_contract_supports_non_apparel_taxonomy_profiles(indexer_client: TestClient, monkeypatch): | |
| 458 | - import indexer.product_enrich as process_products | |
| 459 | - | |
| 460 | - def _fake_build_index_content_fields( | |
| 461 | - items: List[Dict[str, str]], | |
| 462 | - tenant_id: str | None = None, | |
| 463 | - enrichment_scopes: List[str] | None = None, | |
| 464 | - category_taxonomy_profile: str = "apparel", | |
| 465 | - ): | |
| 466 | - assert tenant_id == "162" | |
| 467 | - assert enrichment_scopes == ["category_taxonomy"] | |
| 468 | - assert category_taxonomy_profile == "toys" | |
| 469 | - return [ | |
| 470 | - { | |
| 471 | - "id": items[0]["spu_id"], | |
| 472 | - "qanchors": {}, | |
| 473 | - "enriched_tags": {}, | |
| 474 | - "enriched_attributes": [], | |
| 475 | - "enriched_taxonomy_attributes": [ | |
| 476 | - {"name": "Product Type", "value": {"en": ["doll set"]}}, | |
| 477 | - {"name": "Age Group", "value": {"en": ["kids"]}}, | |
| 478 | - ], | |
| 479 | - } | |
| 480 | - ] | |
| 481 | - | |
| 482 | - monkeypatch.setattr(process_products, "build_index_content_fields", _fake_build_index_content_fields) | |
| 483 | - | |
| 484 | - response = indexer_client.post( | |
| 485 | - "/indexer/enrich-content", | |
| 486 | - json={ | |
| 487 | - "tenant_id": "162", | |
| 488 | - "enrichment_scopes": ["category_taxonomy"], | |
| 489 | - "category_taxonomy_profile": "toys", | |
| 490 | - "items": [{"spu_id": "1001", "title": "Toy"}], | |
| 491 | - }, | |
| 492 | - ) | |
| 493 | - | |
| 494 | - assert response.status_code == 200 | |
| 495 | - data = response.json() | |
| 496 | - assert data["category_taxonomy_profile"] == "toys" | |
| 497 | - assert data["results"][0]["enriched_taxonomy_attributes"] == [ | |
| 498 | - {"name": "Product Type", "value": {"en": ["doll set"]}}, | |
| 499 | - {"name": "Age Group", "value": {"en": ["kids"]}}, | |
| 500 | - ] | |
| 353 | + assert response.status_code == 404 | |
| 501 | 354 | |
| 502 | 355 | |
| 503 | 356 | def test_indexer_documents_contract(indexer_client: TestClient): |
| ... | ... | @@ -614,17 +467,6 @@ def test_indexer_build_docs_from_db_validation_max_spu_ids(indexer_client: TestC |
| 614 | 467 | assert response.status_code == 400 |
| 615 | 468 | |
| 616 | 469 | |
| 617 | -def test_indexer_enrich_content_validation_max_items(indexer_client: TestClient): | |
| 618 | - response = indexer_client.post( | |
| 619 | - "/indexer/enrich-content", | |
| 620 | - json={ | |
| 621 | - "tenant_id": "162", | |
| 622 | - "items": [{"spu_id": str(i), "title": "x"} for i in range(51)], | |
| 623 | - }, | |
| 624 | - ) | |
| 625 | - assert response.status_code == 400 | |
| 626 | - | |
| 627 | - | |
| 628 | 470 | def test_indexer_documents_validation_max_spu_ids(indexer_client: TestClient): |
| 629 | 471 | """POST /indexer/documents: 400 when spu_ids > 100.""" |
| 630 | 472 | response = indexer_client.post( |
tests/test_llm_enrichment_batch_fill.py deleted
| ... | ... | @@ -1,72 +0,0 @@ |
| 1 | -from __future__ import annotations | |
| 2 | - | |
| 3 | -from typing import Any, Dict, List | |
| 4 | - | |
| 5 | -import pandas as pd | |
| 6 | - | |
| 7 | -from indexer.document_transformer import SPUDocumentTransformer | |
| 8 | - | |
| 9 | - | |
| 10 | -def test_fill_llm_attributes_batch_uses_product_enrich_helper(monkeypatch): | |
| 11 | - seen_calls: List[Dict[str, Any]] = [] | |
| 12 | - | |
| 13 | - def _fake_build_index_content_fields(items, tenant_id=None, category_taxonomy_profile=None): | |
| 14 | - seen_calls.append( | |
| 15 | - { | |
| 16 | - "n": len(items), | |
| 17 | - "tenant_id": tenant_id, | |
| 18 | - "category_taxonomy_profile": category_taxonomy_profile, | |
| 19 | - } | |
| 20 | - ) | |
| 21 | - return [ | |
| 22 | - { | |
| 23 | - "id": item["id"], | |
| 24 | - "qanchors": { | |
| 25 | - "zh": [f"zh-anchor-{item['id']}"], | |
| 26 | - "en": [f"en-anchor-{item['id']}"], | |
| 27 | - }, | |
| 28 | - "enriched_tags": {"zh": ["t1", "t2"], "en": ["t1", "t2"]}, | |
| 29 | - "enriched_attributes": [ | |
| 30 | - {"name": "tags", "value": {"zh": ["t1"], "en": ["t1"]}}, | |
| 31 | - ], | |
| 32 | - "enriched_taxonomy_attributes": [ | |
| 33 | - {"name": "Product Type", "value": {"zh": ["连衣裙"], "en": ["dress"]}}, | |
| 34 | - ], | |
| 35 | - } | |
| 36 | - for item in items | |
| 37 | - ] | |
| 38 | - | |
| 39 | - import indexer.document_transformer as doc_tr | |
| 40 | - | |
| 41 | - monkeypatch.setattr(doc_tr, "build_index_content_fields", _fake_build_index_content_fields) | |
| 42 | - | |
| 43 | - transformer = SPUDocumentTransformer( | |
| 44 | - category_id_to_name={}, | |
| 45 | - searchable_option_dimensions=[], | |
| 46 | - tenant_config={"index_languages": ["zh", "en"], "primary_language": "zh"}, | |
| 47 | - translator=None, | |
| 48 | - encoder=None, | |
| 49 | - enable_title_embedding=False, | |
| 50 | - image_encoder=None, | |
| 51 | - enable_image_embedding=False, | |
| 52 | - ) | |
| 53 | - | |
| 54 | - docs: List[Dict[str, Any]] = [] | |
| 55 | - rows: List[pd.Series] = [] | |
| 56 | - for i in range(45): | |
| 57 | - docs.append({"tenant_id": "162", "spu_id": str(i)}) | |
| 58 | - rows.append(pd.Series({"id": i, "title": f"title-{i}"})) | |
| 59 | - | |
| 60 | - transformer.fill_llm_attributes_batch(docs, rows) | |
| 61 | - | |
| 62 | - assert seen_calls == [{"n": 45, "tenant_id": "162", "category_taxonomy_profile": "apparel"}] | |
| 63 | - | |
| 64 | - assert docs[0]["qanchors"]["zh"] == ["zh-anchor-0"] | |
| 65 | - assert docs[0]["qanchors"]["en"] == ["en-anchor-0"] | |
| 66 | - assert docs[0]["enriched_tags"]["zh"] == ["t1", "t2"] | |
| 67 | - assert docs[0]["enriched_tags"]["en"] == ["t1", "t2"] | |
| 68 | - assert {"name": "tags", "value": {"zh": ["t1"], "en": ["t1"]}} in docs[0]["enriched_attributes"] | |
| 69 | - assert { | |
| 70 | - "name": "Product Type", | |
| 71 | - "value": {"zh": ["连衣裙"], "en": ["dress"]}, | |
| 72 | - } in docs[0]["enriched_taxonomy_attributes"] |
tests/test_process_products_batching.py deleted
| ... | ... | @@ -1,104 +0,0 @@ |
| 1 | -from __future__ import annotations | |
| 2 | - | |
| 3 | -from typing import Any, Dict, List | |
| 4 | - | |
| 5 | -import indexer.product_enrich as process_products | |
| 6 | - | |
| 7 | - | |
| 8 | -def _mk_products(n: int) -> List[Dict[str, str]]: | |
| 9 | - return [{"id": str(i), "title": f"title-{i}"} for i in range(n)] | |
| 10 | - | |
| 11 | - | |
| 12 | -def test_analyze_products_caps_batch_size_to_20(monkeypatch): | |
| 13 | - monkeypatch.setattr(process_products, "API_KEY", "fake-key") | |
| 14 | - seen_batch_sizes: List[int] = [] | |
| 15 | - | |
| 16 | - def _fake_process_batch( | |
| 17 | - batch_data: List[Dict[str, str]], | |
| 18 | - batch_num: int, | |
| 19 | - target_lang: str = "zh", | |
| 20 | - analysis_kind: str = "content", | |
| 21 | - category_taxonomy_profile=None, | |
| 22 | - ): | |
| 23 | - assert analysis_kind == "content" | |
| 24 | - assert category_taxonomy_profile is None | |
| 25 | - seen_batch_sizes.append(len(batch_data)) | |
| 26 | - return [ | |
| 27 | - { | |
| 28 | - "id": item["id"], | |
| 29 | - "lang": target_lang, | |
| 30 | - "title_input": item["title"], | |
| 31 | - "title": "", | |
| 32 | - "category_path": "", | |
| 33 | - "tags": "", | |
| 34 | - "target_audience": "", | |
| 35 | - "usage_scene": "", | |
| 36 | - "season": "", | |
| 37 | - "key_attributes": "", | |
| 38 | - "material": "", | |
| 39 | - "features": "", | |
| 40 | - "anchor_text": "", | |
| 41 | - } | |
| 42 | - for item in batch_data | |
| 43 | - ] | |
| 44 | - | |
| 45 | - monkeypatch.setattr(process_products, "process_batch", _fake_process_batch) | |
| 46 | - monkeypatch.setattr(process_products, "_set_cached_analysis_result", lambda *args, **kwargs: None) | |
| 47 | - | |
| 48 | - out = process_products.analyze_products( | |
| 49 | - products=_mk_products(45), | |
| 50 | - target_lang="zh", | |
| 51 | - batch_size=200, | |
| 52 | - tenant_id="162", | |
| 53 | - ) | |
| 54 | - | |
| 55 | - assert len(out) == 45 | |
| 56 | -    # Batch call order may vary under concurrent execution, so assert on the multiset of batch sizes rather than strict order | |
| 57 | - assert sorted(seen_batch_sizes) == [5, 20, 20] | |
| 58 | - | |
| 59 | - | |
| 60 | -def test_analyze_products_uses_min_batch_size_1(monkeypatch): | |
| 61 | - monkeypatch.setattr(process_products, "API_KEY", "fake-key") | |
| 62 | - seen_batch_sizes: List[int] = [] | |
| 63 | - | |
| 64 | - def _fake_process_batch( | |
| 65 | - batch_data: List[Dict[str, str]], | |
| 66 | - batch_num: int, | |
| 67 | - target_lang: str = "zh", | |
| 68 | - analysis_kind: str = "content", | |
| 69 | - category_taxonomy_profile=None, | |
| 70 | - ): | |
| 71 | - assert analysis_kind == "content" | |
| 72 | - assert category_taxonomy_profile is None | |
| 73 | - seen_batch_sizes.append(len(batch_data)) | |
| 74 | - return [ | |
| 75 | - { | |
| 76 | - "id": item["id"], | |
| 77 | - "lang": target_lang, | |
| 78 | - "title_input": item["title"], | |
| 79 | - "title": "", | |
| 80 | - "category_path": "", | |
| 81 | - "tags": "", | |
| 82 | - "target_audience": "", | |
| 83 | - "usage_scene": "", | |
| 84 | - "season": "", | |
| 85 | - "key_attributes": "", | |
| 86 | - "material": "", | |
| 87 | - "features": "", | |
| 88 | - "anchor_text": "", | |
| 89 | - } | |
| 90 | - for item in batch_data | |
| 91 | - ] | |
| 92 | - | |
| 93 | - monkeypatch.setattr(process_products, "process_batch", _fake_process_batch) | |
| 94 | - monkeypatch.setattr(process_products, "_set_cached_analysis_result", lambda *args, **kwargs: None) | |
| 95 | - | |
| 96 | - out = process_products.analyze_products( | |
| 97 | - products=_mk_products(3), | |
| 98 | - target_lang="zh", | |
| 99 | - batch_size=0, | |
| 100 | - tenant_id="162", | |
| 101 | - ) | |
| 102 | - | |
| 103 | - assert len(out) == 3 | |
| 104 | - assert seen_batch_sizes == [1, 1, 1] |
tests/test_product_enrich_partial_mode.py deleted
| ... | ... | @@ -1,736 +0,0 @@ |
| 1 | -from __future__ import annotations | |
| 2 | - | |
| 3 | -import importlib.util | |
| 4 | -import io | |
| 5 | -import json | |
| 6 | -import logging | |
| 7 | -import sys | |
| 8 | -import types | |
| 9 | -from pathlib import Path | |
| 10 | -from unittest import mock | |
| 11 | - | |
| 12 | - | |
| 13 | -def _load_product_enrich_module(): | |
| 14 | - if "dotenv" not in sys.modules: | |
| 15 | - fake_dotenv = types.ModuleType("dotenv") | |
| 16 | - fake_dotenv.load_dotenv = lambda *args, **kwargs: None | |
| 17 | - sys.modules["dotenv"] = fake_dotenv | |
| 18 | - | |
| 19 | - if "redis" not in sys.modules: | |
| 20 | - fake_redis = types.ModuleType("redis") | |
| 21 | - | |
| 22 | - class _FakeRedisClient: | |
| 23 | - def __init__(self, *args, **kwargs): | |
| 24 | - pass | |
| 25 | - | |
| 26 | - def ping(self): | |
| 27 | - return True | |
| 28 | - | |
| 29 | - fake_redis.Redis = _FakeRedisClient | |
| 30 | - sys.modules["redis"] = fake_redis | |
| 31 | - | |
| 32 | - repo_root = Path(__file__).resolve().parents[1] | |
| 33 | - if str(repo_root) not in sys.path: | |
| 34 | - sys.path.insert(0, str(repo_root)) | |
| 35 | - | |
| 36 | - module_path = repo_root / "indexer" / "product_enrich.py" | |
| 37 | - spec = importlib.util.spec_from_file_location("product_enrich_under_test", module_path) | |
| 38 | - module = importlib.util.module_from_spec(spec) | |
| 39 | - assert spec and spec.loader | |
| 40 | - spec.loader.exec_module(module) | |
| 41 | - return module | |
| 42 | - | |
| 43 | - | |
| 44 | -product_enrich = _load_product_enrich_module() | |
| 45 | - | |
| 46 | - | |
| 47 | -def _attach_stream(logger_obj: logging.Logger): | |
| 48 | - stream = io.StringIO() | |
| 49 | - handler = logging.StreamHandler(stream) | |
| 50 | - handler.setFormatter(logging.Formatter("%(message)s")) | |
| 51 | - logger_obj.addHandler(handler) | |
| 52 | - return stream, handler | |
| 53 | - | |
| 54 | - | |
| 55 | -def test_create_prompt_splits_shared_context_and_localized_tail(): | |
| 56 | - products = [ | |
| 57 | - {"id": "1", "title": "dress"}, | |
| 58 | - {"id": "2", "title": "linen shirt"}, | |
| 59 | - ] | |
| 60 | - | |
| 61 | - shared_zh, user_zh, prefix_zh = product_enrich.create_prompt(products, target_lang="zh") | |
| 62 | - shared_en, user_en, prefix_en = product_enrich.create_prompt(products, target_lang="en") | |
| 63 | - | |
| 64 | - assert shared_zh == shared_en | |
| 65 | - assert "Analyze each input product text" in shared_zh | |
| 66 | - assert "1. dress" in shared_zh | |
| 67 | - assert "2. linen shirt" in shared_zh | |
| 68 | - assert "Product list" not in user_zh | |
| 69 | - assert "Product list" not in user_en | |
| 70 | - assert "specified language" in user_zh | |
| 71 | - assert "Language: Chinese" in user_zh | |
| 72 | - assert "Language: English" in user_en | |
| 73 | - assert prefix_zh.startswith("| 序号 | 商品标题 | 品类路径 |") | |
| 74 | - assert prefix_en.startswith("| No. | Product title | Category path |") | |
| 75 | - | |
| 76 | - | |
| 77 | -def test_create_prompt_supports_taxonomy_analysis_kind(): | |
| 78 | - products = [{"id": "1", "title": "linen dress"}] | |
| 79 | - | |
| 80 | - shared_zh, user_zh, prefix_zh = product_enrich.create_prompt( | |
| 81 | - products, | |
| 82 | - target_lang="zh", | |
| 83 | - analysis_kind="taxonomy", | |
| 84 | - ) | |
| 85 | - shared_fr, user_fr, prefix_fr = product_enrich.create_prompt( | |
| 86 | - products, | |
| 87 | - target_lang="fr", | |
| 88 | - analysis_kind="taxonomy", | |
| 89 | - ) | |
| 90 | - | |
| 91 | - assert "apparel attribute taxonomy" in shared_zh | |
| 92 | - assert "1. linen dress" in shared_zh | |
| 93 | - assert "Language: Chinese" in user_zh | |
| 94 | - assert "Language: French" in user_fr | |
| 95 | - assert prefix_zh.startswith("| 序号 | 品类 | 目标性别 |") | |
| 96 | - assert prefix_fr.startswith("| No. | Product Type | Target Gender |") | |
| 97 | - | |
| 98 | - | |
| 99 | -def test_call_llm_logs_shared_context_once_and_verbose_contains_full_requests(): | |
| 100 | - payloads = [] | |
| 101 | - response_bodies = [ | |
| 102 | - { | |
| 103 | - "choices": [ | |
| 104 | - { | |
| 105 | - "message": { | |
| 106 | - "content": ( | |
| 107 | - "| 1 | 连衣裙 | 女装>连衣裙 | 法式,收腰 | 年轻女性 | " | |
| 108 | - "通勤,约会 | 春季,夏季 | 中长款 | 聚酯纤维 | 透气 | " | |
| 109 | - "修身显瘦 | 法式收腰连衣裙 |\n" | |
| 110 | - ) | |
| 111 | - } | |
| 112 | - } | |
| 113 | - ], | |
| 114 | - "usage": {"prompt_tokens": 120, "completion_tokens": 45, "total_tokens": 165}, | |
| 115 | - }, | |
| 116 | - { | |
| 117 | - "choices": [ | |
| 118 | - { | |
| 119 | - "message": { | |
| 120 | - "content": ( | |
| 121 | - "| 1 | Dress | Women>Dress | French,Waisted | Young women | " | |
| 122 | - "Commute,Date | Spring,Summer | Midi | Polyester | Breathable | " | |
| 123 | - "Slim fit | French waisted dress |\n" | |
| 124 | - ) | |
| 125 | - } | |
| 126 | - } | |
| 127 | - ], | |
| 128 | - "usage": {"prompt_tokens": 118, "completion_tokens": 43, "total_tokens": 161}, | |
| 129 | - }, | |
| 130 | - ] | |
| 131 | - | |
| 132 | - class _FakeResponse: | |
| 133 | - def __init__(self, body): | |
| 134 | - self.body = body | |
| 135 | - | |
| 136 | - def raise_for_status(self): | |
| 137 | - return None | |
| 138 | - | |
| 139 | - def json(self): | |
| 140 | - return self.body | |
| 141 | - | |
| 142 | - class _FakeSession: | |
| 143 | - trust_env = True | |
| 144 | - | |
| 145 | - def post(self, url, headers=None, json=None, timeout=None, proxies=None): | |
| 146 | - del url, headers, timeout, proxies | |
| 147 | - payloads.append(json) | |
| 148 | - return _FakeResponse(response_bodies[len(payloads) - 1]) | |
| 149 | - | |
| 150 | - def close(self): | |
| 151 | - return None | |
| 152 | - | |
| 153 | - product_enrich.reset_logged_shared_context_keys() | |
| 154 | - main_stream, main_handler = _attach_stream(product_enrich.logger) | |
| 155 | - verbose_stream, verbose_handler = _attach_stream(product_enrich.verbose_logger) | |
| 156 | - | |
| 157 | - try: | |
| 158 | - with mock.patch.object(product_enrich, "API_KEY", "fake-key"), mock.patch.object( | |
| 159 | - product_enrich.requests, | |
| 160 | - "Session", | |
| 161 | - lambda: _FakeSession(), | |
| 162 | - ): | |
| 163 | - zh_shared, zh_user, zh_prefix = product_enrich.create_prompt( | |
| 164 | - [{"id": "1", "title": "dress"}], | |
| 165 | - target_lang="zh", | |
| 166 | - ) | |
| 167 | - en_shared, en_user, en_prefix = product_enrich.create_prompt( | |
| 168 | - [{"id": "1", "title": "dress"}], | |
| 169 | - target_lang="en", | |
| 170 | - ) | |
| 171 | - | |
| 172 | - zh_markdown, zh_raw = product_enrich.call_llm( | |
| 173 | - zh_shared, | |
| 174 | - zh_user, | |
| 175 | - zh_prefix, | |
| 176 | - target_lang="zh", | |
| 177 | - ) | |
| 178 | - en_markdown, en_raw = product_enrich.call_llm( | |
| 179 | - en_shared, | |
| 180 | - en_user, | |
| 181 | - en_prefix, | |
| 182 | - target_lang="en", | |
| 183 | - ) | |
| 184 | - finally: | |
| 185 | - product_enrich.logger.removeHandler(main_handler) | |
| 186 | - product_enrich.verbose_logger.removeHandler(verbose_handler) | |
| 187 | - | |
| 188 | - assert zh_shared == en_shared | |
| 189 | - assert len(payloads) == 2 | |
| 190 | - assert len(payloads[0]["messages"]) == 3 | |
| 191 | - assert payloads[0]["messages"][1]["role"] == "user" | |
| 192 | - assert "1. dress" in payloads[0]["messages"][1]["content"] | |
| 193 | - assert "Language: Chinese" in payloads[0]["messages"][1]["content"] | |
| 194 | - assert "Language: English" in payloads[1]["messages"][1]["content"] | |
| 195 | - assert payloads[0]["messages"][-1]["partial"] is True | |
| 196 | - assert payloads[1]["messages"][-1]["partial"] is True | |
| 197 | - | |
| 198 | - main_log = main_stream.getvalue() | |
| 199 | - verbose_log = verbose_stream.getvalue() | |
| 200 | - | |
| 201 | - assert main_log.count("LLM Shared Context") == 1 | |
| 202 | - assert main_log.count("LLM Request Variant") == 2 | |
| 203 | - assert "Localized Requirement" in main_log | |
| 204 | - assert "Shared Context" in main_log | |
| 205 | - | |
| 206 | - assert verbose_log.count("LLM Request [model=") == 2 | |
| 207 | - assert verbose_log.count("LLM Response [model=") == 2 | |
| 208 | - assert '"partial": true' in verbose_log | |
| 209 | - assert "Combined User Prompt" in verbose_log | |
| 210 | - assert "French waisted dress" in verbose_log | |
| 211 | - assert "法式收腰连衣裙" in verbose_log | |
| 212 | - | |
| 213 | - assert zh_markdown.startswith(zh_prefix) | |
| 214 | - assert en_markdown.startswith(en_prefix) | |
| 215 | - assert json.loads(zh_raw)["usage"]["total_tokens"] == 165 | |
| 216 | - assert json.loads(en_raw)["usage"]["total_tokens"] == 161 | |
| 217 | - | |
| 218 | - | |
| 219 | -def test_process_batch_reads_result_and_validates_expected_fields(): | |
| 220 | - merged_markdown = """| 序号 | 商品标题 | 品类路径 | 细分标签 | 适用人群 | 使用场景 | 适用季节 | 关键属性 | 材质说明 | 功能特点 | 锚文本 | | |
| 221 | -|----|----|----|----|----|----|----|----|----|----|----| | |
| 222 | -| 1 | 法式连衣裙 | 女装>连衣裙 | 法式,收腰 | 年轻女性 | 通勤,约会 | 春季,夏季 | 中长款 | 聚酯纤维 | 透气 | 法式收腰连衣裙 | | |
| 223 | -""" | |
| 224 | - | |
| 225 | - with mock.patch.object( | |
| 226 | - product_enrich, | |
| 227 | - "call_llm", | |
| 228 | - return_value=(merged_markdown, json.dumps({"choices": [{"message": {"content": "stub"}}]})), | |
| 229 | - ): | |
| 230 | - results = product_enrich.process_batch( | |
| 231 | - [{"id": "sku-1", "title": "dress"}], | |
| 232 | - batch_num=1, | |
| 233 | - target_lang="zh", | |
| 234 | - ) | |
| 235 | - | |
| 236 | - assert len(results) == 1 | |
| 237 | - row = results[0] | |
| 238 | - assert row["id"] == "sku-1" | |
| 239 | - assert row["lang"] == "zh" | |
| 240 | - assert row["title_input"] == "dress" | |
| 241 | - assert row["title"] == "法式连衣裙" | |
| 242 | - assert row["category_path"] == "女装>连衣裙" | |
| 243 | - assert row["tags"] == "法式,收腰" | |
| 244 | - assert row["target_audience"] == "年轻女性" | |
| 245 | - assert row["usage_scene"] == "通勤,约会" | |
| 246 | - assert row["season"] == "春季,夏季" | |
| 247 | - assert row["key_attributes"] == "中长款" | |
| 248 | - assert row["material"] == "聚酯纤维" | |
| 249 | - assert row["features"] == "透气" | |
| 250 | - assert row["anchor_text"] == "法式收腰连衣裙" | |
| 251 | - | |
| 252 | - | |
| 253 | -def test_process_batch_reads_taxonomy_result_with_schema_specific_fields(): | |
| 254 | - merged_markdown = """| 序号 | 品类 | 目标性别 | 年龄段 | 适用季节 | 版型 | 廓形 | 领型 | 袖长类型 | 袖型 | 肩带设计 | 腰型 | 裤型 | 裙型 | 长度类型 | 闭合方式 | 设计细节 | 面料 | 成分 | 面料特性 | 服装特征 | 功能 | 主颜色 | 色系 | 印花 / 图案 | 适用场景 | 风格 | | |
| 255 | -|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----| | |
| 256 | -| 1 | 连衣裙 | 女 | 成人 | 春季,夏季 | 修身 | A字 | V领 | 无袖 | | 细肩带 | 高腰 | | A字裙 | 中长款 | 拉链 | 褶皱 | 梭织 | 聚酯纤维,氨纶 | 轻薄,透气 | 有内衬 | 易打理 | 酒红色 | 红色 | 纯色 | 约会,度假 | 浪漫 | | |
| 257 | -""" | |
| 258 | - | |
| 259 | - with mock.patch.object( | |
| 260 | - product_enrich, | |
| 261 | - "call_llm", | |
| 262 | - return_value=(merged_markdown, json.dumps({"choices": [{"message": {"content": "stub"}}]})), | |
| 263 | - ): | |
| 264 | - results = product_enrich.process_batch( | |
| 265 | - [{"id": "sku-1", "title": "dress"}], | |
| 266 | - batch_num=1, | |
| 267 | - target_lang="zh", | |
| 268 | - analysis_kind="taxonomy", | |
| 269 | - ) | |
| 270 | - | |
| 271 | - assert len(results) == 1 | |
| 272 | - row = results[0] | |
| 273 | - assert row["id"] == "sku-1" | |
| 274 | - assert row["lang"] == "zh" | |
| 275 | - assert row["title_input"] == "dress" | |
| 276 | - assert row["product_type"] == "连衣裙" | |
| 277 | - assert row["target_gender"] == "女" | |
| 278 | - assert row["age_group"] == "成人" | |
| 279 | - assert row["sleeve_length_type"] == "无袖" | |
| 280 | - assert row["material_composition"] == "聚酯纤维,氨纶" | |
| 281 | - assert row["occasion_end_use"] == "约会,度假" | |
| 282 | - assert row["style_aesthetic"] == "浪漫" | |
| 283 | - | |
| 284 | - | |
| 285 | -def test_analyze_products_uses_product_level_cache_across_batch_requests(): | |
| 286 | - cache_store = {} | |
| 287 | - process_calls = [] | |
| 288 | - | |
| 289 | - def _cache_key(product, target_lang): | |
| 290 | - return ( | |
| 291 | - target_lang, | |
| 292 | - product.get("title", ""), | |
| 293 | - product.get("brief", ""), | |
| 294 | - product.get("description", ""), | |
| 295 | - product.get("image_url", ""), | |
| 296 | - ) | |
| 297 | - | |
| 298 | - def fake_get_cached_analysis_result( | |
| 299 | - product, | |
| 300 | - target_lang, | |
| 301 | - analysis_kind="content", | |
| 302 | - category_taxonomy_profile=None, | |
| 303 | - ): | |
| 304 | - assert analysis_kind == "content" | |
| 305 | - assert category_taxonomy_profile is None | |
| 306 | - return cache_store.get(_cache_key(product, target_lang)) | |
| 307 | - | |
| 308 | - def fake_set_cached_analysis_result( | |
| 309 | - product, | |
| 310 | - target_lang, | |
| 311 | - result, | |
| 312 | - analysis_kind="content", | |
| 313 | - category_taxonomy_profile=None, | |
| 314 | - ): | |
| 315 | - assert analysis_kind == "content" | |
| 316 | - assert category_taxonomy_profile is None | |
| 317 | - cache_store[_cache_key(product, target_lang)] = result | |
| 318 | - | |
| 319 | - def fake_process_batch( | |
| 320 | - batch_data, | |
| 321 | - batch_num, | |
| 322 | - target_lang="zh", | |
| 323 | - analysis_kind="content", | |
| 324 | - category_taxonomy_profile=None, | |
| 325 | - ): | |
| 326 | - assert analysis_kind == "content" | |
| 327 | - assert category_taxonomy_profile is None | |
| 328 | - process_calls.append( | |
| 329 | - { | |
| 330 | - "batch_num": batch_num, | |
| 331 | - "target_lang": target_lang, | |
| 332 | - "titles": [item["title"] for item in batch_data], | |
| 333 | - } | |
| 334 | - ) | |
| 335 | - return [ | |
| 336 | - { | |
| 337 | - "id": item["id"], | |
| 338 | - "lang": target_lang, | |
| 339 | - "title_input": item["title"], | |
| 340 | - "title": f"normalized:{item['title']}", | |
| 341 | - "category_path": "cat", | |
| 342 | - "tags": "tags", | |
| 343 | - "target_audience": "audience", | |
| 344 | - "usage_scene": "scene", | |
| 345 | - "season": "season", | |
| 346 | - "key_attributes": "attrs", | |
| 347 | - "material": "material", | |
| 348 | - "features": "features", | |
| 349 | - "anchor_text": f"anchor:{item['title']}", | |
| 350 | - } | |
| 351 | - for item in batch_data | |
| 352 | - ] | |
| 353 | - | |
| 354 | - products = [ | |
| 355 | - {"id": "1", "title": "dress"}, | |
| 356 | - {"id": "2", "title": "shirt"}, | |
| 357 | - ] | |
| 358 | - | |
| 359 | - with mock.patch.object(product_enrich, "API_KEY", "fake-key"), mock.patch.object( | |
| 360 | - product_enrich, | |
| 361 | - "_get_cached_analysis_result", | |
| 362 | - side_effect=fake_get_cached_analysis_result, | |
| 363 | - ), mock.patch.object( | |
| 364 | - product_enrich, | |
| 365 | - "_set_cached_analysis_result", | |
| 366 | - side_effect=fake_set_cached_analysis_result, | |
| 367 | - ), mock.patch.object( | |
| 368 | - product_enrich, | |
| 369 | - "process_batch", | |
| 370 | - side_effect=fake_process_batch, | |
| 371 | - ): | |
| 372 | - first = product_enrich.analyze_products( | |
| 373 | - [products[0]], | |
| 374 | - target_lang="zh", | |
| 375 | - tenant_id="170", | |
| 376 | - ) | |
| 377 | - second = product_enrich.analyze_products( | |
| 378 | - products, | |
| 379 | - target_lang="zh", | |
| 380 | - tenant_id="999", | |
| 381 | - ) | |
| 382 | - third = product_enrich.analyze_products( | |
| 383 | - products, | |
| 384 | - target_lang="zh", | |
| 385 | - tenant_id="170", | |
| 386 | - ) | |
| 387 | - | |
| 388 | - assert [row["title_input"] for row in first] == ["dress"] | |
| 389 | - assert [row["title_input"] for row in second] == ["dress", "shirt"] | |
| 390 | - assert [row["title_input"] for row in third] == ["dress", "shirt"] | |
| 391 | - | |
| 392 | - assert process_calls == [ | |
| 393 | - {"batch_num": 1, "target_lang": "zh", "titles": ["dress"]}, | |
| 394 | - {"batch_num": 1, "target_lang": "zh", "titles": ["shirt"]}, | |
| 395 | - ] | |
| 396 | - assert second[0]["anchor_text"] == "anchor:dress" | |
| 397 | - assert second[1]["anchor_text"] == "anchor:shirt" | |
| 398 | - assert third[0]["anchor_text"] == "anchor:dress" | |
| 399 | - assert third[1]["anchor_text"] == "anchor:shirt" | |
| 400 | - | |
| 401 | - | |
| 402 | -def test_analyze_products_reuses_cached_content_with_current_product_identity(): | |
| 403 | - cached_result = { | |
| 404 | - "id": "1165", | |
| 405 | - "lang": "zh", | |
| 406 | - "title_input": "old-title", | |
| 407 | - "title": "法式连衣裙", | |
| 408 | - "category_path": "女装>连衣裙", | |
| 409 | - "enriched_tags": "法式,收腰", | |
| 410 | - "target_audience": "年轻女性", | |
| 411 | - "usage_scene": "通勤,约会", | |
| 412 | - "season": "春季,夏季", | |
| 413 | - "key_attributes": "中长款", | |
| 414 | - "material": "聚酯纤维", | |
| 415 | - "features": "透气", | |
| 416 | - "anchor_text": "法式收腰连衣裙", | |
| 417 | - } | |
| 418 | - products = [{"id": "69960", "title": "dress"}] | |
| 419 | - | |
| 420 | - with mock.patch.object(product_enrich, "API_KEY", "fake-key"), mock.patch.object( | |
| 421 | - product_enrich, | |
| 422 | - "_get_cached_analysis_result", | |
| 423 | - wraps=lambda product, target_lang, analysis_kind="content", category_taxonomy_profile=None: product_enrich._normalize_analysis_result( | |
| 424 | - cached_result, | |
| 425 | - product=product, | |
| 426 | - target_lang=target_lang, | |
| 427 | - schema=product_enrich._get_analysis_schema("content"), | |
| 428 | - ), | |
| 429 | - ), mock.patch.object( | |
| 430 | - product_enrich, | |
| 431 | - "process_batch", | |
| 432 | - side_effect=AssertionError("process_batch should not be called on cache hit"), | |
| 433 | - ): | |
| 434 | - result = product_enrich.analyze_products( | |
| 435 | - products, | |
| 436 | - target_lang="zh", | |
| 437 | - tenant_id="170", | |
| 438 | - ) | |
| 439 | - | |
| 440 | - assert result == [ | |
| 441 | - { | |
| 442 | - "id": "69960", | |
| 443 | - "lang": "zh", | |
| 444 | - "title_input": "dress", | |
| 445 | - "title": "法式连衣裙", | |
| 446 | - "category_path": "女装>连衣裙", | |
| 447 | - "tags": "法式,收腰", | |
| 448 | - "target_audience": "年轻女性", | |
| 449 | - "usage_scene": "通勤,约会", | |
| 450 | - "season": "春季,夏季", | |
| 451 | - "key_attributes": "中长款", | |
| 452 | - "material": "聚酯纤维", | |
| 453 | - "features": "透气", | |
| 454 | - "anchor_text": "法式收腰连衣裙", | |
| 455 | - } | |
| 456 | - ] | |
| 457 | - | |
| 458 | - | |
| 459 | -def test_build_index_content_fields_maps_internal_tags_to_enriched_tags_output(): | |
| 460 | - def fake_analyze_products( | |
| 461 | - products, | |
| 462 | - target_lang="zh", | |
| 463 | - batch_size=None, | |
| 464 | - tenant_id=None, | |
| 465 | - analysis_kind="content", | |
| 466 | - category_taxonomy_profile=None, | |
| 467 | - ): | |
| 468 | - if analysis_kind == "taxonomy": | |
| 469 | - assert category_taxonomy_profile == "apparel" | |
| 470 | - return [ | |
| 471 | - { | |
| 472 | - "id": products[0]["id"], | |
| 473 | - "lang": target_lang, | |
| 474 | - "title_input": products[0]["title"], | |
| 475 | - "product_type": f"{target_lang}-dress", | |
| 476 | - "target_gender": f"{target_lang}-women", | |
| 477 | - "age_group": "", | |
| 478 | - "season": f"{target_lang}-summer", | |
| 479 | - "fit": "", | |
| 480 | - "silhouette": "", | |
| 481 | - "neckline": "", | |
| 482 | - "sleeve_length_type": "", | |
| 483 | - "sleeve_style": "", | |
| 484 | - "strap_type": "", | |
| 485 | - "rise_waistline": "", | |
| 486 | - "leg_shape": "", | |
| 487 | - "skirt_shape": "", | |
| 488 | - "length_type": "", | |
| 489 | - "closure_type": "", | |
| 490 | - "design_details": "", | |
| 491 | - "fabric": "", | |
| 492 | - "material_composition": "", | |
| 493 | - "fabric_properties": "", | |
| 494 | - "clothing_features": "", | |
| 495 | - "functional_benefits": "", | |
| 496 | - "color": "", | |
| 497 | - "color_family": "", | |
| 498 | - "print_pattern": "", | |
| 499 | - "occasion_end_use": "", | |
| 500 | - "style_aesthetic": "", | |
| 501 | - } | |
| 502 | - ] | |
| 503 | - return [ | |
| 504 | - { | |
| 505 | - "id": products[0]["id"], | |
| 506 | - "lang": target_lang, | |
| 507 | - "title_input": products[0]["title"], | |
| 508 | - "title": products[0]["title"], | |
| 509 | - "category_path": "玩具>滑行玩具", | |
| 510 | - "tags": f"{target_lang}-tag1,{target_lang}-tag2", | |
| 511 | - "target_audience": f"{target_lang}-audience", | |
| 512 | - "usage_scene": "", | |
| 513 | - "season": "", | |
| 514 | - "key_attributes": "", | |
| 515 | - "material": "", | |
| 516 | - "features": "", | |
| 517 | - "anchor_text": f"{target_lang}-anchor", | |
| 518 | - } | |
| 519 | - ] | |
| 520 | - | |
| 521 | - with mock.patch.object( | |
| 522 | - product_enrich, | |
| 523 | - "analyze_products", | |
```diff
-        side_effect=fake_analyze_products,
-    ):
-        result = product_enrich.build_index_content_fields(
-            items=[{"spu_id": "69960", "title": "dress"}],
-            tenant_id="170",
-        )
-
-    assert result == [
-        {
-            "id": "69960",
-            "qanchors": {"zh": ["zh-anchor"], "en": ["en-anchor"]},
-            "enriched_tags": {"zh": ["zh-tag1", "zh-tag2"], "en": ["en-tag1", "en-tag2"]},
-            "enriched_attributes": [
-                {
-                    "name": "enriched_tags",
-                    "value": {
-                        "zh": ["zh-tag1", "zh-tag2"],
-                        "en": ["en-tag1", "en-tag2"],
-                    },
-                },
-                {"name": "target_audience", "value": {"zh": ["zh-audience"], "en": ["en-audience"]}},
-            ],
-            "enriched_taxonomy_attributes": [
-                {
-                    "name": "Product Type",
-                    "value": {"zh": ["zh-dress"], "en": ["en-dress"]},
-                },
-                {
-                    "name": "Target Gender",
-                    "value": {"zh": ["zh-women"], "en": ["en-women"]},
-                },
-                {
-                    "name": "Season",
-                    "value": {"zh": ["zh-summer"], "en": ["en-summer"]},
-                },
-            ],
-        }
-    ]
-
-
-def test_build_index_content_fields_non_apparel_taxonomy_returns_en_only():
-    seen_calls = []
-
-    def fake_analyze_products(
-        products,
-        target_lang="zh",
-        batch_size=None,
-        tenant_id=None,
-        analysis_kind="content",
-        category_taxonomy_profile=None,
-    ):
-        seen_calls.append((analysis_kind, target_lang, category_taxonomy_profile, tuple(p["id"] for p in products)))
-        if analysis_kind == "taxonomy":
-            assert category_taxonomy_profile == "toys"
-            assert target_lang == "en"
-            return [
-                {
-                    "id": products[0]["id"],
-                    "lang": "en",
-                    "title_input": products[0]["title"],
-                    "product_type": "doll set",
-                    "age_group": "kids",
-                    "character_theme": "",
-                    "material": "",
-                    "power_source": "",
-                    "interactive_features": "",
-                    "educational_play_value": "",
-                    "piece_count_size": "",
-                    "color": "",
-                    "use_scenario": "",
-                }
-            ]
-
-        return [
-            {
-                "id": product["id"],
-                "lang": target_lang,
-                "title_input": product["title"],
-                "title": product["title"],
-                "category_path": "",
-                "tags": f"{target_lang}-tag",
-                "target_audience": "",
-                "usage_scene": "",
-                "season": "",
-                "key_attributes": "",
-                "material": "",
-                "features": "",
-                "anchor_text": f"{target_lang}-anchor",
-            }
-            for product in products
-        ]
-
-    with mock.patch.object(product_enrich, "analyze_products", side_effect=fake_analyze_products):
-        result = product_enrich.build_index_content_fields(
-            items=[{"spu_id": "2", "title": "toy"}],
-            tenant_id="170",
-            category_taxonomy_profile="toys",
-        )
-
-    assert result == [
-        {
-            "id": "2",
-            "qanchors": {"zh": ["zh-anchor"], "en": ["en-anchor"]},
-            "enriched_tags": {"zh": ["zh-tag"], "en": ["en-tag"]},
-            "enriched_attributes": [
-                {
-                    "name": "enriched_tags",
-                    "value": {
-                        "zh": ["zh-tag"],
-                        "en": ["en-tag"],
-                    },
-                }
-            ],
-            "enriched_taxonomy_attributes": [
-                {"name": "Product Type", "value": {"en": ["doll set"]}},
-                {"name": "Age Group", "value": {"en": ["kids"]}},
-            ],
-        }
-    ]
-    assert ("taxonomy", "zh", "toys", ("2",)) not in seen_calls
-    assert ("taxonomy", "en", "toys", ("2",)) in seen_calls
-
-
-def test_anchor_cache_key_depends_on_product_input_not_identifiers():
-    product_a = {
-        "id": "1",
-        "spu_id": "1001",
-        "title": "dress",
-        "brief": "soft cotton",
-        "description": "summer dress",
-        "image_url": "https://img/a.jpg",
-    }
-    product_b = {
-        "id": "2",
-        "spu_id": "9999",
-        "title": "dress",
-        "brief": "soft cotton",
-        "description": "summer dress",
-        "image_url": "https://img/a.jpg",
-    }
-    product_c = {
-        "id": "1",
-        "spu_id": "1001",
-        "title": "dress",
-        "brief": "soft cotton updated",
-        "description": "summer dress",
-        "image_url": "https://img/a.jpg",
-    }
-
-    key_a = product_enrich._make_anchor_cache_key(product_a, "zh")
-    key_b = product_enrich._make_anchor_cache_key(product_b, "zh")
-    key_c = product_enrich._make_anchor_cache_key(product_c, "zh")
-
-    assert key_a == key_b
-    assert key_a != key_c
-
-
-def test_analysis_cache_key_isolated_by_analysis_kind():
-    product = {
-        "id": "1",
-        "title": "dress",
-        "brief": "soft cotton",
-        "description": "summer dress",
-    }
-
-    content_key = product_enrich._make_analysis_cache_key(product, "zh", "content")
-    taxonomy_key = product_enrich._make_analysis_cache_key(product, "zh", "taxonomy")
-
-    assert content_key != taxonomy_key
-
-
-def test_analysis_cache_key_changes_when_prompt_contract_changes():
-    product = {
-        "id": "1",
-        "title": "dress",
-        "brief": "soft cotton",
-        "description": "summer dress",
-    }
-
-    original_key = product_enrich._make_analysis_cache_key(product, "zh", "taxonomy")
-
-    with mock.patch.object(
-        product_enrich,
-        "USER_INSTRUCTION_TEMPLATE",
-        "Please return JSON only. Language: {language}",
-    ):
-        changed_key = product_enrich._make_analysis_cache_key(product, "zh", "taxonomy")
-
-    assert original_key != changed_key
-
-
-def test_build_prompt_input_text_appends_brief_and_description_for_short_title():
-    product = {
-        "title": "T恤",
-        "brief": "夏季透气纯棉短袖,舒适亲肤",
-        "description": "100%棉,圆领版型,适合日常通勤与休闲穿搭。",
-    }
-
-    text = product_enrich._build_prompt_input_text(product)
-
-    assert text.startswith("T恤")
-    assert "夏季透气纯棉短袖" in text
-    assert "100%棉" in text
-
-
-def test_build_prompt_input_text_truncates_non_cjk_by_words():
-    product = {
-        "title": "dress",
-        "brief": " ".join(f"brief{i}" for i in range(50)),
-        "description": " ".join(f"desc{i}" for i in range(50)),
-    }
-
-    text = product_enrich._build_prompt_input_text(product)
-
-    assert len(text.split()) <= product_enrich.PROMPT_INPUT_MAX_WORDS
```
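The deleted helpers (`_make_anchor_cache_key`, `_make_analysis_cache_key`) are no longer in the repo, but the removed tests above pin down the property they guaranteed: cache keys hash only the content fields plus the target language, never the `id`/`spu_id` identifiers. A minimal sketch of that property, with hypothetical names and assuming a SHA-256 over a canonical JSON payload (this is not the removed implementation):

```python
import hashlib
import json

# Hypothetical sketch: only content fields and the target language feed the
# key, so two products that differ only in id/spu_id share a cache entry,
# while any edit to the content produces a new key.
CONTENT_FIELDS = ("title", "brief", "description", "image_url")


def make_anchor_cache_key(product: dict, lang: str) -> str:
    payload = {field: product.get(field, "") for field in CONTENT_FIELDS}
    payload["lang"] = lang
    # sort_keys gives a canonical serialization, so dict insertion order
    # never changes the hash.
    blob = json.dumps(payload, ensure_ascii=False, sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```

Under this sketch, products that share `title`/`brief`/`description`/`image_url` collide on the same key regardless of identifiers, and updating `brief` (or switching `lang`) yields a different key, which is exactly what `test_anchor_cache_key_depends_on_product_input_not_identifiers` asserted.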