

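Content Enrichment Module Overview
This document describes the responsibilities, entry points, and output structure of the product content enrichment module, along with the design constraints on the current taxonomy profiles.
1. Module goals
The content enrichment module calls an LLM on product text to generate the following index fields: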


qanchors
enriched_tags
enriched_attributes
enriched_taxonomy_attributes


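Design principles:


Single responsibility: the module handles only content understanding and structured output; it does not read or write CSV files
Output aligned with the ES mapping: the returned structure can be written directly to search_products
Configuration-driven extension: taxonomy profiles are extended through data configuration rather than through scattered conditional branches
Lean code: the module targets normal usage only and avoids piling up patch logic for unreasonable calls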

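2. Main files

product_enrich.py
Main runtime logic: batching, caching, prompt assembly, LLM calls, markdown parsing, and output normalization
product_enrich_prompts.py
Prompt templates and taxonomy profile configuration
document_transformer.py
Calls the content enrichment module from the internal index-building pipeline and writes the results back into the ES doc
taxonomy.md
Taxonomy design notes and field list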

3. External entry points
3.1 Python entry point
Core entry point:
build_index_content_fields(
    items,
    tenant_id=None,
    enrichment_scopes=None,
    category_taxonomy_profile=None,
)


Minimum required input:


id or spu_id
title


Optional input:


brief
description
image_url


Key parameters:


enrichment_scopes
Accepted values: generic, category_taxonomy
category_taxonomy_profile
The taxonomy profile to use; defaults to apparel
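
The input requirements and parameters above can be checked on the caller side before invoking the entry point. The helper below is only a sketch: `check_enrich_call` and `ALLOWED_SCOPES` are hypothetical names, not part of the module, and the code encodes nothing beyond the rules stated in this section. It treats `None` as "not specified" and leaves defaults to the module.

```python
ALLOWED_SCOPES = {"generic", "category_taxonomy"}  # scopes listed in this section


def check_enrich_call(items, enrichment_scopes=None, category_taxonomy_profile=None):
    """Hypothetical pre-flight check; returns a list of human-readable problems."""
    problems = []
    for index, item in enumerate(items):
        # Minimum input: an identifier (id or spu_id) and a title.
        if not (item.get("id") or item.get("spu_id")):
            problems.append(f"items[{index}]: missing id or spu_id")
        if not item.get("title"):
            problems.append(f"items[{index}]: missing title")
    # None means "not specified"; the module applies its own default.
    for scope in enrichment_scopes or []:
        if scope not in ALLOWED_SCOPES:
            problems.append(f"unknown enrichment scope: {scope}")
    if category_taxonomy_profile is not None and not isinstance(
        category_taxonomy_profile, str
    ):
        problems.append("category_taxonomy_profile must be a string or None")
    return problems


items = [
    {"id": "223167", "title": "短袖T恤 纯棉", "brief": "basic cotton tee"},
    {"title": "no identifier"},
]
print(check_enrich_call(items, enrichment_scopes=["generic", "category_taxonomy"]))
```

Items that pass this check can then be handed to `build_index_content_fields` with the same keyword arguments.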

3.2 HTTP entry point
API route:


POST /indexer/enrich-content
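
As a rough illustration, a request to this route could be built as follows. This document does not specify the request body, so the payload shape here is an assumption that simply mirrors the Python entry point's parameters, and `BASE_URL` is a placeholder; the indexer interface guide is the authority on the actual schema.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: replace with the real indexer host

# Assumption: the JSON body mirrors the Python entry point's parameters.
payload = {
    "items": [{"spu_id": "223167", "title": "短袖T恤 纯棉"}],
    "enrichment_scopes": ["generic", "category_taxonomy"],
    "category_taxonomy_profile": "apparel",
}

request = urllib.request.Request(
    url=f"{BASE_URL}/indexer/enrich-content",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Sending is left commented out so the sketch runs without a live service:
# with urllib.request.urlopen(request, timeout=30) as response:
#     result = json.load(response)
print(request.get_method(), request.full_url)
```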


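Related documentation:


搜索API对接指南-05-索引接口（Indexer） (Search API Integration Guide, part 05: Indexer interface)
搜索API对接指南-07-微服务接口（Embedding-Reranker-Translation） (Search API Integration Guide, part 07: Microservice interfaces)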

4. Output structure
The returned result is aligned with the ES mapping:
{
  "id": "223167",
  "qanchors": {
    "zh": ["短袖T恤", "纯棉"],
    "en": ["t-shirt", "cotton"]
  },
  "enriched_tags": {
    "zh": ["短袖", "纯棉"],
    "en": ["short sleeve", "cotton"]
  },
  "enriched_attributes": [
    {
      "name": "enriched_tags",
      "value": {
        "zh": ["短袖", "纯棉"],
        "en": ["short sleeve", "cotton"]
      }
    }
  ],
  "enriched_taxonomy_attributes": [
    {
      "name": "Product Type",
      "value": {
        "zh": ["T恤"],
        "en": ["t-shirt"]
      }
    }
  ]
}


Notes:


The generic part always outputs the core index languages zh and en
The taxonomy part likewise always outputs zh and en
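
The shape shown above can be expressed as a small structural check, which may be useful in tests before writing results to ES. This is a sketch only: `is_localized`, `is_attribute_list`, and `check_enriched_doc` are hypothetical helpers, not functions of the module. It assumes every localized value carries both `zh` and `en` lists, and that fields outside the requested scopes may be absent.

```python
CORE_LANGS = ("zh", "en")


def is_localized(value):
    """True if value looks like {"zh": [str, ...], "en": [str, ...]}."""
    return isinstance(value, dict) and all(
        isinstance(value.get(lang), list)
        and all(isinstance(term, str) for term in value[lang])
        for lang in CORE_LANGS
    )


def is_attribute_list(value):
    """True if value is a list of {"name": str, "value": <localized>} entries."""
    return isinstance(value, list) and all(
        isinstance(entry, dict)
        and isinstance(entry.get("name"), str)
        and is_localized(entry.get("value"))
        for entry in value
    )


def check_enriched_doc(doc):
    """Return the fields that are present but do not match the documented shape."""
    checks = {
        "qanchors": is_localized,
        "enriched_tags": is_localized,
        "enriched_attributes": is_attribute_list,
        "enriched_taxonomy_attributes": is_attribute_list,
    }
    return [
        field
        for field, check in checks.items()
        if field in doc and not check(doc[field])
    ]


sample = {
    "id": "223167",
    "qanchors": {"zh": ["短袖T恤", "纯棉"], "en": ["t-shirt", "cotton"]},
    "enriched_taxonomy_attributes": [
        {"name": "Product Type", "value": {"zh": ["T恤"], "en": ["t-shirt"]}}
    ],
}
print(check_enriched_doc(sample))
```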

5. Taxonomy profile
Currently supported profiles:


apparel
3c
bags
pet_supplies
electronics
outdoor
home_appliances
home_living
wigs
beauty
accessories
toys
shoes
sports
others


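Shared constraints:


All profiles return zh + en
A profile determines only the set of taxonomy fields; it no longer determines the output language
Every profile configures field names in both Chinese and English, and the prompt/header structure stays consistent across profiles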
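
To illustrate what "configured as data" might look like, here is a minimal sketch. The real configuration lives in product_enrich_prompts.py and its structure is not shown in this document, so `TAXONOMY_PROFILES`, `build_header`, and every field other than "Product Type" are placeholders invented for the example.

```python
# Hypothetical profile table: each field carries an English and a Chinese name.
# Only "Product Type" appears in this document; the other fields are placeholders.
TAXONOMY_PROFILES = {
    "apparel": [
        {"en": "Product Type", "zh": "产品类型"},
        {"en": "Material", "zh": "材质"},
    ],
    "shoes": [
        {"en": "Product Type", "zh": "产品类型"},
        {"en": "Closure Type", "zh": "闭合方式"},
    ],
}


def build_header(profile, lang):
    """Build a markdown table header row for one profile in one language."""
    try:
        fields = TAXONOMY_PROFILES[profile]
    except KeyError:
        raise ValueError(f"unsupported taxonomy profile: {profile}") from None
    columns = ["id"] + [field[lang] for field in fields]
    return "| " + " | ".join(columns) + " |"


print(build_header("apparel", "en"))
print(build_header("apparel", "zh"))
```

Under this layout, adding a profile means adding one entry to the table; the header-building code is shared, so no profile needs its own conditional branch.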

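6. Current constraint in the internal indexing pipeline
In the internal ES document-building pipeline, document_transformer currently passes a fixed taxonomy profile when it calls content enrichment: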
category_taxonomy_profile="apparel"


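This is a temporary strategy that is explicit, controllable, and keeps the code cleaner.

The code currently retains a TODO:


Later, read the tenant's actual industry from the database
Then replace the fixed apparel with that industry


There is currently no implicit logic that guesses the profile from product category text, which avoids redundant code and unnecessary uncertainty.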
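
The fixed-profile strategy can be pictured as a single resolution point that later becomes a tenant lookup. The function below is a hypothetical sketch, not code from document_transformer; `resolve_taxonomy_profile` and `DEFAULT_PROFILE` are names chosen for illustration.

```python
DEFAULT_PROFILE = "apparel"  # the fixed value currently passed by the pipeline


def resolve_taxonomy_profile(tenant_id=None):
    """Hypothetical helper: pick the profile the indexing pipeline passes on."""
    # TODO (mirrors the TODO described above): read the tenant's actual
    # industry from the database and return it instead of the fixed default.
    # Deliberately no guessing from product category text.
    return DEFAULT_PROFILE


print(resolve_taxonomy_profile(tenant_id="tenant-1"))
```

Keeping the decision in one place means the later switch to a database lookup touches only this function.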
7. Caching and batching
The cache key is determined jointly by:


analysis_kind
target_lang
prompt/schema version fingerprint
The actual prompt input text
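
One way to combine these four components into a key is to hash them together. The sketch below is an assumption about how such a key could be built, not the module's actual scheme: `make_cache_key`, the `enrich:` prefix, the use of SHA-256, and the example argument values are all illustrative.

```python
import hashlib
import json


def make_cache_key(analysis_kind, target_lang, prompt_fingerprint, prompt_input):
    """Hypothetical cache key: a SHA-256 digest over the four listed components."""
    # JSON-encode the parts as a list so component boundaries stay unambiguous.
    material = json.dumps(
        [analysis_kind, target_lang, prompt_fingerprint, prompt_input],
        ensure_ascii=False,
    )
    return "enrich:" + hashlib.sha256(material.encode("utf-8")).hexdigest()


key_v1 = make_cache_key("generic", "zh", "v1", "短袖T恤 纯棉")
key_v2 = make_cache_key("generic", "zh", "v2", "短袖T恤 纯棉")
print(key_v1 != key_v2)
```

Because the version fingerprint is part of the hashed material, changing a prompt or schema produces new keys, so stale entries are simply never hit again.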


Batching rules:


Each LLM call handles at most 20 items
Callers may pass larger batches; the module splits them internally
Uncached batches may run concurrently
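
The batching rules can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's implementation: `split_batches`, `run_batches`, `fake_llm`, and the thread-pool approach with `max_workers=4` are all hypothetical, and cache lookup is omitted for brevity.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_ITEMS_PER_CALL = 20  # per-call limit stated above


def split_batches(items, size=MAX_ITEMS_PER_CALL):
    """Split items into consecutive batches of at most `size` entries."""
    return [items[start:start + size] for start in range(0, len(items), size)]


def run_batches(items, call_llm, max_workers=4):
    """Run call_llm on each batch concurrently; return results in input order."""
    batches = split_batches(items)
    if not batches:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves batch order even when calls finish out of order.
        per_batch = list(pool.map(call_llm, batches))
    return [result for batch_result in per_batch for result in batch_result]


def fake_llm(batch):
    # Stand-in for the real LLM call: echoes each item's id.
    return [{"id": item["id"]} for item in batch]


items = [{"id": str(n), "title": f"item {n}"} for n in range(45)]
print([len(batch) for batch in split_batches(items)])  # [20, 20, 5]
results = run_batches(items, fake_llm)
print(len(results))  # 45
```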