
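# Content Enrichment Module
This document describes the responsibilities, entry points, and output structure of the product content enrichment module, along with the design constraints of the current taxonomy profiles.
## 1. Module goals
The content enrichment module calls an LLM on product text to generate the following index fields: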
- `qanchors`
- `enriched_tags`
- `enriched_attributes`
- `enriched_taxonomy_attributes`
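Design principles:
- Single responsibility: the module handles content understanding and structured output only; it does not read or write CSV files
- Output aligned with the ES mapping: the returned structure can be written directly to `search_products`
- Configuration-driven extension: taxonomy profiles are extended through data configuration rather than scattered conditional branches
- Lean code: the module targets normal usage only and avoids piling up patch logic to accommodate unreasonable calls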
## 2. Main files
- [product_enrich.py](/data/saas-search/indexer/product_enrich.py)
  Runtime main logic: batching, caching, prompt assembly, LLM calls, markdown parsing, and output normalization
- [product_enrich_prompts.py](/data/saas-search/indexer/product_enrich_prompts.py)
  Prompt templates and taxonomy profile configuration
- [document_transformer.py](/data/saas-search/indexer/document_transformer.py)
  Calls the content enrichment module from the internal index-building pipeline and writes the results back into the ES doc
- [taxonomy.md](/data/saas-search/indexer/taxonomy.md)
  Taxonomy design notes and field list
## 3. Public entry points
### 3.1 Python entry point
Core entry point:
```python
build_index_content_fields(
    items,
    tenant_id=None,
    enrichment_scopes=None,
    category_taxonomy_profile=None,
)
```
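Minimum required input fields for each item:
- `id` or `spu_id`
- `title`
Optional input fields:
- `brief`
- `description`
- `image_url`
Key parameters:
- `enrichment_scopes`
  Accepted values: `generic`, `category_taxonomy`
- `category_taxonomy_profile`
  The taxonomy profile to use; defaults to `apparel`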
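
The minimum-input rule above can be checked before calling the entry point. The following is a minimal sketch, not part of the module: `missing_required_fields` is a hypothetical helper, and the import path, the `tenant_id` value, and passing `enrichment_scopes` as a list in the commented call are assumptions.

```python
from typing import Any, List, Mapping

ID_FIELDS = ("id", "spu_id")  # either one satisfies the identifier requirement
REQUIRED_TEXT_FIELD = "title"


def missing_required_fields(item: Mapping[str, Any]) -> List[str]:
    """Hypothetical helper: list the minimum-input fields an item lacks."""
    missing = []
    if not any(item.get(key) for key in ID_FIELDS):
        missing.append("id|spu_id")
    if not item.get(REQUIRED_TEXT_FIELD):
        missing.append(REQUIRED_TEXT_FIELD)
    return missing


items = [
    {"id": "223167", "title": "短袖T恤", "brief": "纯棉"},
    {"spu_id": "9001", "title": "Canvas tote bag"},
    {"title": "No identifier"},  # rejected: neither `id` nor `spu_id`
]
valid_items = [item for item in items if not missing_required_fields(item)]

# The real call needs the module itself, so it is shown as a comment.
# Import path, tenant_id value, and list-typed scopes are assumptions:
# from indexer.product_enrich import build_index_content_fields
# results = build_index_content_fields(
#     valid_items,
#     tenant_id="demo-tenant",
#     enrichment_scopes=["generic", "category_taxonomy"],
#     category_taxonomy_profile="apparel",
# )
```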
### 3.2 HTTP entry point
API route:
- `POST /indexer/enrich-content`
Related documentation:
- [Search API Integration Guide 05: Indexing API (Indexer)](/data/saas-search/docs/搜索API对接指南-05-索引接口（Indexer）.md)
- [Search API Integration Guide 07: Microservice APIs (Embedding-Reranker-Translation)](/data/saas-search/docs/搜索API对接指南-07-微服务接口（Embedding-Reranker-Translation）.md)
## 4. Output structure
The returned result is aligned with the ES mapping:
```json
{
  "id": "223167",
  "qanchors": {
    "zh": ["短袖T恤", "纯棉"],
    "en": ["t-shirt", "cotton"]
  },
  "enriched_tags": {
    "zh": ["短袖", "纯棉"],
    "en": ["short sleeve", "cotton"]
  },
  "enriched_attributes": [
    {
      "name": "enriched_tags",
      "value": {
        "zh": ["短袖", "纯棉"],
        "en": ["short sleeve", "cotton"]
      }
    }
  ],
  "enriched_taxonomy_attributes": [
    {
      "name": "Product Type",
      "value": {
        "zh": ["T恤"],
        "en": ["t-shirt"]
      }
    }
  ]
}
```
Notes:
- The `generic` part always outputs the core index languages `zh` and `en`
- The `taxonomy` part likewise always outputs `zh` and `en`
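
One way to guard the `zh`/`en` rule is a small structural check on each returned document. This is a sketch under the assumption that results look like the JSON example above; `languages_missing` is a hypothetical helper, not a function of the module.

```python
from typing import Any, Dict, List

CORE_LANGS = ("zh", "en")


def languages_missing(result: Dict[str, Any]) -> List[str]:
    """Hypothetical check: return paths of values lacking `zh` or `en`."""
    problems = []
    # Fields whose value is a language-keyed object.
    for field in ("qanchors", "enriched_tags"):
        if field not in result:
            continue  # the scope may not have been requested
        for lang in CORE_LANGS:
            if lang not in result[field]:
                problems.append(f"{field}.{lang}")
    # Fields that are lists of {"name": ..., "value": {lang: [...]}} entries.
    for field in ("enriched_attributes", "enriched_taxonomy_attributes"):
        for index, attr in enumerate(result.get(field, [])):
            for lang in CORE_LANGS:
                if lang not in attr.get("value", {}):
                    problems.append(f"{field}[{index}].value.{lang}")
    return problems


sample = {
    "id": "223167",
    "qanchors": {"zh": ["短袖T恤"], "en": ["t-shirt"]},
    "enriched_taxonomy_attributes": [
        {"name": "Product Type", "value": {"zh": ["T恤"], "en": ["t-shirt"]}}
    ],
}
assert languages_missing(sample) == []
```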
## 5. Taxonomy profile
Currently supported profiles:
- `apparel`
- `3c`
- `bags`
- `pet_supplies`
- `electronics`
- `outdoor`
- `home_appliances`
- `home_living`
- `wigs`
- `beauty`
- `accessories`
- `toys`
- `shoes`
- `sports`
- `others`
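Constraints shared by all profiles:
- Every profile returns `zh` + `en`
- A profile only determines the set of taxonomy fields; it no longer determines the output languages
- Every profile configures both Chinese and English field names, so the prompt/header structure stays consistent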
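
To illustrate what "profiles as data" can look like, here is a hedged sketch of a registry where each profile is just a list of bilingual field names. The names `TaxonomyField`, `TAXONOMY_PROFILES`, and `header_for`, the markdown-table header shape, and every field other than `Product Type` are assumptions for illustration; the real configuration lives in `product_enrich_prompts.py` and may be structured differently.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

OUTPUT_LANGS = ("zh", "en")  # fixed for every profile, never set per profile


@dataclass(frozen=True)
class TaxonomyField:
    name_en: str
    name_zh: str


# Hypothetical registry: adding a profile means adding data, not branches.
TAXONOMY_PROFILES: Dict[str, Tuple[TaxonomyField, ...]] = {
    "apparel": (
        TaxonomyField("Product Type", "产品类型"),
        TaxonomyField("Material", "材质"),
    ),
    "bags": (
        TaxonomyField("Product Type", "产品类型"),
        TaxonomyField("Capacity", "容量"),
    ),
}


def header_for(profile: str, lang: str) -> str:
    """Build a markdown table header in one output language."""
    if lang not in OUTPUT_LANGS:
        raise ValueError(f"unsupported language: {lang}")
    fields = TAXONOMY_PROFILES[profile]
    names = [f.name_zh if lang == "zh" else f.name_en for f in fields]
    return "| " + " | ".join(names) + " |"
```

Because the same `header_for` serves every profile, the prompt/header structure stays identical across profiles and only the field set varies.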
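## 6. Current constraints in the internal indexing pipeline
In the internal ES document-building pipeline, `document_transformer` currently passes a fixed taxonomy profile when it calls content enrichment: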
```python
category_taxonomy_profile="apparel"
```
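This is a temporary strategy chosen because it is explicit, controllable, and keeps the code clean.
A TODO is already in place in the code:
- Later, read the tenant's actual industry from the database
- Then replace the fixed `apparel` with that industry
For now there is deliberately no implicit logic that guesses the profile from a product's category text, which avoids redundant code and unnecessary uncertainty.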
## 7. Caching and batching
The cache key is determined jointly by:
- `analysis_kind`
- `target_lang`
- The prompt/schema version fingerprint
- The actual prompt input text
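
A cache key built from these four components could be derived as in the sketch below. The function name `make_cache_key`, the `enrich:` prefix, and the use of SHA-256 over a JSON payload are assumptions; the module's actual key format is not described here.

```python
import hashlib
import json


def make_cache_key(
    analysis_kind: str,
    target_lang: str,
    prompt_fingerprint: str,
    prompt_input: str,
) -> str:
    """Hypothetical key builder: hash all four components together."""
    # JSON-encode as a list so component boundaries stay unambiguous.
    payload = json.dumps(
        [analysis_kind, target_lang, prompt_fingerprint, prompt_input],
        ensure_ascii=False,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"enrich:{analysis_kind}:{target_lang}:{digest}"


key_v1 = make_cache_key("generic", "zh", "v1", "短袖T恤 纯棉")
key_v2 = make_cache_key("generic", "zh", "v2", "短袖T恤 纯棉")
assert key_v1 != key_v2  # a new prompt/schema version invalidates old entries
```

Including the version fingerprint in the key means a prompt or schema change naturally misses the old cache entries instead of requiring an explicit flush.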
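Batching rules:
- A single LLM call handles at most 20 items
- Callers may pass larger batches; the module splits them internally
- Uncached batches can run concurrently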
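
The splitting and concurrency rules can be sketched as follows. This is an illustration only: `split_batches`, `run_batches`, the `call_llm` callable, and the thread-pool approach with `max_workers=4` are assumptions, and cache lookup is left out for brevity.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")

MAX_BATCH_SIZE = 20  # per-call limit stated in the rules above


def split_batches(items: Sequence[T], size: int = MAX_BATCH_SIZE) -> List[List[T]]:
    """Split a sequence into consecutive chunks of at most `size` items."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def run_batches(
    items: Sequence[T],
    call_llm: Callable[[List[T]], List[R]],
    max_workers: int = 4,
) -> List[R]:
    """Run each batch through `call_llm` concurrently, preserving input order."""
    batches = split_batches(items)
    if not batches:
        return []
    results: List[R] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in submission order.
        for batch_result in pool.map(call_llm, batches):
            results.extend(batch_result)
    return results


# Stand-in for the LLM call: one output per input item.
fake_llm = lambda batch: [f"enriched:{x}" for x in batch]
assert [len(b) for b in split_batches(list(range(45)))] == [20, 20, 5]
assert run_batches(["a", "b"], fake_llm) == ["enriched:a", "enriched:b"]
```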