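# Content Enrichment Module

This document describes the product content enrichment module: its responsibilities, entry points, output structure, and the design constraints of the current taxonomy profiles.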

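## 1. Module goals

The content enrichment module calls an LLM on product text to generate the following index fields:

- `qanchors`
- `enriched_tags`
- `enriched_attributes`
- `enriched_taxonomy_attributes`

Design principles:

- Single responsibility: the module only handles content understanding and structured output; it does not read or write CSV files
- Output aligned with the ES mapping: the returned structure can be written directly to `search_products`
- Configuration-driven extension: taxonomy profiles are extended through data configuration rather than scattered conditional branches
- Lean code: the module targets normal usage only and avoids piling on patch logic for unreasonable calls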

## 2. Main files

- [product_enrich.py](/data/saas-search/indexer/product_enrich.py)
  Main runtime logic: batching, caching, prompt assembly, LLM calls, markdown parsing, and output shaping
- [product_enrich_prompts.py](/data/saas-search/indexer/product_enrich_prompts.py)
  Prompt templates and taxonomy profile configuration
- [document_transformer.py](/data/saas-search/indexer/document_transformer.py)
  Calls the content enrichment module within the internal index-building pipeline and backfills the results into the ES doc
- [taxonomy.md](/data/saas-search/indexer/taxonomy.md)
  Taxonomy design notes and field list

## 3. Public entry points

### 3.1 Python entry point

Core entry point:

```python
build_index_content_fields(
    items,
    tenant_id=None,
    enrichment_scopes=None,
    category_taxonomy_profile=None,
)
```

Minimum input for each item:

- `id` or `spu_id`
- `title`

Optional input:

- `brief`
- `description`
- `image_url`

Key parameters:

- `enrichment_scopes`
  Accepted values: `generic`, `category_taxonomy`
- `category_taxonomy_profile`
  The taxonomy profile; defaults to `apparel`
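The input requirements above can be checked before calling the entry point. The following is a minimal sketch, not part of the module: `missing_required_fields` is a hypothetical helper, the sample items are invented, and the commented-out call assumes that `enrichment_scopes` accepts a list of scope names and that `tenant_id` is a string, neither of which this document confirms.

```python
from typing import Any, List, Mapping

ID_FIELDS = ("id", "spu_id")   # either one identifies the item
REQUIRED_TEXT_FIELD = "title"  # the only mandatory text input


def missing_required_fields(item: Mapping[str, Any]) -> List[str]:
    """Return the names of required inputs that are absent or empty."""
    missing = []
    if not any(item.get(name) for name in ID_FIELDS):
        missing.append("id|spu_id")
    if not item.get(REQUIRED_TEXT_FIELD):
        missing.append(REQUIRED_TEXT_FIELD)
    return missing


# Invented sample data; `brief` is one of the optional inputs.
items = [
    {"id": "223167", "title": "Short-sleeve cotton T-shirt", "brief": "Breathable summer tee"},
    {"spu_id": "223168", "title": "Canvas tote bag"},
    {"title": "No identifier"},  # rejected: neither id nor spu_id
]

valid_items = [item for item in items if not missing_required_fields(item)]

# With the real module importable, the call might look like this
# (list-valued scopes and the tenant_id value are assumptions):
# results = build_index_content_fields(
#     valid_items,
#     tenant_id="demo-tenant",
#     enrichment_scopes=["generic", "category_taxonomy"],
#     category_taxonomy_profile="apparel",
# )
```

Filtering on the caller side keeps the module itself free of defensive branches, which matches the lean-code principle stated in section 1.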

### 3.2 HTTP entry point

API route:

- `POST /indexer/enrich-content`

Related documentation:

- [搜索API对接指南-05-索引接口（Indexer）](/data/saas-search/docs/搜索API对接指南-05-索引接口（Indexer）.md)
- [搜索API对接指南-07-微服务接口（Embedding-Reranker-Translation）](/data/saas-search/docs/搜索API对接指南-07-微服务接口（Embedding-Reranker-Translation）.md)

## 4. Output structure

The returned result is aligned with the ES mapping:

```json
{
  "id": "223167",
  "qanchors": {
    "zh": ["短袖T恤", "纯棉"],
    "en": ["t-shirt", "cotton"]
  },
  "enriched_tags": {
    "zh": ["短袖", "纯棉"],
    "en": ["short sleeve", "cotton"]
  },
  "enriched_attributes": [
    {
      "name": "enriched_tags",
      "value": {
        "zh": ["短袖", "纯棉"],
        "en": ["short sleeve", "cotton"]
      }
    }
  ],
  "enriched_taxonomy_attributes": [
    {
      "name": "Product Type",
      "value": {
        "zh": ["T恤"],
        "en": ["t-shirt"]
      }
    }
  ]
}
```

Notes:

- The `generic` part always outputs the core index languages `zh` and `en`
- The `taxonomy` part likewise always outputs `zh` and `en`
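Because both parts are expected to carry `zh` and `en`, a consumer can sanity-check a returned document before indexing it. The sketch below is a hypothetical helper, not part of the module. It assumes the field layout of the sample JSON above and treats a field that is absent altogether as "scope not requested" rather than as an error.

```python
from typing import Any, List, Mapping, Tuple

CORE_LANGS = ("zh", "en")
LANG_MAP_FIELDS = ("qanchors", "enriched_tags")
ATTR_LIST_FIELDS = ("enriched_attributes", "enriched_taxonomy_attributes")


def languages_missing(doc: Mapping[str, Any]) -> List[Tuple[str, str]]:
    """Return (field, lang) pairs where a core language key is absent."""
    problems = []
    for field in LANG_MAP_FIELDS:
        if field not in doc:
            continue  # assumption: field omitted because its scope was not requested
        for lang in CORE_LANGS:
            if lang not in doc[field]:
                problems.append((field, lang))
    for field in ATTR_LIST_FIELDS:
        for attr in doc.get(field, []):
            for lang in CORE_LANGS:
                if lang not in attr.get("value", {}):
                    problems.append((f"{field}[{attr.get('name')}]", lang))
    return problems


sample = {
    "id": "223167",
    "qanchors": {"zh": ["短袖T恤", "纯棉"], "en": ["t-shirt", "cotton"]},
    "enriched_tags": {"zh": ["短袖", "纯棉"], "en": ["short sleeve", "cotton"]},
    "enriched_taxonomy_attributes": [
        {"name": "Product Type", "value": {"zh": ["T恤"], "en": ["t-shirt"]}}
    ],
}

print(languages_missing(sample))
```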

## 5. Taxonomy profile

Currently supported:

- `apparel`
- `3c`
- `bags`
- `pet_supplies`
- `electronics`
- `outdoor`
- `home_appliances`
- `home_living`
- `wigs`
- `beauty`
- `accessories`
- `toys`
- `shoes`
- `sports`
- `others`

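Constraints shared by all profiles:

- Every profile returns `zh` + `en`
- A profile only determines the set of taxonomy fields; it no longer determines the output language
- Every profile configures both Chinese and English field names, and the prompt/header structure stays consistent across profiles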
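One way to picture "profiles as data" is a small registry in which each profile is only a list of bilingual field names. This is an illustrative sketch and not the layout used in `product_enrich_prompts.py`: the class names, `get_profile`, and every field other than `Product Type` (including all Chinese field names) are hypothetical. The real field lists are documented in `taxonomy.md`.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TaxonomyField:
    name_en: str
    name_zh: str


@dataclass(frozen=True)
class TaxonomyProfile:
    key: str
    fields: Tuple[TaxonomyField, ...]

    def header(self, lang: str) -> List[str]:
        """Field names in the requested language, in configured order."""
        attr = {"en": "name_en", "zh": "name_zh"}[lang]
        return [getattr(field, attr) for field in self.fields]


# Hypothetical, heavily truncated field lists for illustration only.
PROFILES: Dict[str, TaxonomyProfile] = {
    profile.key: profile
    for profile in (
        TaxonomyProfile("apparel", (
            TaxonomyField("Product Type", "品类"),
            TaxonomyField("Material", "材质"),
        )),
        TaxonomyProfile("bags", (
            TaxonomyField("Product Type", "品类"),
            TaxonomyField("Capacity", "容量"),
        )),
    )
}


def get_profile(key: str = "apparel") -> TaxonomyProfile:
    """Look up a profile by key; the default mirrors the documented default."""
    try:
        return PROFILES[key]
    except KeyError:
        raise ValueError(f"unknown taxonomy profile: {key!r}") from None
```

Under this shape, adding a profile means adding one data entry. The language handling lives in `header` and is identical for every profile, so no profile can change the output languages.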

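## 6. Current constraint in the internal indexing pipeline

In the internal ES document-building pipeline, `document_transformer` currently passes a fixed taxonomy profile when it calls content enrichment: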

```python
category_taxonomy_profile="apparel"
```

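This is an explicit, controllable temporary strategy that keeps the code cleaner.

The code currently carries a TODO:

- Read the tenant's actual industry from the database
- Replace the hard-coded `apparel` with that industry

The module deliberately does not implement implicit logic that guesses the profile from product category text, which would add redundant code and unnecessary uncertainty.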

## 7. Caching and batching

The cache key is determined jointly by:

- `analysis_kind`
- `target_lang`
- The prompt/schema version fingerprint
- The actual prompt input text
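A key built from these four components could be derived as in the sketch below. The function name, the `enrich:` prefix, the key layout, and the choice of SHA-256 are all assumptions made for illustration; the module's real key format is not described here.

```python
import hashlib
import json


def build_cache_key(
    analysis_kind: str,
    target_lang: str,
    prompt_fingerprint: str,
    prompt_input: str,
    prefix: str = "enrich",  # hypothetical namespace prefix
) -> str:
    """Derive a deterministic cache key from the four documented components."""
    # Serialize as JSON so that component boundaries cannot be confused
    # (e.g. "ab" + "c" versus "a" + "bc").
    payload = json.dumps(
        {
            "kind": analysis_kind,
            "lang": target_lang,
            "fingerprint": prompt_fingerprint,
            "input": prompt_input,
        },
        ensure_ascii=False,
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    # Keep kind and language readable in the key for easier debugging.
    return f"{prefix}:{analysis_kind}:{target_lang}:{digest}"


key = build_cache_key("generic", "zh", "v1", "title: Short-sleeve cotton T-shirt")
```

Including the prompt/schema fingerprint means that editing a prompt template changes every key, so stale results are bypassed without an explicit cache flush.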

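Batching rules:

- A single LLM call handles at most 20 items
- Callers may pass larger batches; the module splits them internally
- Uncached batches may run concurrently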
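The batching rules can be sketched as a chunking step followed by a thread pool. This is an illustration under stated assumptions, not the module's implementation: `split_batches`, `run_batches`, the `max_workers` value, and the `fake_llm` stand-in are hypothetical, and cache lookup is omitted.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence

Item = Dict[str, str]

MAX_ITEMS_PER_CALL = 20  # limit stated in the batching rules


def split_batches(items: Sequence[Item], size: int = MAX_ITEMS_PER_CALL) -> List[List[Item]]:
    """Split items into consecutive chunks of at most `size` elements."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def run_batches(
    items: Sequence[Item],
    call_llm: Callable[[List[Item]], List[dict]],
    max_workers: int = 4,  # hypothetical concurrency level
) -> List[dict]:
    """Run one call per batch concurrently and return results in input order."""
    batches = split_batches(items)
    if not batches:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in submission order, so the
        # flattened output lines up with the original item order.
        per_batch = list(pool.map(call_llm, batches))
    return [result for batch_results in per_batch for result in batch_results]


def fake_llm(batch: List[Item]) -> List[dict]:
    """Hypothetical stand-in for the real LLM call."""
    return [{"id": item["id"], "qanchors": {"zh": [], "en": []}} for item in batch]


items = [{"id": str(i), "title": f"item {i}"} for i in range(45)]
results = run_batches(items, fake_llm)
```

Threads suit this workload because each batch spends most of its time waiting on the LLM response rather than using the CPU.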