09 Apr, 2026
2 commits
-
category_taxonomy_profile - 原 analysis_kinds 混用了“增强类型”(content/taxonomy)与“品类特定配置”,不利于扩展不同品类的 taxonomy 分析(如 3C、家居等) - 新增 enrichment_scopes 参数:支持 generic(通用增强,产出 qanchors/enriched_tags/enriched_attributes)和 category_taxonomy(品类增强,产出 enriched_taxonomy_attributes) - 新增 category_taxonomy_profile 参数:指定品类增强使用哪套 profile(当前内置 apparel),每套 profile 包含独立的 prompt、输出列定义、解析规则及缓存版本 - 保留 analysis_kinds 作为兼容别名,避免破坏现有调用方 - 重构内部 taxonomy 分析为 profile registry 模式:新增 _get_taxonomy_schema(profile_name) 函数,根据 profile 动态返回对应的 AnalysisSchema - 缓存 key 现在按“分析类型 + profile + schema 指纹 + 输入字段哈希”隔离,确保不同品类、不同 prompt 版本自动失效 - 更新 API 文档及微服务接口文档,明确新参数语义与使用示例 技术细节: - 修改入口:api/routes/indexer.py 中 enrich-content 端点,解析新参数并向下传递 - 核心逻辑:indexer/product_enrich.py 中 enrich_products_batch 增加 profile 参数;_process_batch_for_schema 根据 scope 和 profile 动态获取 schema - 兼容层:若请求同时提供 analysis_kinds,则映射为 enrichment_scopes(content→generic,taxonomy→category_taxonomy),category_taxonomy_profile 默认为 "apparel" - 测试覆盖:新增 enrichment_scopes 组合、profile 切换及兼容模式测试
-
- `/indexer/enrich-content` 路由`enriched_taxonomy_attributes` 与 `enriched_attributes` 一并返回 - 新增请求参数 `analysis_kinds`(可选,默认 `["content", "taxonomy"]`),允许调用方按需选择内容分析类型,为后续扩展和成本控制预留空间 - 重构缓存策略:将 `content` 与 `taxonomy` 两类分析的缓存完全隔离,缓存 key 包含 prompt 模板、表头、输出字段定义(即 schema 指纹),确保提示词或解析规则变更时自动失效 - 缓存 key 仅依赖真正参与 LLM 输入的字段(`title`、`brief`、`description`),`image_url`、`tenant_id`、`spu_id` 不再污染缓存键,提高缓存命中率 - 更新 API 文档(`docs/搜索API对接指南-05-索引接口(Indexer).md`),说明新增参数与返回字段 技术细节: - 路由层调整:在 `api/routes/indexer.py` 的 enrich-content 端点中,将 `product_enrich.enrich_products_batch` 返回的 `enriched_taxonomy_attributes` 字段显式加入 HTTP 响应体 - `analysis_kinds` 参数透传至底层 `enrich_products_batch`,支持按需跳过某一类分析(如仅需 taxonomy 时减少 LLM 调用) - 缓存指纹计算位于 `product_enrich.py` 的 `_get_cache_key` 函数,对每种 `AnalysisSchema` 独立生成;版本号通过 `schema.version` 或 prompt 内容哈希隐式包含 - 测试覆盖:新增 `analysis_kinds` 组合场景及缓存隔离测试
08 Apr, 2026
1 commit
-
Previously, both `b` and `k1` were set to `0.0`. The original intention was to avoid two common issues in e-commerce search relevance: 1. Over-penalizing longer product titles In product search, a shorter title should not automatically rank higher just because BM25 favors shorter fields. For example, for a query like “遥控车”, a product whose title is simply “遥控车” is not necessarily a better candidate than a product with a slightly longer but more descriptive title. In practice, extremely short titles may even indicate lower-quality catalog data. 2. Over-rewarding repeated occurrences of the same term For longer queries such as “遥控喷雾翻滚多功能车玩具车”, the default BM25 behavior may give too much weight to a term that appears multiple times (for example “遥控”), even when other important query terms such as “喷雾” or “翻滚” are missing. This can cause products with repeated partial matches to outrank products that actually cover more of the user intent. Setting both parameters to zero was an intentional way to suppress length normalization and term-frequency amplification. However, after introducing a `combined_fields` query, this configuration becomes too aggressive. Since `combined_fields` scores multiple fields as a unified relevance signal, completely disabling both effects may also remove useful ranking information, especially when we still want documents matching more query terms across fields to be distinguishable from weaker matches. This update therefore relaxes the previous setting and reintroduces a controlled amount of BM25 normalization/scoring behavior. The goal is to keep the original intent — avoiding short-title bias and excessive repeated-term gain — while allowing the combined query to better preserve meaningful relevance differences across candidates. Expected effect: - reduce the bias toward unnaturally short product titles - limit score inflation caused by repeated occurrences of the same term - improve ranking stability for `combined_fields` queries - better reward candidates that cover more of the overall query intent, instead of those that only repeat a subset of terms
30 Mar, 2026
2 commits
20 Mar, 2026
1 commit