ai-saas / saas-search

09 Apr, 2026

1 commit

5aaf0c7d feat(indexer): refine enriched_taxonomy_attributes API output and cache design

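- The `/indexer/enrich-content` route now returns
  `enriched_taxonomy_attributes` alongside `enriched_attributes`
- Add the request parameter `analysis_kinds` (optional, default
  `["content", "taxonomy"]`) so callers can choose which content analysis
  types to run, leaving room for future extension and cost control
- Rework the caching strategy: the caches for the `content` and `taxonomy`
  analyses are fully isolated, and the cache key includes the prompt
  template, table headers, and output field definitions (i.e. the schema
  fingerprint), so entries are invalidated automatically whenever the
  prompt or parsing rules change
- The cache key depends only on the fields that actually feed the LLM
  input (`title`, `brief`, `description`); `image_url`, `tenant_id`, and
  `spu_id` no longer pollute the cache key, which improves the hit rate
- Update the API documentation
  (`docs/搜索API对接指南-05-索引接口（Indexer）.md`) to describe the new
  parameter and response field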
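
The default handling of `analysis_kinds` described above could look roughly
like the following. This is a minimal sketch, not the code in the commit:
the helper name `resolve_analysis_kinds` and the `SUPPORTED_KINDS` constant
are hypothetical, and rejecting unknown kinds with `ValueError` is an
assumption about how validation might behave.

```python
from typing import List, Optional, Sequence

# Assumption: the supported kinds are exactly the two named in the default.
SUPPORTED_KINDS = ("content", "taxonomy")


def resolve_analysis_kinds(requested: Optional[Sequence[str]]) -> List[str]:
    """Return the analysis kinds to run, applying the documented default."""
    if requested is None:
        # Parameter omitted: run both analyses.
        return list(SUPPORTED_KINDS)
    unknown = [kind for kind in requested if kind not in SUPPORTED_KINDS]
    if unknown:
        raise ValueError(f"unsupported analysis_kinds: {unknown}")
    # Drop duplicates while keeping the caller's order.
    return list(dict.fromkeys(requested))
```

A caller that only needs taxonomy attributes would pass `["taxonomy"]`, so
the content analysis and its LLM calls are skipped.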

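Technical details:
- Route layer: in the enrich-content endpoint of `api/routes/indexer.py`,
  the `enriched_taxonomy_attributes` field returned by
  `product_enrich.enrich_products_batch` is now added explicitly to the
  HTTP response body
- The `analysis_kinds` parameter is passed through to the underlying
  `enrich_products_batch`, so one kind of analysis can be skipped on
  demand (for example, fewer LLM calls when only taxonomy is needed)
- The cache fingerprint is computed in the `_get_cache_key` function in
  `product_enrich.py`, independently for each `AnalysisSchema`; the
  version is included implicitly through `schema.version` or a hash of
  the prompt content
- Test coverage: new tests for `analysis_kinds` combinations and for
  cache isolation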
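
To illustrate the cache design, here is a minimal sketch of a key built
from a schema fingerprint plus the LLM input fields only. It is not the
actual `_get_cache_key` implementation: the function names, the
`enrich:` key prefix, the use of SHA-256, and the plain-dict product
representation are all assumptions made for this example.

```python
import hashlib
import json

# Only these fields are described as reaching the LLM input.
LLM_INPUT_FIELDS = ("title", "brief", "description")


def schema_fingerprint(prompt_template, headers, output_fields):
    """Hash everything that defines how a result is produced and parsed."""
    payload = json.dumps(
        {
            "prompt": prompt_template,
            "headers": list(headers),
            "fields": list(output_fields),
        },
        ensure_ascii=False,
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def make_cache_key(kind, fingerprint, product):
    """Build a key from the analysis kind, schema fingerprint, and LLM input."""
    # image_url, tenant_id, spu_id and any other fields are ignored on purpose.
    llm_input = {name: product.get(name) or "" for name in LLM_INPUT_FIELDS}
    body = json.dumps(llm_input, ensure_ascii=False, sort_keys=True)
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return f"enrich:{kind}:{fingerprint}:{digest}"
```

Under these assumptions, two tenants that submit the same title, brief,
and description share a cache entry, while editing the prompt template
changes the fingerprint and therefore every key for that analysis kind.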

2026-04-09 13:09:14 +0800

08 Apr, 2026

1 commit

1fdab52d This change adjusts the BM25 parameters used by the combined query.

Previously, both `b` and `k1` were set to `0.0`. The original intention
was to avoid two common issues in e-commerce search relevance:

1. Over-penalizing longer product titles
   In product search, a shorter title should not automatically rank
   higher just because BM25 favors shorter fields. For example, for a
   query like “遥控车”, a product whose title is simply “遥控车” is not
   necessarily a better candidate than a product with a slightly longer
   but more descriptive title. In practice, extremely short titles may
   even indicate lower-quality catalog data.

2. Over-rewarding repeated occurrences of the same term
   For longer queries such as “遥控喷雾翻滚多功能车玩具车”, the default
   BM25 behavior may give too much weight to a term that appears multiple
   times (for example “遥控”), even when other important query terms such
   as “喷雾” or “翻滚” are missing. This can cause products with repeated
   partial matches to outrank products that actually cover more of the
   user intent.

Setting both parameters to zero was an intentional way to suppress
length normalization and term-frequency amplification. However, after
introducing a `combined_fields` query, this configuration becomes too
aggressive. Since `combined_fields` scores multiple fields as a unified
relevance signal, completely disabling both effects may also remove
useful ranking information, especially when we still want documents
matching more query terms across fields to be distinguishable from
weaker matches.

This update therefore relaxes the previous setting and reintroduces a
controlled amount of BM25 normalization/scoring behavior. The goal is to
keep the original intent — avoiding short-title bias and excessive
repeated-term gain — while allowing the combined query to better
preserve meaningful relevance differences across candidates.

Expected effect:
- reduce the bias toward unnaturally short product titles
- limit score inflation caused by repeated occurrences of the same term
- improve ranking stability for `combined_fields` queries
- better reward candidates that cover more of the overall query intent,
  instead of those that only repeat a subset of terms

2026-04-08 14:39:54 +0800

30 Mar, 2026

2 commits

d350861f Modify index structure

tangwang
2026-03-30 18:59:50 +0800
36cf0ef9 Modify ES index results

tangwang
2026-03-30 16:20:24 +0800

20 Mar, 2026

1 commit

0342d897 Split the Search API integration guide

tangwang
2026-03-20 08:48:23 +0800