ai-saas / saas-search

08 Apr, 2026

2 commits

f27a8d90 ES documentation maintenance

tangwang
2026-04-08 16:47:20 +0800

1fdab52d This change adjusts the BM25 parameters used by the combined query.

Previously, both `b` and `k1` were set to `0.0`. The original intention
was to avoid two common issues in e-commerce search relevance:

1. Over-penalizing longer product titles
   In product search, a shorter title should not automatically rank
higher just because BM25 favors shorter fields. For example, for a query
like “遥控车”, a product whose title is simply “遥控车” is not
necessarily a better candidate than a product with a slightly longer but
more descriptive title. In practice, extremely short titles may even
indicate lower-quality catalog data.

2. Over-rewarding repeated occurrences of the same term
   For longer queries such as “遥控喷雾翻滚多功能车玩具车”, the default
BM25 behavior may give too much weight to a term that appears multiple
times (for example “遥控”), even when other important query terms such
as “喷雾” or “翻滚” are missing. This can cause products with repeated
partial matches to outrank products that actually cover more of the user
intent.

Setting both parameters to zero was an intentional way to suppress
length normalization and term-frequency amplification. However, after
introducing a `combined_fields` query, this configuration becomes too
aggressive. Since `combined_fields` scores multiple fields as a unified
relevance signal, completely disabling both effects may also remove
useful ranking information, especially when we still want documents
matching more query terms across fields to be distinguishable from
weaker matches.

This update therefore relaxes the previous setting and reintroduces a
controlled amount of BM25 normalization/scoring behavior. The goal is to
keep the original intent — avoiding short-title bias and excessive
repeated-term gain — while allowing the combined query to better
preserve meaningful relevance differences across candidates.

Expected effect:
- reduce the bias toward unnaturally short product titles
- limit score inflation caused by repeated occurrences of the same term
- improve ranking stability for `combined_fields` queries
- better reward candidates that cover more of the overall query intent,
  instead of those that only repeat a subset of terms

2026-04-08 14:39:54 +0800

03 Apr, 2026

1 commit

ccbdf870 Include the enriched_attributes.value field in search

tangwang
2026-04-03 21:11:50 +0800

01 Apr, 2026

2 commits

9df421ed Start parameter tuning based on the eval framework

tangwang
2026-04-01 20:05:22 +0800
42024409 Evaluation framework: batch labeling

tangwang
2026-04-01 16:57:58 +0800

30 Mar, 2026

2 commits

36cf0ef9 ES index result modifications

tangwang
2026-03-30 16:20:24 +0800

c3425429 Completed the fine-ranking/fusion cleanup in the following files: [search/rerank_client.py](/data/saas-search/search/rerank_client.py), [search/searcher.py](/data/saas-search/search/searcher.py), [frontend/static/js/app.js](/data/saas-search/frontend/static/js/app.js), and [tests/test_rerank_client.py](/data/saas-search/tests/test_rerank_client.py).

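Main fixes:
- Fine ranking now sorts by the fusion-stage score instead of the raw `fine_score` alone.
- The final rerank no longer depends on a separate `fine_scores` array, which could fall out of sync after the fine-ranking sort; it reads the `_fine_score` attached to each hit directly.
- Fine ranking and the final rerank now both generate their fusion debug info through the same computation path that determines the actual ordering, so the logged logic stays consistent with the production logic.
- The debug payload is clearer: both the fine-ranking and final-rerank stages expose the fusion inputs/factors and a canonical `fusion_summary`, which the frontend now renders.

Main problem: stage logic was duplicated and ran through parallel data channels: one computed the ordering, another assembled the debug fields, and a third carried auxiliary arrays. This created a risk of divergence. The refactor reduces that risk by making the stage score the single source of truth and having the debug output and frontend consume it directly instead of reconstructing it afterwards.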
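
The single-source-of-truth idea can be sketched as follows. This is an illustration only: the `fuse` helper, the multiplicative formula, and the `_text_score`, `_knn_score`, and `_fusion_debug` keys are hypothetical; only `_fine_score` and `fusion_summary` are names taken from the commit message.

```python
from typing import Any, Dict, List


def fuse(hit: Dict[str, Any]) -> Dict[str, Any]:
    """Compute the fused score and its debug record in one place.

    The formula and input keys are placeholders, not the repository's fusion.
    """
    inputs = {
        "fine_score": hit.get("_fine_score", 0.0),
        "text_score": hit.get("_text_score", 0.0),
        "knn_score": hit.get("_knn_score", 0.0),
    }
    fused = 1.0
    for value in inputs.values():
        fused *= 1.0 + value
    summary = " * ".join(f"(1+{name}={value:.3f})" for name, value in inputs.items())
    return {"fused_score": fused, "inputs": inputs, "fusion_summary": summary}


def rank_stage(hits: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Sort by the fused score and attach the same record as debug info."""
    for hit in hits:
        hit["_fusion_debug"] = fuse(hit)  # one computation feeds both uses
    return sorted(hits, key=lambda h: h["_fusion_debug"]["fused_score"], reverse=True)


hits = [
    {"id": "a", "_fine_score": 0.9, "_text_score": 0.1, "_knn_score": 0.1},
    {"id": "b", "_fine_score": 0.6, "_text_score": 0.8, "_knn_score": 0.5},
]
ranked = rank_stage(hits)
```

Here hit `a` has the higher raw fine score, yet `b` ranks first once the other signals are fused, and the debug record for each hit is the very object that produced its position.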

Verification:
- `./.venv/bin/python -m pytest -q tests/test_rerank_client.py tests/test_search_rerank_window.py`
- `./.venv/bin/python -m py_compile search/rerank_client.py search/searcher.py`

Result: `22 passed`.

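Current main flow:

1. Query parsing
2. ES recall
3. Coarse ranking: uses only ES-internal text/KNN signals
4. Style SKU selection + title suffix
5. Fine ranking: lightweight reranker + text/KNN fusion
6. Final rerank: heavy reranker + fine score + text/KNN fusion
7. Pagination, field completion, and response formatting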

The orchestration code is in [searcher.py](/data/saas-search/search/searcher.py), the scoring and rerank details are in [rerank_client.py](/data/saas-search/search/rerank_client.py), and the configuration is defined in [schema.py](/data/saas-search/config/schema.py) and [config.yaml](/data/saas-search/config/config.yaml).

**How the entry point decides which path to take**
Starting at [searcher.py:348](/data/saas-search/search/searcher.py#L348), `search()` first reads the tenant language, feature switches, and window sizes.
The key decision is at [searcher.py:364](/data/saas-search/search/searcher.py#L364) through [searcher.py:372](/data/saas-search/search/searcher.py#L372):

- `rerank_window` is currently 80; see
  [config.yaml:256](/data/saas-search/config/config.yaml#L256)
- `coarse_rank.input_window` is 700 and `output_window` is 240; see
  [config.yaml:231](/data/saas-search/config/config.yaml#L231)
- `fine_rank.input_window` is 240 and `output_window` is 80; see
  [config.yaml:245](/data/saas-search/config/config.yaml#L245)

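So if a request satisfies `from_ + size <= rerank_window`, it goes through the full funnel:
- ES actually fetches the top `700`
- Coarse ranking keeps `240`
- Fine ranking keeps `80`
- The final rerank also processes only these `80`
- Pagination slicing happens last

If the requested page extends beyond 80, the later multi-stage funnel is skipped and results are returned directly using the original ES logic.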
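
The window check can be sketched as below. The `FunnelPlan` type and `plan_funnel` function are hypothetical names introduced for illustration; only the condition `from_ + size <= rerank_window` and the sizes 700/240/80 come from the commit message, and the real logic lives in `searcher.py`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FunnelPlan:
    es_fetch_size: int  # documents requested from ES
    coarse_keep: int    # kept after coarse ranking
    fine_keep: int      # kept after fine ranking, then passed to the final rerank


def plan_funnel(from_: int, size: int, rerank_window: int = 80,
                coarse_input: int = 700, coarse_output: int = 240,
                fine_output: int = 80) -> Optional[FunnelPlan]:
    """Return the funnel plan, or None when the page falls outside the window."""
    if from_ + size <= rerank_window:
        return FunnelPlan(coarse_input, coarse_output, fine_output)
    return None  # caller falls back to plain ES paging


first_page = plan_funnel(from_=0, size=20)   # inside the window: full funnel
deep_page = plan_funnel(from_=80, size=20)   # beyond the window: ES-only path
```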

2026-03-30 12:16:05 +0800

27 Mar, 2026

1 commit

1681a135 Unify the image_embedding size config with the service at 768

tangwang
2026-03-27 11:26:54 +0800

20 Mar, 2026

1 commit

272aeabe Parameter tuning

tangwang
2026-03-20 17:37:04 +0800

10 Mar, 2026

1 commit

30f2a10b ansj -> ik

tangwang
2026-03-10 21:24:41 +0800

06 Mar, 2026

1 commit

484adbfe adapt ubuntu; conda -> venv

tangwang
2026-03-06 18:50:20 +0800

02 Mar, 2026

1 commit

89638140 Refactor the indexer document-build endpoints and test examples

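- Added the /indexer/build-docs and /indexer/build-docs-from-db endpoints: the former accepts raw SPU/SKU/Option rows passed in by the upstream caller and builds ES docs (without writing to ES); the latter, for test scenarios, queries the database internally by tenant_id + spu_ids and reuses the same document-building logic
- Adjusted the incremental and full indexing SQL and aggregation logic: removed the read of shoplazza_product_spu.compare_at_price, aggregated the maximum compare_at_price from the SKU table instead, fixed the 1054 unknown-column error, and kept the source of the ES field compare_at_price consistent with the index field description v2 document (索引字段说明v2)
- Updated SPUDocumentTransformer: improved price-range calculation, compare_at_price aggregation, and multilingual field output so that the output structure fully aligns with mappings/search_products.json and the Java-side ProductIndexDocument
- Added a README and prompts for the indexer module: they systematically describe the division of responsibilities between Java scheduling and Python enrichment, the translation cache scheme (Redis translation:{tenant_id}:{target_lang}:{md5(text)}), and how to use the HTTP endpoints
- Updated the top-level README, the search API integration guide, and the test pipeline notes: added descriptions of the dedicated indexer service (serve-indexer, port 6004), the official document-build endpoints, and manual end-to-end verification (MySQL → build-docs → ES query comparison)
- Cleaned up and corrected the ES diagnostic script docs/常用查询 - ES.md: switched uniformly to the per-tenant index search_products_tenant_{tenant_id}, and fixed outdated field names (keywords, etc.) and facet aggregation fields (dropped .keyword and used the fields in the current mapping)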
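
As an illustration of the translation cache key pattern mentioned above, a key builder could look like this. The function name, the UTF-8 encoding, and the hex digest form are assumptions; the commit message only specifies the shape `translation:{tenant_id}:{target_lang}:{md5(text)}`.

```python
import hashlib


def translation_cache_key(tenant_id: str, target_lang: str, text: str) -> str:
    """Build a key shaped like translation:{tenant_id}:{target_lang}:{md5(text)}."""
    # Assumption: the text is UTF-8 encoded and the digest is rendered as hex.
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return f"translation:{tenant_id}:{target_lang}:{digest}"


key = translation_cache_key("162", "en", "遥控车")
```

Hashing the source text keeps keys at a fixed length regardless of how long the product text is, and identical text for the same tenant and target language always maps to the same cache entry.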

Made-with: Cursor

2026-03-02 19:00:34 +0800

06 Feb, 2026

1 commit

985d7fe3 Add `*_all` semantics to all fields in filters

---

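 1. `search/es_query_builder.py`: `_all` branch

- **Plain fields** (e.g. `tags_all`, `category1_name_all`):
  - When a key ends with `_all`, the suffix is stripped first to get the ES field name.
  - If the value is an **array**: a `bool.must` containing multiple `term` clauses is generated, i.e. multi-value **AND**.
  - If the value is a **single value**: one `term` is generated.
- **specifications_all**:
  - When the value is `[{name, value}, ...]`, one nested query is generated per item and all of them are placed in the same `bool.must`, so every specification condition in the list must be satisfied (AND).

Existing logic is unchanged: for fields without `_all`, arrays are still OR (`terms`) and single values are still `term`.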

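 2. `api/models.py`: filters description

- Added to the `description` of `filters`:
  - Appending `_all` to a field name means AND (e.g. `tags_all: ['A','B']` means the item contains both A and B).
  - `specifications_all` means every specification condition in the list must be satisfied.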

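 3. `docs/搜索API对接指南.md`: documentation

- At the start of 3.3.1, states that any field name can take the `_all` suffix to mean multi-value AND.
- Added `tags_all` and `category1_name_all` to the format examples.
- In "Supported value types", states that arrays are AND when the key carries `_all`.
- Added a subsection "`*_all` semantics (multi-value AND)" describing usage and the behavior of `specifications_all`.
- In "Common filter fields", added a note that all of the above fields accept the `_all` suffix.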

---

**Usage example**

```json
{
  "filters": {
    "tags": ["手机", "促销"],
    "tags_all": ["手机", "促销", "新品"]
  }
}
```

- `tags`: matches items tagged "手机" (phone) or "促销" (promotion), or both (OR).
- `tags_all`: the item must carry all three tags "手机", "促销", and "新品" (new arrival) (AND).
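
The plain-field rules above can be sketched as a small clause builder. This is a simplified illustration, not the code in `search/es_query_builder.py`: the function name is hypothetical, and the nested handling for `specifications_all` is left out because the commit message does not give the nested field names.

```python
from typing import Any, Dict, List


def build_filter_clauses(filters: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Translate a filters dict into ES filter clauses (plain fields only)."""
    clauses: List[Dict[str, Any]] = []
    for key, value in filters.items():
        if key.endswith("_all"):
            field = key[: -len("_all")]  # strip the suffix to get the ES field
            if isinstance(value, list):
                # Every value must match: one term clause per value in bool.must.
                clauses.append({"bool": {"must": [{"term": {field: v}} for v in value]}})
            else:
                clauses.append({"term": {field: value}})
        elif isinstance(value, list):
            # Without the suffix, a list keeps OR semantics via a terms query.
            clauses.append({"terms": {key: value}})
        else:
            clauses.append({"term": {key: value}})
    return clauses


clauses = build_filter_clauses({
    "tags": ["手机", "促销"],
    "tags_all": ["手机", "促销", "新品"],
})
```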

2026-02-06 11:23:06 +0800

06 Jan, 2026

2 commits

80f87e57 Update index creation, data ingestion scripts, and docs to match the multilingual index changes

tangwang
2026-01-06 22:40:42 +0800

d7d48f52 Changes (mapping + ingestion structure)

mappings/search_products.json: changed the former title_zh/title_en/brief_zh/... fields into objects keyed by language ( /products/_doc/1 { "title": {"en":...} } ).
These fields also come with all analyzer languages preconfigured:
arabic, armenian, basque, brazilian, bulgarian, catalan, chinese, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, italian, norwegian, persian, portuguese, romanian, russian, spanish, swedish, turkish, thai

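Implemented as type: object + properties, which supports both per-language ingestion and per-language analyzers.
Index ingestion (full/incremental/transformer) has been updated accordingly:
indexer/document_transformer.py: the output changed from title_zh/title_en/... to:
title: {<primary_lang>: original text, en?: translation, zh?: translation}
brief/description/vendor follow the same pattern
category_path/category_name_text were also changed to language objects (so the query side no longer depends on the old fields)
indexer/incremental_service.py: the embedding source changed from title_en/title_zh to the title object, preferring en, then zh, then any available language.
The query side, configuration, and API/docs have been updated accordingly:
search/es_query_builder.py: query fields now uniformly use dot paths: title.zh / title.en / vendor.zh / vendor.zh.keyword / category_name_text.zh, etc.
config/config.yaml: field names in field boosts / indexes were updated to the new dot paths.
API & formatter:
api/result_formatter.py supports the new structure (and keeps a compatibility fallback for the old *_zh/_en fields).
api/models.py and related docs/examples: vendor_zh.keyword and similar names were updated to vendor.zh.keyword.
Docs/scripts: all references to old field names in docs/, README.md, and scripts/ were bulk-replaced with the new structure.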
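
The language fallback described for the embedding source (prefer en, then zh, then any available language) can be sketched as follows. The function name and the sample document are hypothetical; the actual logic is in indexer/incremental_service.py.

```python
from typing import Dict, Optional


def pick_embedding_text(title: Dict[str, str]) -> Optional[str]:
    """Prefer en, then zh, then any language with non-empty text."""
    for lang in ("en", "zh"):
        if title.get(lang):
            return title[lang]
    for text in title.values():
        if text:
            return text
    return None  # no usable title in any language


doc = {"title": {"de": "Ferngesteuertes Auto", "en": "Remote control car"}}
chosen = pick_embedding_text(doc["title"])
```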

2026-01-06 19:42:20 +0800

04 Jan, 2026

1 commit

472cca0c doc

tangwang
2026-01-04 18:15:10 +0800

20 Dec, 2025

1 commit

70dab99f add logs

tangwang
2025-12-20 14:50:13 +0800

19 Dec, 2025

2 commits

d6606d7a Clean up old code, as follows:

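1. Removed the IndexingPipeline class
File: indexer/bulk_indexer.py
Removed: the IndexingPipeline class (lines 201-259)
Removed: the no-longer-needed load_mapping import
2. Removed old code in main.py
Removed: the cmd_ingest() function (entire function)
Removed: the ingest subcommand definition
Removed: handling of the ingest command in main()
Removed: the no-longer-needed pandas import
Updated: the docstring, dropping the ingest command description
3. Removed the old data import script
Removed: data/customer1/ingest_customer1.py (depended on the deprecated DataTransformer and IndexingPipeline)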

2025-12-19 08:57:36 +0800

5ac64fc7 Multilingual queries

tangwang
2025-12-19 08:32:19 +0800

02 Dec, 2025

2 commits

33839b37 Include attribute values in search:

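1. Added a config option, searchable_option_dimensions, that controls which of the child SKUs' option1_value, option2_value, and option3_value take part in retrieval (they are indexed, and the corresponding fields are included in the search fields at query time). The format is a list containing one or more of the three.

2. The index @mappings/search_products.json needs three new fields, option1_values, option2_values, and option3_values, and the corresponding data ingestion (MySQL -> ES) modules must be updated as well. Each field is the list obtained by extracting and deduplicating option1_value, option2_value, or option3_value across the child SKUs.
Only dimensions listed in searchable_option_dimensions are indexed. For example, with searchable_option_dimensions = ['option1'], only the option1 values are extracted, deduplicated, and indexed as a list; the other two fields stay empty.

3. At query time, the index fields corresponding to searchable_option_dimensions are added to the multi_match fields with a weight of 0.5 (the weights of all fields are configured together in one place).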

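1. Config file changes (config/config.yaml)
✅ Added the searchable_option_dimensions option to spu_config, defaulting to ['option1', 'option2', 'option3']
✅ Added three new field definitions, option1_values, option2_values, and option3_values, with type KEYWORD and weight 0.5
✅ Added these three fields to the fields list of the default index domain so that they take part in search
2. ES index mapping changes (mappings/search_products.json)
✅ Added three new fields, option1_values, option2_values, and option3_values, with type keyword
3. Config loader changes (config/config_loader.py)
✅ Added the searchable_option_dimensions field to the SPUConfig class
✅ Updated the config parsing logic to read searchable_option_dimensions
✅ Updated the logic that converts the config to a dict
4. Data ingestion changes (indexer/spu_transformer.py)
✅ Loads the config at initialization to get searchable_option_dimensions
✅ Added logic to the _transform_spu_to_doc method:
Extracts the option1, option2, and option3 values from all child SKUs
Deduplicates them and stores them in option1_values, option2_values, and option3_values
Uses the config to decide which fields actually receive data (unconfigured fields are written as empty arrays)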
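
The extraction step can be sketched as below. This is an illustration under stated assumptions, not the code in indexer/spu_transformer.py: the function name is hypothetical, child SKUs are modeled as plain dicts, and preserving first-seen order while deduplicating is a choice made here for determinism.

```python
from typing import Any, Dict, Iterable, List

ALL_DIMENSIONS = ("option1", "option2", "option3")


def collect_option_values(skus: Iterable[Dict[str, Any]],
                          searchable_option_dimensions: List[str]) -> Dict[str, List[str]]:
    """Build option{1,2,3}_values; unconfigured dimensions get an empty list."""
    sku_list = list(skus)
    result: Dict[str, List[str]] = {}
    for dim in ALL_DIMENSIONS:
        values: List[str] = []
        if dim in searchable_option_dimensions:
            for sku in sku_list:
                value = sku.get(f"{dim}_value")
                # Skip empty values; keep first-seen order while deduplicating.
                if value and value not in values:
                    values.append(value)
        result[f"{dim}_values"] = values
    return result


skus = [
    {"option1_value": "red", "option2_value": "S"},
    {"option1_value": "red", "option2_value": "M"},
    {"option1_value": "blue", "option2_value": "M"},
]
fields = collect_option_values(skus, ["option1"])
```

With only `option1` configured, `option2_values` and `option3_values` come back as empty lists even though the SKUs carry `option2_value` data, which matches the behavior described in the commit message.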


2025-12-02 18:35:50 +0800

c4263d93 Support sku_filter_dimension
```
sku_filter_dimension=color
sku_filter_dimension=option1 / option2 /option3
Both of the above forms are accepted
```
tangwang
2025-12-02 15:40:32 +0800

29 Nov, 2025

1 commit

a10a89a3 Build test data for testing facets on category and the three attribute types.

tangwang
2025-11-29 09:53:31 +0800