ai-saas / saas-search

08 Apr, 2026

1 commit

1fdab52d This change adjusts the BM25 parameters used by the combined query. ...

Previously, both `b` and `k1` were set to `0.0`. The original intention
was to avoid two common issues in e-commerce search relevance:

1. Over-penalizing longer product titles
   In product search, a shorter title should not automatically rank
   higher just because BM25 favors shorter fields. For example, for a
   query like “遥控车” (remote-control car), a product whose title is
   simply “遥控车” is not necessarily a better candidate than a product
   with a slightly longer but more descriptive title. In practice,
   extremely short titles may even indicate lower-quality catalog data.

2. Over-rewarding repeated occurrences of the same term
   For longer queries such as “遥控喷雾翻滚多功能车玩具车” (roughly
   "remote-control spraying tumbling multifunction toy car"), the
   default BM25 behavior may give too much weight to a term that appears
   multiple times (for example “遥控”, remote-control), even when other
   important query terms such as “喷雾” (spray) or “翻滚” (tumble) are
   missing. This can cause products with repeated partial matches to
   outrank products that actually cover more of the user intent.

Setting both parameters to zero was an intentional way to suppress
length normalization and term-frequency amplification: with `k1 = 0`
the term-frequency component of BM25 is constant, so every matching term
contributes only its IDF weight, regardless of how often it occurs or
how long the field is. However, after introducing a `combined_fields`
query, this configuration is too aggressive. Since `combined_fields`
scores multiple fields as a unified relevance signal, completely
disabling both effects also removes useful ranking information,
especially when we still want documents matching more query terms across
fields to be distinguishable from weaker matches.
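
The effect of the two parameters can be illustrated with a minimal sketch of the classic BM25 term-frequency component (IDF omitted). The function name is hypothetical, and the non-zero values in the usage note are Elasticsearch's documented defaults, used purely for illustration. They are not the values chosen by this commit.

```python
def bm25_tf_component(tf: float, doc_len: float, avg_len: float,
                      k1: float, b: float) -> float:
    """Term-frequency part of a classic BM25 term score (IDF factor omitted).

    tf      -- occurrences of the term in the field
    doc_len -- length of the field, in terms
    avg_len -- average field length across the index
    """
    if tf <= 0:
        return 0.0
    # b controls how strongly the field length scales the saturation point.
    length_norm = 1.0 - b + b * (doc_len / avg_len)
    # k1 controls how quickly repeated occurrences stop adding score.
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)
```

With `k1 = 0.0` and `b = 0.0` the function returns `1.0` for any matching term, whatever `tf` and `doc_len` are, which is the behavior described above. With the defaults `k1 = 1.2`, `b = 0.75`, a term repeated five times scores higher than a single occurrence, and the same match in a longer field scores lower.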

This update therefore relaxes the previous setting and reintroduces a
controlled amount of BM25 term-frequency saturation and length
normalization. The goal is to keep the original intent (avoiding
short-title bias and excessive repeated-term gain) while allowing the
combined query to better preserve meaningful relevance differences
across candidates.

Expected effect:
- reduce the bias toward unnaturally short product titles
- limit score inflation caused by repeated occurrences of the same term
- improve ranking stability for `combined_fields` queries
- better reward candidates that cover more of the overall query intent,
  instead of those that only repeat a subset of terms
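
For reference, a hedged sketch of how such parameters could be expressed as Elasticsearch index settings, here by overriding the index's default similarity. Whether this repository sets the default similarity or a named per-field similarity is not stated in the commit. The helper name is hypothetical and the values in the usage note are placeholders, not the ones actually committed.

```python
def bm25_similarity_settings(k1: float, b: float) -> dict:
    """Build index settings that make BM25 with the given k1/b the default similarity."""
    if k1 < 0 or not 0.0 <= b <= 1.0:
        raise ValueError("expected k1 >= 0 and 0 <= b <= 1")
    return {
        "index": {
            "similarity": {
                # "default" replaces the similarity used by all text fields
                # that do not name one explicitly.
                "default": {"type": "BM25", "k1": k1, "b": b}
            }
        }
    }
```

For example, `bm25_similarity_settings(k1=0.3, b=0.1)` (placeholder values) yields a small but non-zero amount of both effects, whereas `bm25_similarity_settings(0.0, 0.0)` reproduces the previous configuration.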

2026-04-08 14:39:54 +0800

30 Mar, 2026

5 commits

de98daa3 Optimize multimodal recall

tangwang
2026-03-30 20:59:37 +0800

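6c35aff8 Index structure changes: ...

I. The tags field now supports multiple languages:
The tags field of the spu table follows the same translation logic as title: it is filled with the original language, zh, and en.

Check that the following fields all follow the same translation logic as title: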
title
keywords
tags
brief
description
vendor
category_path
category_name_text
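
The shared translation logic for the fields listed above could look roughly like the sketch below. The function name and the `translate` callable are hypothetical stand-ins for whatever translation service the indexer actually uses.

```python
from typing import Callable, Dict, Iterable


def build_multilang_field(
    text: str,
    source_lang: str,
    translate: Callable[[str, str, str], str],
    target_langs: Iterable[str] = ("zh", "en"),
) -> Dict[str, str]:
    """Return {source_lang: original, zh: ..., en: ...} for one text field.

    translate(text, source_lang, target_lang) is an assumed interface.
    """
    field = {source_lang: text}
    for lang in target_langs:
        # Never overwrite the original text with a translation of itself.
        if lang not in field:
            field[lang] = translate(text, source_lang, lang)
    return field
```

The same helper would be applied to title, keywords, tags, brief, description, vendor, category_path, and category_name_text, so that all of them share one structure.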

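II. Changes to the /indexer/enrich-content endpoint
1. Remove language from the request parameters. The returned content now maps directly onto the index structure and needs no further processing on your side, so no language has to be specified, which reduces coupling.
2. Return three fields, enriched_attributes, enriched_tags, and qanchors, filled in with the original content.
3. enriched_tags is new in this change; do not confuse it with the tags field. tags comes from the MySQL spu table, whereas enriched_tags is returned by this endpoint.

III. The value of each specifications entry must be translated and filled in for both Chinese and English as well: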
{
  "specifications": [
    {
      "sku_id": "sku-red-s",
      "name": "color",
      "value_keyword": "красный",
      "value_text": {
        "zh": "红色",
        "en": "red"
      }
    }
  ]
}

2026-03-30 19:12:26 +0800

d350861f Index structure changes

tangwang
2026-03-30 18:59:50 +0800

fca871fb Index field changes

tangwang
2026-03-30 17:25:33 +0800

36cf0ef9 ES index result changes

tangwang
2026-03-30 16:20:24 +0800

27 Mar, 2026

1 commit

1681a135 Unify the image_embedding size config with the service at 768

tangwang
2026-03-27 11:26:54 +0800

21 Mar, 2026

1 commit

00c8ddb9 suggest rank optimize

tangwang
2026-03-21 19:41:23 +0800

16 Mar, 2026

1 commit

2d17b98e sugg

tangwang
2026-03-16 17:38:34 +0800

10 Mar, 2026

1 commit

654f20d1 Switch the tokenizer to ik

tangwang
2026-03-10 17:05:31 +0800

05 Mar, 2026

1 commit

52ae85fb 1. ES docs ...

2. Index configuration change: vectors switched to bf16

tangwang
2026-03-05 23:27:42 +0800

02 Mar, 2026

1 commit

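d54b0467 feat: add qanchors and semantic attributes to the product index ...

- Add the indexer/process_products.analyze_products interface, which wraps the call to the DashScope LLM, supports multilingual output in zh/en/de/ru/fr, and returns structured fields such as anchor_text, tags, usage_scene, target_audience, season, key_attributes, material, and features. It can be used for script-based batch processing or called on demand during indexing.
- Introduce _fill_llm_attributes in SPUDocumentTransformer. For each SPU and each language in the intersection of the tenant's index_languages and the supported languages, it calls analyze_products, with LLM enrichment enabled by default. On success it fills the doc with qanchors.{lang} (query-style anchor text) and the nested semantic_attributes (lang/name/value) semantic dimensions; on failure it only logs a warning and degrades gracefully without affecting the main indexing path.
- Extend the search_products.json mapping with a new nested field semantic_attributes (lang/name/value) on product documents. It carries the variable dimensions extracted by the LLM, such as scene, audience, material, and style, as generic triples, giving later filtering and faceted aggregation by semantic dimension a single structured carrier.
- Write the design document indexer/ANCHORS_AND_SEMANTIC_ATTRIBUTES.md, which lays out the field meanings of qanchors and semantic_attributes, the indexing and multilingual strategy, the integration with the suggestion builder, and the recommended usage in search filtering and faceting, to ease later maintenance and extension.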
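
The enrichment step described in the bullets above can be sketched as follows. This is an illustration under assumptions, not the code of `_fill_llm_attributes`: the function name, the `analyze` callable, and the choice of which result keys become semantic attributes are all hypothetical.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger(__name__)

# Languages the commit message says analyze_products can output.
SUPPORTED_LANGS = ("zh", "en", "de", "ru", "fr")
# Assumed subset of result keys to store as (lang, name, value) triples.
ATTRIBUTE_NAMES = ("usage_scene", "target_audience", "season", "material")


def fill_llm_attributes(doc: dict, index_languages: Iterable[str],
                        analyze: Callable[[dict, str], dict]) -> dict:
    """Enrich doc in place with qanchors and semantic_attributes.

    analyze(doc, lang) is an assumed wrapper around the LLM call.
    """
    for lang in (l for l in index_languages if l in SUPPORTED_LANGS):
        try:
            result = analyze(doc, lang)
        except Exception as exc:  # degrade gracefully, keep indexing
            logger.warning("LLM enrichment failed for lang=%s: %s", lang, exc)
            continue
        if result.get("anchor_text"):
            doc.setdefault("qanchors", {})[lang] = result["anchor_text"]
        for name in ATTRIBUTE_NAMES:
            if result.get(name):
                doc.setdefault("semantic_attributes", []).append(
                    {"lang": lang, "name": name, "value": result[name]}
                )
    return doc
```

A failure for one language leaves the fields already filled for other languages untouched, so the document is still indexed with whatever enrichment succeeded.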

Made-with: Cursor

2026-03-02 23:24:41 +0800

06 Jan, 2026

2 commits

430ffe48 Multilingual index adjustments

tangwang
2026-01-06 20:20:09 +0800

d7d48f52 Changes (mapping + ingestion structure) ...

mappings/search_products.json: replace the former title_zh/title_en/brief_zh/... fields with an object structure keyed by language ( /products/_doc/1 { "title": {"en":...} } ).
All analyzer languages are also pre-configured under these fields:
arabic, armenian, basque, brazilian, bulgarian, catalan, chinese, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, italian, norwegian, persian, portuguese, romanian, russian, spanish, swedish, turkish, thai

Implemented as type: object + properties, which satisfies both "ingest by language" and "analyzer per language".
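
A minimal sketch of what such an object-with-properties mapping fragment could look like, generated in Python. The helper name and the pairing of language keys with analyzers are assumptions for illustration; the real pairs live in mappings/search_products.json.

```python
def multilang_text_field(analyzers: dict) -> dict:
    """Build a mapping fragment: one text sub-field per language key.

    analyzers maps a language key (e.g. "en") to an analyzer name
    (e.g. "english"); the pairs passed in are the caller's choice.
    """
    return {
        "type": "object",
        "properties": {
            lang: {"type": "text", "analyzer": analyzer}
            for lang, analyzer in analyzers.items()
        },
    }
```

For example, `{"title": multilang_text_field({"en": "english", "de": "german"})}` lets a document carry `"title": {"en": ..., "de": ...}` while each language is analyzed by its own analyzer.
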
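Index ingestion (full/incremental/transformer) has been updated accordingly:
indexer/document_transformer.py: the output changes from title_zh/title_en/... to:
title: {<primary_lang>: original text, en?: translation, zh?: translation}
brief/description/vendor follow the same pattern
category_path/category_name_text are also changed to language objects (so that the query side no longer depends on the old fields)
indexer/incremental_service.py: the embedding input changes from title_en/title_zh to the title object, preferring en, then zh, then any available language.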
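
The embedding text selection order (en, then zh, then any available language) can be sketched as below. The function name is hypothetical, and treating whitespace-only values as missing is an assumption.

```python
from typing import Mapping, Optional


def pick_embedding_text(title: Optional[Mapping[str, str]]) -> Optional[str]:
    """Choose the title text to embed: en first, then zh, then anything."""
    if not title:
        return None
    for lang in ("en", "zh"):
        text = title.get(lang)
        if text and text.strip():
            return text
    # Fall back to the first non-empty value in any other language.
    for text in title.values():
        if text and text.strip():
            return text
    return None
```
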
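The query side, configuration, and API/docs have been updated accordingly:
search/es_query_builder.py: query fields are unified to dotted paths: title.zh / title.en / vendor.zh / vendor.zh.keyword / category_name_text.zh, etc.
config/config.yaml: field names in field boosts / indexes are updated to the new dotted paths.
API & formatter:
api/result_formatter.py supports the new structure (and keeps a compatibility fallback for the old *_zh/_en fields).
In api/models.py and the related docs/examples, vendor_zh.keyword and similar names are updated to vendor.zh.keyword.
Docs/scripts: all references to the old field names in docs/, README.md, and scripts/ have been replaced in bulk with the new structure.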
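
The formatter's compatibility fallback could work along the lines of this sketch, which reads the new language-keyed object first and only then the legacy flat fields. The function name and the fallback language order are assumptions, not taken from api/result_formatter.py.

```python
from typing import Optional, Sequence


def get_localized(source: dict, field: str, lang: str,
                  fallback_langs: Sequence[str] = ("en", "zh")) -> Optional[str]:
    """Read a localized field from an ES _source dict."""
    candidates = (lang, *fallback_langs)
    value = source.get(field)
    # New structure: {"title": {"zh": ..., "en": ...}}
    if isinstance(value, dict):
        for candidate in candidates:
            if value.get(candidate):
                return value[candidate]
    # Legacy structure: {"title_zh": ..., "title_en": ...}
    for candidate in candidates:
        legacy = source.get(f"{field}_{candidate}")
        if legacy:
            return legacy
    return None
```

Documents indexed before the migration keep rendering through the legacy branch until they are reindexed.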

2026-01-06 19:42:20 +0800

26 Dec, 2025

1 commit

15eae5ee add image_embedding_512

tangwang
2025-12-26 16:49:32 +0800

22 Dec, 2025

1 commit

9c712e64 Add index fields qanchors and keywords

tangwang
2025-12-22 12:32:06 +0800

19 Dec, 2025

1 commit

5c2b70a2 search_products.json

tangwang
2025-12-19 11:19:58 +0800

25 Nov, 2025

1 commit

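59b0a342 Create a hand-written mapping JSON ...

mappings/search_products.json - complete ES index configuration (settings + mappings)
Based on docs/索引字段说明v2-mapping结构.md
Simplify mapping_generator.py
Remove all config dependencies
Load directly from the JSON file with load_mapping()
Keep the utility functions: create_index_if_not_exists, delete_index_if_exists, update_mapping
Update the data import script
scripts/ingest_shoplazza.py - remove the ConfigLoader dependency
Use load_mapping() and DEFAULT_INDEX_NAME directly
Update the indexer module
indexer/__init__.py - update exports
indexer/bulk_indexer.py - simplify IndexingPipeline, remove the config dependency
Create query configuration constants
search/query_config.py - hard-coded field lists and configuration items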

Usage
Create the index:
    from indexer.mapping_generator import load_mapping, create_index_if_not_exists
    from utils.es_client import ESClient

    es_client = ESClient(hosts=["http://localhost:9200"])
    mapping = load_mapping()
    create_index_if_not_exists(es_client, "search_products", mapping)
Import data:
    python scripts/ingest_shoplazza.py \
        --db-host localhost \
        --db-database saas \
        --db-username root \
        --db-password password \
        --tenant-id "1" \
        --es-host http://localhost:9200 \
        --recreate

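Notes
Changing the mapping: edit mappings/search_products.json directly
Field mapping: hard-coded in spu_transformer.py; keep it consistent with the mapping
config directory: kept but no longer used; can be cleaned up later
search module: still depends on config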
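
Since the mapping is now edited by hand, a loader that checks the file's shape can catch mistakes early. The sketch below is not the repository's load_mapping(); the function name, the default path constant, and the validation step are assumptions based only on the description "settings + mappings".

```python
import json
from pathlib import Path

# Assumed location, taken from the path mentioned in this commit message.
DEFAULT_MAPPING_PATH = Path("mappings/search_products.json")


def load_mapping_sketch(path=DEFAULT_MAPPING_PATH) -> dict:
    """Load the index definition and check that both top-level sections exist."""
    with Path(path).open(encoding="utf-8") as fh:
        mapping = json.load(fh)
    missing = {"settings", "mappings"} - mapping.keys()
    if missing:
        raise ValueError(f"{path} is missing sections: {sorted(missing)}")
    return mapping
```

Failing fast here means a truncated or partially edited JSON file is rejected before any index is created from it.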

2025-11-25 22:46:51 +0800