08 Apr, 2026
1 commit
-
Previously, both `b` and `k1` were set to `0.0`. The original intention was to avoid two common issues in e-commerce search relevance: 1. Over-penalizing longer product titles In product search, a shorter title should not automatically rank higher just because BM25 favors shorter fields. For example, for a query like “遥控车”, a product whose title is simply “遥控车” is not necessarily a better candidate than a product with a slightly longer but more descriptive title. In practice, extremely short titles may even indicate lower-quality catalog data. 2. Over-rewarding repeated occurrences of the same term For longer queries such as “遥控喷雾翻滚多功能车玩具车”, the default BM25 behavior may give too much weight to a term that appears multiple times (for example “遥控”), even when other important query terms such as “喷雾” or “翻滚” are missing. This can cause products with repeated partial matches to outrank products that actually cover more of the user intent. Setting both parameters to zero was an intentional way to suppress length normalization and term-frequency amplification. However, after introducing a `combined_fields` query, this configuration becomes too aggressive. Since `combined_fields` scores multiple fields as a unified relevance signal, completely disabling both effects may also remove useful ranking information, especially when we still want documents matching more query terms across fields to be distinguishable from weaker matches. This update therefore relaxes the previous setting and reintroduces a controlled amount of BM25 normalization/scoring behavior. The goal is to keep the original intent — avoiding short-title bias and excessive repeated-term gain — while allowing the combined query to better preserve meaningful relevance differences across candidates. Expected effect: - reduce the bias toward unnaturally short product titles - limit score inflation caused by repeated occurrences of the same term - improve ranking stability for `combined_fields` queries - better reward candidates that cover more of the overall query intent, instead of those that only repeat a subset of terms
30 Mar, 2026
5 commits
-
一、tags字段改支持多语言: spu表tags字段,跟title走一样的翻译逻辑,填入原始语言、zh、en。 检查以下字段,都跟title一样走翻译逻辑 title keywords tags brief description vendor category_path category_name_text 二、/indexer/enrich-content接口的修改 1. 请求参数,把language去掉,因为我返回的内容直接对应索引结构,不用你做处理了,因此不需要指定语言,降低耦合。 2. 返回 enriched_attributes enriched_tags qanchors三个字段,按原始内容填入。 3. enriched_tags是本次新增的,注意区别于tags字段。tags字段来源于mysql spu表,enriched_tags是本接口返回的。 三、specifications的value,需要翻译,也是需要填中英文: { "specifications": [ { "sku_id": "sku-red-s", "name": "color", "value_keyword": "красный", "value_text": { "zh": "红色", "en": "red" } } ] }
27 Mar, 2026
1 commit
21 Mar, 2026
1 commit
16 Mar, 2026
1 commit
10 Mar, 2026
1 commit
05 Mar, 2026
1 commit
-
2. 修改索引配置: 向量改为bf16
02 Mar, 2026
1 commit
-
- 新增 indexer/process_products.analyze_products 接口,封装对 DashScope LLM 的调用逻辑,支持 zh/en/de/ru/fr 多语言输出,并结构化返回 anchor_text、tags、usage_scene、target_audience、season、key_attributes、material、features 等字段,既可脚本批处理也可在索引阶段按需调用。 - 在 SPUDocumentTransformer 中引入 _fill_llm_attributes,按租户 index_languages 与支持语言的交集,对每个 SPU/语言调用 analyze_products,默认开启 LLM 增强:成功时为 doc 填充 qanchors.{lang}(query 风格锚文本)以及 nested semantic_attributes(lang/name/value) 语义维度信息,失败时仅打 warn 日志并优雅降级,不影响主索引链路。 - 扩展 search_products.json mapping,在商品文档上新增 nested 字段 semantic_attributes(lang/name/value),以通用三元组形式承载 LLM 抽取的场景、人群、材质、风格等可变维度,为后续按语义维度做过滤和分面聚合提供统一的结构化载体。 - 编写 indexer/ANCHORS_AND_SEMANTIC_ATTRIBUTES.md 设计文档,系统梳理 qanchors 与 semantic_attributes 的字段含义、索引与多语言策略、与 suggestion 构建器的集成方式以及在搜索过滤/分面中的推荐用法,方便后续维护与功能扩展。 Made-with: Cursor
06 Jan, 2026
2 commits
-
mappings/search_products.json:把原来的 title_zh/title_en/brief_zh/... 改成 按语言 key 的对象结构( /products/_doc/1 { "title": {"en":...} } ) 同时在这些字段下 预置了全部 analyzer 语言: arabic, armenian, basque, brazilian, bulgarian, catalan, chinese, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, italian, norwegian, persian, portuguese, romanian, russian, spanish, swedish, turkish, thai 实现为 type: object + properties,同时满足“按语言灌入”和“按语言 analyzer”。 索引灌入(全量/增量/transformer)已同步改完 indexer/document_transformer.py:输出从 title_zh/title_en/... 改为: title: {<primary_lang>: 原文, en?: 翻译, zh?: 翻译} brief/description/vendor 同理 category_path/category_name_text 也改为语言对象(避免查询侧继续依赖旧字段) indexer/incremental_service.py:embedding 取值从 title_en/title_zh 改为从 title 对象里优先取 en,否则取 zh,否则取任一可用语言。 查询侧与配置、API/文档已同步 search/es_query_builder.py:查询字段统一改成点路径:title.zh / title.en / vendor.zh / vendor.zh.keyword / category_name_text.zh 等。 config/config.yaml:field boosts / indexes 里的字段名同步为新点路径。 API & formatter: api/result_formatter.py 已支持新结构(并保留对旧 *_zh/_en 的兼容兜底)。 api/models.py、相关 docs/examples 里的 vendor_zh.keyword 等已更新为 vendor.zh.keyword。 文档/脚本:docs/、README.md、scripts/ 里所有旧字段名引用已批量替换为新结构。
26 Dec, 2025
1 commit
22 Dec, 2025
1 commit
19 Dec, 2025
1 commit
25 Nov, 2025
1 commit
-
mappings/search_products.json - 完整的ES索引配置(settings + mappings) 基于 docs/索引字段说明v2-mapping结构.md 简化 mapping_generator.py 移除所有config依赖 直接使用 load_mapping() 从JSON文件加载 保留工具函数:create_index_if_not_exists, delete_index_if_exists, update_mapping 更新数据导入脚本 scripts/ingest_shoplazza.py - 移除ConfigLoader依赖 直接使用 load_mapping() 和 DEFAULT_INDEX_NAME 更新indexer模块 indexer/__init__.py - 更新导出 indexer/bulk_indexer.py - 简化IndexingPipeline,移除config依赖 创建查询配置常量 search/query_config.py - 硬编码字段列表和配置项 使用方式 创建索引: from indexer.mapping_generator import load_mapping, create_index_if_not_existsfrom utils.es_client import ESClientes_client = ESClient(hosts=["http://localhost:9200"])mapping = load_mapping()create_index_if_not_exists(es_client, "search_products", mapping) 数据导入: python scripts/ingest_shoplazza.py \ --db-host localhost \ --db-database saas \ --db-username root \ --db-password password \ --tenant-id "1" \ --es-host http://localhost:9200 \ --recreate 注意事项 修改mapping:直接编辑 mappings/search_products.json 字段映射:spu_transformer.py 中硬编码,与mapping保持一致 config目录:保留但不再使用,可后续清理 search模块:仍依赖config