08 Apr, 2026

1 commit

  • Previously, both `b` and `k1` were set to `0.0`. The original intention
    was to avoid two common issues in e-commerce search relevance:
    
    1. Over-penalizing longer product titles
       In product search, a shorter title should not automatically rank
       higher just because BM25 favors shorter fields. For example, for a
       query like “遥控车” (remote-control car), a product whose title is
       simply “遥控车” is not necessarily a better candidate than a product
       with a slightly longer but more descriptive title. In practice,
       extremely short titles may even indicate lower-quality catalog data.
    
    2. Over-rewarding repeated occurrences of the same term
       For longer queries such as “遥控喷雾翻滚多功能车玩具车”
       (remote-control spraying, tumbling multifunction toy car), the
       default BM25 behavior may give too much weight to a term that
       appears multiple times in a product (for example “遥控”,
       remote-control), even when other important query terms such as
       “喷雾” (spray) or “翻滚” (tumbling) are missing. This can cause
       products with repeated partial matches to outrank products that
       actually cover more of the user intent.
    
    Setting both parameters to zero was an intentional way to suppress
    length normalization and term-frequency amplification. However, after
    introducing a `combined_fields` query, this configuration becomes too
    aggressive. Since `combined_fields` scores multiple fields as a unified
    relevance signal, completely disabling both effects may also remove
    useful ranking information, especially when we still want documents
    matching more query terms across fields to be distinguishable from
    weaker matches.
    
    This update therefore relaxes the previous setting and reintroduces a
    controlled amount of BM25 length normalization and term-frequency
    scaling. The goal is to keep the original intent (avoiding short-title
    bias and excessive repeated-term gain) while allowing the combined
    query to better preserve meaningful relevance differences across
    candidates.
    
    Expected effect:
    - reduce the bias toward unnaturally short product titles
    - limit score inflation caused by repeated occurrences of the same term
    - improve ranking stability for `combined_fields` queries
    - better reward candidates that cover more of the overall query intent,
      instead of those that only repeat a subset of terms
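
As a rough illustration of why zeroing both parameters flattens scores, the sketch below uses the textbook single-field BM25 term score. It is not necessarily the exact variant the search engine applies, and it does not model the multi-field combination done by `combined_fields`. The function name and all sample numbers are assumptions made up for illustration.

```python
def bm25_term_score(tf, doc_len, avg_doc_len, idf, k1=1.2, b=0.75):
    """Textbook BM25 score of one query term in one field (illustrative only)."""
    if tf <= 0:
        return 0.0
    # b controls how strongly field length scales the tf denominator.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences of the term saturate.
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)


# Hypothetical inputs: the same term (idf = 2.0), average title length of 10 tokens.
short_once = dict(tf=1, doc_len=3, avg_doc_len=10, idf=2.0)
long_thrice = dict(tf=3, doc_len=30, avg_doc_len=10, idf=2.0)

# With k1 = 0 and b = 0, term frequency and field length no longer matter:
# every field that contains the term gets the same score (the idf).
flat_short = bm25_term_score(**short_once, k1=0.0, b=0.0)
flat_long = bm25_term_score(**long_thrice, k1=0.0, b=0.0)

# With non-zero parameters the two candidates become distinguishable again.
default_short = bm25_term_score(**short_once)
default_long = bm25_term_score(**long_thrice)
```

Intermediate values of `k1` and `b` sit between these two extremes, which is the kind of trade-off the commit describes.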
    tangwang
     

30 Mar, 2026

5 commits

  • tangwang
     
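  • I. Make the tags field multilingual:
    The tags field of the spu table now follows the same translation logic as title and is filled with the original language, zh, and en.
    
    Check that all of the following fields go through the same translation logic as title:
    - title
    - keywords
    - tags
    - brief
    - description
    - vendor
    - category_path
    - category_name_text
    
    II. Changes to the /indexer/enrich-content endpoint
    1. Remove the language request parameter. The returned content now maps
       directly onto the index structure and needs no further processing by
       the caller, so no language has to be specified. This reduces coupling.
    2. Return three fields, enriched_attributes, enriched_tags, and qanchors,
       filled in from the original content.
    3. enriched_tags is new in this change and is distinct from the tags
       field: tags comes from the MySQL spu table, whereas enriched_tags is
       returned by this endpoint.
    
    III. The value of each specifications entry must also be translated, with both Chinese and English filled in:
    {
      "specifications": [
        {
          "sku_id": "sku-red-s",
          "name": "color",
          "value_keyword": "красный",
          "value_text": {
            "zh": "红色",
            "en": "red"
          }
        }
      ]
    }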
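
The specification shape above could be assembled roughly as follows. This is a minimal sketch, not the project's transformer code: `build_specification`, the `translate` callable, and the lookup-table stub are all hypothetical names introduced for illustration, and the real translation service is not shown in the commit.

```python
def build_specification(sku_id, name, raw_value, translate, target_langs=("zh", "en")):
    """Keep the raw value as the keyword and add one translated text per language."""
    return {
        "sku_id": sku_id,
        "name": name,
        # The original-language value is stored untouched for exact matching.
        "value_keyword": raw_value,
        # Each target language gets its own translated text entry.
        "value_text": {lang: translate(raw_value, lang) for lang in target_langs},
    }


# Hypothetical stand-in for a translation service: a fixed lookup table.
_FAKE_TRANSLATIONS = {("красный", "zh"): "红色", ("красный", "en"): "red"}


def fake_translate(text, lang):
    return _FAKE_TRANSLATIONS[(text, lang)]


spec = build_specification("sku-red-s", "color", "красный", fake_translate)
```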
    tangwang
     
  • tangwang
     
  • tangwang
     
  • tangwang
     

27 Mar, 2026

1 commit


21 Mar, 2026

1 commit


16 Mar, 2026

1 commit


10 Mar, 2026

1 commit


05 Mar, 2026

1 commit


02 Mar, 2026

1 commit

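  • - Add the indexer/process_products.analyze_products interface, which wraps the call to the DashScope LLM, supports multilingual output in zh/en/de/ru/fr, and returns structured fields such as anchor_text, tags, usage_scene, target_audience, season, key_attributes, material, and features. It can be used for script-based batch processing or called on demand during indexing.
    - Introduce _fill_llm_attributes in SPUDocumentTransformer: for each SPU and each language in the intersection of the tenant's index_languages and the supported languages, call analyze_products. LLM enrichment is enabled by default. On success, the doc is filled with qanchors.{lang} (query-style anchor text) and the nested semantic_attributes (lang/name/value) semantic dimensions. On failure, only a warn log is emitted and the step degrades gracefully without affecting the main indexing pipeline.
    - Extend the search_products.json mapping with a new nested field, semantic_attributes (lang/name/value), on product documents. It carries the variable dimensions extracted by the LLM (scene, audience, material, style, and so on) as generic triples, giving later filtering and faceted aggregation by semantic dimension a single structured carrier.
    - Write the design document indexer/ANCHORS_AND_SEMANTIC_ATTRIBUTES.md, which covers the field semantics of qanchors and semantic_attributes, the indexing and multilingual strategy, the integration with the suggestion builder, and the recommended usage in search filtering/faceting, to ease future maintenance and extension.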
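
The enrichment flow described above (language intersection, per-language call, warn-and-continue on failure) might look roughly like the sketch below. It is an assumption-laden illustration rather than the actual `_fill_llm_attributes`: the function signature, the `analyze_products(spu, lang)` call shape, the `spu["id"]` key, and the choice of which result keys become `semantic_attributes` are all hypothetical.

```python
import logging

logger = logging.getLogger(__name__)

# Languages the commit says the LLM interface supports.
SUPPORTED_LLM_LANGS = ("zh", "en", "de", "ru", "fr")
# Assumed subset of result keys that are stored as semantic dimensions.
SEMANTIC_DIMENSIONS = ("usage_scene", "target_audience", "season", "material", "features")


def fill_llm_attributes(doc, spu, index_languages, analyze_products):
    """Enrich doc in place; any per-language failure is logged and skipped."""
    for lang in index_languages:
        if lang not in SUPPORTED_LLM_LANGS:
            continue  # only the intersection of tenant and supported languages
        try:
            result = analyze_products(spu, lang)  # hypothetical call shape
        except Exception as exc:
            # Graceful degradation: warn and keep the main indexing path alive.
            logger.warning("LLM enrichment failed (spu=%s, lang=%s): %s",
                           spu.get("id"), lang, exc)
            continue
        anchor = result.get("anchor_text")
        if anchor:
            doc.setdefault("qanchors", {})[lang] = anchor
        for name in SEMANTIC_DIMENSIONS:
            value = result.get(name)
            if value:
                # Generic lang/name/value triple, matching the nested mapping.
                doc.setdefault("semantic_attributes", []).append(
                    {"lang": lang, "name": name, "value": value}
                )
    return doc
```

Passing `analyze_products` in as a parameter is a choice made here so the sketch can be exercised with a stub; the real transformer presumably imports it directly.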
    
    Made-with: Cursor
    tangwang
     

06 Jan, 2026

2 commits

  • tangwang
     
  • mappings/search_products.json: change the former title_zh/title_en/brief_zh/... fields into an object keyed by language ( /products/_doc/1 { "title": {"en":...} } )
    All analyzer languages are also preconfigured under these fields:
    arabic, armenian, basque, brazilian, bulgarian, catalan, chinese, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, italian, norwegian, persian, portuguese, romanian, russian, spanish, swedish, turkish, thai
    
    Implemented as type: object + properties, which supports both per-language ingestion and per-language analyzers.
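
A mapping fragment of the "type: object + properties" shape can be generated with a few lines of Python, as sketched below. The helper name and the pairing of language keys with analyzer names (for example `en` with `english`) are assumptions; the actual keys and analyzers are whatever mappings/search_products.json defines.

```python
def multilingual_text_field(analyzers_by_lang):
    """Build an object field with one analyzed text sub-field per language key."""
    return {
        "type": "object",
        "properties": {
            lang: {"type": "text", "analyzer": analyzer}
            for lang, analyzer in analyzers_by_lang.items()
        },
    }


# Hypothetical language-key to analyzer pairing, for illustration only.
title_mapping = multilingual_text_field(
    {"en": "english", "fr": "french", "de": "german"}
)
```

With this shape a document such as `{"title": {"en": "..."}}` is indexed under `title.en` and analyzed with that sub-field's analyzer.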
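    Index ingestion (full/incremental/transformer) has been updated accordingly:
    indexer/document_transformer.py: the output changes from title_zh/title_en/... to:
    title: {<primary_lang>: original text, en?: translation, zh?: translation}
    brief/description/vendor follow the same pattern
    category_path/category_name_text are also changed to language objects (so that the query side does not keep depending on the old fields)
    indexer/incremental_service.py: the embedding text is no longer read from title_en/title_zh but from the title object, preferring en, then zh, then any available language.
    The query side, configuration, and API/docs have been updated accordingly:
    search/es_query_builder.py: query fields are unified to dot paths: title.zh / title.en / vendor.zh / vendor.zh.keyword / category_name_text.zh, etc.
    config/config.yaml: field names in field boosts / indexes are updated to the new dot paths.
    API & formatter:
    api/result_formatter.py supports the new structure (and keeps a compatibility fallback for the old *_zh/_en fields).
    api/models.py and related docs/examples: vendor_zh.keyword and similar names have been updated to vendor.zh.keyword.
    Docs/scripts: all references to the old field names in docs/, README.md, and scripts/ have been bulk-replaced with the new structure.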
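
The embedding fallback order mentioned for indexer/incremental_service.py (en first, then zh, then any available language) can be sketched as below. The function name is hypothetical and the real service may handle empty or malformed values differently.

```python
def pick_embedding_text(title):
    """Return the text to embed from a language-keyed title object, or None."""
    if not isinstance(title, dict):
        return None
    # Preferred languages, in order.
    for lang in ("en", "zh"):
        text = title.get(lang)
        if text:
            return text
    # Otherwise fall back to the first non-empty value in any language.
    for text in title.values():
        if text:
            return text
    return None
```

For example, `pick_embedding_text({"ru": "красная машина", "zh": "红色汽车"})` returns the zh text because no en entry is present.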
    tangwang
     

26 Dec, 2025

1 commit


22 Dec, 2025

1 commit


19 Dec, 2025

1 commit


25 Nov, 2025

1 commit

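  • mappings/search_products.json - complete ES index configuration (settings + mappings)
    - based on docs/索引字段说明v2-mapping结构.md
    Simplify mapping_generator.py
    - remove all config dependencies
    - load the mapping directly from the JSON file via load_mapping()
    - keep the utility functions: create_index_if_not_exists, delete_index_if_exists, update_mapping
    Update the data ingestion script
    - scripts/ingest_shoplazza.py - remove the ConfigLoader dependency
    - use load_mapping() and DEFAULT_INDEX_NAME directly
    Update the indexer module
    - indexer/__init__.py - update exports
    - indexer/bulk_indexer.py - simplify IndexingPipeline and remove the config dependency
    Create query configuration constants
    - search/query_config.py - hard-coded field lists and configuration items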
    
    Usage
    Create the index:
        from indexer.mapping_generator import load_mapping, create_index_if_not_exists
        from utils.es_client import ESClient
    
        es_client = ESClient(hosts=["http://localhost:9200"])
        mapping = load_mapping()
        create_index_if_not_exists(es_client, "search_products", mapping)
    Ingest data:
        python scripts/ingest_shoplazza.py \
            --db-host localhost \
            --db-database saas \
            --db-username root \
            --db-password password \
            --tenant-id "1" \
            --es-host http://localhost:9200 \
            --recreate
    
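    Notes
    - Changing the mapping: edit mappings/search_products.json directly
    - Field mapping: hard-coded in spu_transformer.py and must be kept consistent with the mapping
    - config directory: kept but no longer used, can be cleaned up later
    - search module: still depends on config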
    tangwang