09 Apr, 2026

1 commit

  • category_taxonomy_profile
    
    - 原 analysis_kinds
      混用了“增强类型”(content/taxonomy)与“品类特定配置”,不利于扩展不同品类的
    taxonomy 分析(如 3C、家居等)
    - 新增 enrichment_scopes 参数:支持 generic(通用增强,产出
      qanchors/enriched_tags/enriched_attributes)和
    category_taxonomy(品类增强,产出 enriched_taxonomy_attributes)
    - 新增 category_taxonomy_profile 参数:指定品类增强使用哪套
      profile(当前内置 apparel),每套 profile 包含独立的
    prompt、输出列定义、解析规则及缓存版本
    - 保留 analysis_kinds 作为兼容别名,避免破坏现有调用方
    - 重构内部 taxonomy 分析为 profile registry 模式:新增
      _get_taxonomy_schema(profile_name) 函数,根据 profile 动态返回对应的
    AnalysisSchema
    - 缓存 key 现在按“分析类型 + profile + schema 指纹 +
      输入字段哈希”隔离,确保不同品类、不同 prompt 版本自动失效
    - 更新 API 文档及微服务接口文档,明确新参数语义与使用示例
    
    技术细节:
    - 修改入口:api/routes/indexer.py 中 enrich-content
      端点,解析新参数并向下传递
    - 核心逻辑:indexer/product_enrich.py 中 enrich_products_batch 增加
      profile 参数;_process_batch_for_schema 根据 scope 和 profile 动态获取
    schema
    - 兼容层:若请求同时提供 analysis_kinds,则映射为
      enrichment_scopes(content→generic,taxonomy→category_taxonomy),category_taxonomy_profile
    默认为 "apparel"
    - 测试覆盖:新增 enrichment_scopes 组合、profile 切换及兼容模式测试
    tangwang
     

08 Apr, 2026

1 commit

  • Previously, both `b` and `k1` were set to `0.0`. The original intention
    was to avoid two common issues in e-commerce search relevance:
    
    1. Over-penalizing longer product titles
       In product search, a shorter title should not automatically rank
    higher just because BM25 favors shorter fields. For example, for a query
    like “遥控车”, a product whose title is simply “遥控车” is not
    necessarily a better candidate than a product with a slightly longer but
    more descriptive title. In practice, extremely short titles may even
    indicate lower-quality catalog data.
    
    2. Over-rewarding repeated occurrences of the same term
       For longer queries such as “遥控喷雾翻滚多功能车玩具车”, the default
    BM25 behavior may give too much weight to a term that appears multiple
    times (for example “遥控”), even when other important query terms such
    as “喷雾” or “翻滚” are missing. This can cause products with repeated
    partial matches to outrank products that actually cover more of the user
    intent.
    
    Setting both parameters to zero was an intentional way to suppress
    length normalization and term-frequency amplification. However, after
    introducing a `combined_fields` query, this configuration becomes too
    aggressive. Since `combined_fields` scores multiple fields as a unified
    relevance signal, completely disabling both effects may also remove
    useful ranking information, especially when we still want documents
    matching more query terms across fields to be distinguishable from
    weaker matches.
    
    This update therefore relaxes the previous setting and reintroduces a
    controlled amount of BM25 normalization/scoring behavior. The goal is to
    keep the original intent — avoiding short-title bias and excessive
    repeated-term gain — while allowing the combined query to better
    preserve meaningful relevance differences across candidates.
    
    Expected effect:
    - reduce the bias toward unnaturally short product titles
    - limit score inflation caused by repeated occurrences of the same term
    - improve ranking stability for `combined_fields` queries
    - better reward candidates that cover more of the overall query intent,
      instead of those that only repeat a subset of terms
    tangwang