Compare View

Commits (19)
  • Previously, both `b` and `k1` were set to `0.0`. The original intention
    was to avoid two common issues in e-commerce search relevance:
    
    1. Over-penalizing longer product titles
   In product search, a shorter title should not automatically rank
    higher just because BM25 favors shorter fields. For example, for a query
    like “遥控车” (remote-control car), a product whose title is simply
    “遥控车” is not necessarily a better candidate than a product with a
    slightly longer but more descriptive title. In practice, extremely short
    titles may even indicate lower-quality catalog data.
    
    2. Over-rewarding repeated occurrences of the same term
       For longer queries such as “遥控喷雾翻滚多功能车玩具车” (roughly, a
    remote-control multi-function toy car with spray and flip features), the
    default BM25 behavior may give too much weight to a term that appears
    multiple times (for example “遥控”, remote control), even when other
    important query terms such as “喷雾” (spray) or “翻滚” (flip) are
    missing. This can cause products with repeated partial matches to
    outrank products that actually cover more of the user intent.
    
    Setting both parameters to zero was an intentional way to suppress
    length normalization and term-frequency amplification. However, after
    introducing a `combined_fields` query, this configuration becomes too
    aggressive. Since `combined_fields` scores multiple fields as a unified
    relevance signal, completely disabling both effects may also remove
    useful ranking information, especially when we still want documents
    matching more query terms across fields to be distinguishable from
    weaker matches.
    
    This update therefore relaxes the previous setting and reintroduces a
    controlled amount of BM25 normalization/scoring behavior. The goal is to
    keep the original intent — avoiding short-title bias and excessive
    repeated-term gain — while allowing the combined query to better
    preserve meaningful relevance differences across candidates.
    
    Expected effect:
    - reduce the bias toward unnaturally short product titles
    - limit score inflation caused by repeated occurrences of the same term
    - improve ranking stability for `combined_fields` queries
    - better reward candidates that cover more of the overall query intent,
      instead of those that only repeat a subset of terms
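    As a rough sketch of what “a controlled amount of BM25
    normalization/scoring behavior” could look like, the index settings
    below re-enable small non-zero `k1`/`b` values via a custom similarity.
    The function name and the specific values are illustrative assumptions;
    the actual values shipped in this change are not stated here.

    ```python
    def build_title_similarity_settings(k1: float = 0.3, b: float = 0.25) -> dict:
        """Sketch: ES index settings with a custom BM25 similarity for titles.

        Small k1 keeps the reward for repeated terms limited; small b keeps
        the penalty for longer titles mild, instead of disabling both (0.0).
        """
        return {
            "settings": {
                "index": {
                    "similarity": {
                        "title_bm25": {
                            "type": "BM25",
                            "k1": k1,  # term-frequency saturation
                            "b": b,    # length normalization strength
                        }
                    }
                }
            }
        }

    settings = build_title_similarity_settings()
    ```

    Fields scored by `combined_fields` would then reference `title_bm25`
    instead of the default similarity.
    
    
    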
    tangwang
     
  • tangwang
     
  • tangwang
     
  • tangwang
     
  • tangwang
     
  • tangwang
     
  • Field generation

    - Added taxonomy attribute enrichment, following the same field
      structure and processing logic as enriched_attributes; only the prompt
      and parsing dimensions differ
    - Introduced the AnalysisSchema abstraction so content enrichment
      (content) and taxonomy enrichment (taxonomy) share batching, caching,
      prompt building, Markdown parsing, and normalization
    - Refactored the existing enrichment pipeline in product_enrich.py,
      extracting shared logic into functions such as
      _process_batch_for_schema and _parse_markdown_to_attributes to
      eliminate duplication
    - Added the taxonomy prompt template (TAXONOMY_ANALYSIS_PROMPT) and
      Markdown header definition (TAXONOMY_HEADERS) to
      product_enrich_prompts.py
    - Fixed the Markdown parser's handling of empty cells: the original
      implementation skipped empty cells and misaligned columns; it now
      preserves empty values so sparse taxonomy attribute columns align
      correctly
    - Updated build_index_content_fields in document_transformer.py to write
      enriched_taxonomy_attributes (zh/en) into the final index document
    - Adjusted related unit tests (test_product_enrich_partial_mode.py etc.)
      to cover the new field paths; all pass (14 passed)
    
    Technical details:
    - AnalysisSchema carries metadata such as schema_name, prompt_template,
      headers, and field_name_prefix
    - Cache keys distinguish content vs. taxonomy:
      `enrich:{schema_name}:{product_id}`, preventing cache pollution
    - Taxonomy parsing uses the same nested structure as
      enriched_attributes: `{"attribute_key": "value"}`, with multi-row
      table support
    - Batch size and retry logic remain identical to the existing content
      enrichment
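    The shared schema abstraction can be pictured roughly as follows. Field
    names follow the commit message; the example values and the `cache_key`
    helper placement are illustrative assumptions, not the exact code:

    ```python
    from dataclasses import dataclass
    from typing import List


    @dataclass(frozen=True)
    class AnalysisSchema:
        """Metadata shared by content and taxonomy enrichment (sketch)."""
        schema_name: str        # "content" or "taxonomy"
        prompt_template: str    # LLM prompt, e.g. with a {title} placeholder
        headers: List[str]      # expected Markdown table headers
        field_name_prefix: str  # prefix for the generated output fields

        def cache_key(self, product_id: str) -> str:
            # Keys distinguish content vs. taxonomy, preventing pollution.
            return f"enrich:{self.schema_name}:{product_id}"


    taxonomy_schema = AnalysisSchema(
        schema_name="taxonomy",
        prompt_template="Analyze the product: {title}",
        headers=["attribute_key", "value"],
        field_name_prefix="enriched_taxonomy_",
    )
    ```
    
    
    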
    tangwang
     
  • - The `/indexer/enrich-content` route now returns
      `enriched_taxonomy_attributes` alongside `enriched_attributes`
    - Added an optional request parameter `analysis_kinds` (default
      `["content", "taxonomy"]`) so callers can select analysis types on
      demand, leaving room for future extension and cost control
    - Refactored the caching strategy: `content` and `taxonomy` caches are
      fully isolated, and the cache key includes the prompt template,
      headers, and output field definitions (i.e. a schema fingerprint), so
      changes to prompts or parsing rules invalidate the cache automatically
    - The cache key depends only on the fields that actually feed the LLM
      (`title`, `brief`, `description`); `image_url`, `tenant_id`, and
      `spu_id` no longer pollute the key, improving hit rates
    - Updated the API docs
      (`docs/搜索API对接指南-05-索引接口(Indexer).md`) to describe the new
      parameter and returned fields
    
    Technical details:
    - Route layer: the enrich-content endpoint in `api/routes/indexer.py`
      now explicitly includes the `enriched_taxonomy_attributes` field
      returned by `product_enrich.enrich_products_batch` in the HTTP
      response body
    - The `analysis_kinds` parameter is passed through to the underlying
      `enrich_products_batch`, so one analysis type can be skipped on demand
      (e.g. fewer LLM calls when only taxonomy is needed)
    - Cache fingerprints are computed in `_get_cache_key` in
      `product_enrich.py`, generated independently per `AnalysisSchema`; the
      version is included implicitly via `schema.version` or a hash of the
      prompt content
    - Test coverage: added `analysis_kinds` combination scenarios and cache
      isolation tests
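    A minimal sketch of such a fingerprinted cache key, assuming SHA-256
    hashing and the field names above (the real `_get_cache_key` may differ
    in layout and truncation):

    ```python
    import hashlib
    import json


    def make_cache_key(schema_name, prompt_template, headers, item):
        """Sketch: cache key from schema fingerprint + LLM-input fields only."""
        # Schema fingerprint: changing the prompt or headers invalidates it.
        fingerprint = hashlib.sha256(
            json.dumps({"prompt": prompt_template, "headers": headers},
                       sort_keys=True).encode("utf-8")
        ).hexdigest()[:12]
        # Only fields that actually feed the LLM participate in the key.
        payload = {k: item.get(k, "") for k in ("title", "brief", "description")}
        content_hash = hashlib.sha256(
            json.dumps(payload, sort_keys=True, ensure_ascii=False).encode("utf-8")
        ).hexdigest()[:16]
        return f"enrich:{schema_name}:{fingerprint}:{content_hash}"
    ```

    With this shape, `tenant_id` and `image_url` never reach the key, so two
    tenants indexing the same title share one cache entry.
    
    
    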
    tangwang
     
  • category_taxonomy_profile
    
    - The original analysis_kinds conflated "enrichment type"
      (content/taxonomy) with "category-specific configuration", making it
      hard to extend taxonomy analysis to other categories (3C, home, etc.)
    - Added an enrichment_scopes parameter: supports generic (general
      enrichment, producing qanchors/enriched_tags/enriched_attributes) and
      category_taxonomy (category enrichment, producing
      enriched_taxonomy_attributes)
    - Added a category_taxonomy_profile parameter: selects which profile
      category enrichment uses (currently apparel is built in); each profile
      has its own prompt, output column definitions, parsing rules, and
      cache version
    - Kept analysis_kinds as a compatibility alias to avoid breaking
      existing callers
    - Refactored internal taxonomy analysis into a profile registry pattern:
      a new _get_taxonomy_schema(profile_name) function returns the matching
      AnalysisSchema per profile
    - Cache keys are now isolated by "analysis type + profile + schema
      fingerprint + input-field hash", so different categories and prompt
      versions invalidate automatically
    - Updated the API docs and microservice interface docs with the new
      parameter semantics and usage examples
    
    Technical details:
    - Entry point: the enrich-content endpoint in api/routes/indexer.py
      parses the new parameters and passes them down
    - Core logic: enrich_products_batch in indexer/product_enrich.py gains a
      profile parameter; _process_batch_for_schema resolves the schema
      dynamically from scope and profile
    - Compatibility layer: if a request also supplies analysis_kinds, it is
      mapped to enrichment_scopes (content→generic,
      taxonomy→category_taxonomy), with category_taxonomy_profile defaulting
      to "apparel"
    - Test coverage: added enrichment_scopes combinations, profile
      switching, and compatibility-mode tests
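    The compatibility mapping can be sketched as a small resolver (the shape
    mirrors `resolved_enrichment_scopes` in the diff below; treat it as an
    illustration rather than the exact code):

    ```python
    def resolve_enrichment_scopes(enrichment_scopes=None, analysis_kinds=None):
        """Map the deprecated analysis_kinds alias onto enrichment_scopes."""
        if enrichment_scopes:
            return list(enrichment_scopes)
        if analysis_kinds:
            mapping = {"content": "generic", "taxonomy": "category_taxonomy"}
            return [mapping[kind] for kind in analysis_kinds]
        # Default: run both generic and category enrichment.
        return ["generic", "category_taxonomy"]
    ```
    
    
    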
    tangwang
     
  • This iteration substantially refactors the content enrichment module of
    the retrieval system, extending the previously hard-coded apparel-only
    category to every category defined in taxonomy.md, while restructuring
    the code to lower the cost of adding new categories. The core design is
    a profile registry: batches are grouped by category profile, with an
    explicit split between bilingual (zh+en) and English-only (en) output
    strategies.
    
    [Changes]
    
    1. Expanded category coverage
       - Newly supported categories: 3c, bags, pet_supplies, electronics,
         outdoor, home_appliances, home_living, wigs, beauty, accessories,
         toys, shoes, sports, others
       - All new categories return only en fields at the taxonomy output
         stage, avoiding multilingual field bloat
       - Apparel keeps bilingual output (zh + en) for backward business
         compatibility
    
    2. Core refactoring
       - `indexer/product_enrich.py`
         - Added the `TAXONOMY_PROFILES` registry, defining each category's
           output languages, prompt mapping, and taxonomy field set in a
           data-driven way
         - Rewrote `_enrich_taxonomy_batch`: it now calls the LLM in batches
           grouped by profile, instead of a separate branch per category
         - Introduced `_infer_profile_from_category()`, which infers the
           profile from the SPU's category field (used on the internal
           indexing path, fixing mixed catalogs falling back to apparel by
           default)
       - `indexer/product_enrich_prompts.py`
         - Refactored the single apparel prompt into a `PROMPT_TEMPLATES`
           dict keyed by profile
         - All non-apparel categories share one trimmed prompt template that
           only requests en fields
       - `indexer/document_transformer.py`
         - Passes category information when building enrichment requests, so
           downstream can route by profile
         - Adjusted `_build_enrich_batch` so batch requests support mixed
           categories and group correctly
       - `indexer/indexer.py` (API layer)
         - The `/indexer/enrich-content` request model gains an optional
           `category_profile` field so callers can specify the category
           explicitly; when omitted, the server infers it
         - Updated parameter validation and error handling, adding support
           for fallback categories such as `others`
    
    3. Documentation updates
       - `docs/搜索API对接指南-05-索引接口(Indexer).md`: documented the
         category profile parameter, noting that non-apparel taxonomy
         returns en fields only
       - `docs/搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md`:
         updated the enrichment microservice examples to reflect
         multi-category grouped batching
       - `taxonomy.md`: added each category's field list, making en the sole
         output for all non-apparel categories
    
    [Technical details]
    
    - **Registry design**:
      ```python
      TAXONOMY_PROFILES = {
          "apparel": {"lang": ["zh", "en"], "prompt_key": "apparel",
                      "fields": [...]},
          "3c": {"lang": ["en"], "prompt_key": "default", "fields": [...]},
          # ...
      }
      ```
      Adding a category only requires one registry entry plus a matching
      `prompt_key` in `PROMPT_TEMPLATES`; no control-flow changes.
    
    - **Per-profile grouped batching**:
      - Before: all products were mixed together and shared the apparel
        prompt, so non-apparel products were filled incorrectly.
      - After: `_enrich_taxonomy_batch` first groups products by profile,
        builds an LLM request per group, then merges the responses back in
        the original order. Group granularity is configurable, avoiding
        excessive request overhead from tiny groups.
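    The group-then-merge step above can be sketched as follows (helper names
    are hypothetical; the real `_enrich_taxonomy_batch` wraps this around
    LLM calls):

    ```python
    from collections import defaultdict


    def group_by_profile(products, get_profile):
        """Group products per taxonomy profile, remembering their original
        positions so per-group results can be merged back in input order."""
        groups = defaultdict(list)
        for pos, product in enumerate(products):
            groups[get_profile(product)].append((pos, product))
        return groups


    def merge_in_order(groups, enrich_group):
        """Call enrich_group(profile, items) per group, restore input order."""
        merged = {}
        for profile, members in groups.items():
            positions = [pos for pos, _ in members]
            results = enrich_group(profile, [p for _, p in members])
            merged.update(zip(positions, results))
        return [merged[i] for i in sorted(merged)]
    ```
    
    
    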
    
    - **Automatic category inference**:
      - For internal indexing (when the enrichment API is not called
        explicitly), `_infer_profile_from_category` parses the SPU's
        `category_l1/l2/l3` fields and maps them to the best-matching
        profile. Mapping is keyword-based (e.g. "手机" (phone) -> "3c",
        "狗粮" (dog food) -> "pet_supplies"); unmatched categories fall back
        to `apparel` for a smooth transition.
    
    - **Output field trimming**:
      - Because the `enriched_taxonomy_attributes.value` field in the
        Elasticsearch mapping stores a single value (not per-language),
        non-apparel LLM output is written directly into that field; apparel
        uses the dynamic `value.zh` and `value.en` sub-fields. The
        `_apply_lang_output` function handles both cases uniformly.
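    A possible shape for that uniform handling, assuming the function name
    from the commit and treating the exact signature as an assumption:

    ```python
    def apply_lang_output(value_zh, value_en, langs):
        """Sketch of _apply_lang_output: shape a taxonomy attribute value per
        the profile's language list. Bilingual profiles emit a zh/en object;
        en-only profiles emit the plain value for the single-valued ES field."""
        if langs == ["en"]:
            return value_en
        return {"zh": value_zh, "en": value_en}
    ```
    
    
    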
    
    - **Code size and maintainability**:
      - Total line count grows slightly (~+180 lines) due to the many new
        category definitions, but the number of conditional branches drops
        from 5 to 1 (the profile lookup). The average cost of a new category
        is 3 registry lines + a ~10-line prompt template, with no change to
        the core enrichment loop.
    
    [Affected files]
    - `indexer/product_enrich.py`
    - `indexer/product_enrich_prompts.py`
    - `indexer/document_transformer.py`
    - `indexer/indexer.py`
    - `docs/搜索API对接指南-05-索引接口(Indexer).md`
    - `docs/搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md`
    - `taxonomy.md`
    - `tests/test_product_enrich_partial_mode.py` (adapted for multi-profile
      test cases)
    - `tests/test_llm_enrichment_batch_fill.py`
    - `tests/test_process_products_batching.py`
    
    [Verification]
    - Ran unit and integration tests: `pytest
      tests/test_product_enrich_partial_mode.py
      tests/test_llm_enrichment_batch_fill.py
      tests/test_process_products_batching.py
      tests/ci/test_service_api_contracts.py`; all pass (52 passed)
    - Manually verified the mixed-catalog scenario: submitting apparel and
      3c products together, the enrichment response returns bilingual fields
      for apparel and en-only for 3c, with taxonomy fields filled correctly.
    - Compile check: `py_compile` reports no syntax errors in any modified
      module.
    
    [Notes]
    - This refactor does not change the behavior of the existing apparel
      category; the API stays backward compatible (requests without a
      profile are still treated as apparel).
    - To add bilingual support for another category later, only the `lang`
      list in the registry and a prompt template need to change; no other
      logic is touched.
    tangwang
     
  • 2. Removed the automatic taxonomy profile inference from
       build_index_content_fields(); it now only accepts an explicitly
       passed category_taxonomy_profile
    3. All taxonomy profiles now output zh/en; the per-industry
       language-switching logic is removed
    tangwang
     
  • Background:
    - The scripts/ directory mixed service startup, data conversion,
      performance/load testing, ad-hoc scripts, and historical backup
      directories
    - It carried a lot of leftover intermediate-iteration material, hurting
      maintainability and onboarding
    - Service orchestration has stabilized as the `service_ctl up all` set:
      tei / cnclip / embedding / embedding-image / translator / reranker /
      backend / indexer / frontend / eval-web, with no default slot reserved
      for reranker-fine
    
    Changes:
    1. The root scripts/ now holds only runtime, ops, environment, and
       data-processing scripts, with a new scripts/README.md
    2. Performance/load/tuning scripts moved to benchmarks/, with
       benchmarks/README.md updated accordingly
    3. Manual trial scripts moved to tests/manual/, with
       tests/manual/README.md updated accordingly
    4. Removed clearly obsolete content:
       - scripts/indexer__old_2025_11/
       - scripts/start.sh
       - scripts/install_server_deps.sh
    5. Fixed paths and stale descriptions in:
       - the root README.md
       - performance report docs
       - reranker/translation module docs
    
    Technical details:
    - Why performance tests do not live under the regular tests/: they
      depend on real services, GPUs, models, and environmental noise, so
      they are unsuitable as a stable regression gate; benchmarks/ fits
      their purpose better
    - tests/manual/ holds only trial scripts that require manually started
      dependencies and human inspection of results
    - All migrated Python scripts pass py_compile syntax checks
    - All migrated shell scripts pass bash -n syntax checks
    
    Verification:
    - py_compile: passed
    - bash -n: passed
    tangwang
     
  •   - Data conversion moved under scripts/data_import/README.md
      - Diagnostics and inspection moved under scripts/inspect/README.md
      - Ops helpers moved under scripts/ops/README.md
      - The frontend helper service moved to
        scripts/frontend/frontend_server.py
      - Translation model download moved to
        scripts/translation/download_translation_models.py
      - The ad-hoc image embedding backfill script consolidated into
        scripts/maintenance/embed_tenant_image_urls.py
      - The Redis monitoring script merged into redis/, now
        scripts/redis/monitor_eviction.py
    
      All real call sites were also switched to the new locations:
    
      - scripts/start_frontend.sh
      - scripts/start_cnclip_service.sh
      - scripts/service_ctl.sh
      - scripts/setup_translator_venv.sh
      - scripts/README.md
    
      Paths referencing these scripts in the docs were fixed as well, mainly
      docs/QUICKSTART.md and translation/README.md.
    tangwang
     
  • tangwang
     
  • 2. Added service_enabled_by_config(): if reranker, reranker-fine, or
       translator is disabled in config, `run.sh all` skips starting that
       service
    tangwang
     
  • tangwang
     
  • tangwang
     
  •  Background and problem
    - The existing coarse/fine ranking relies on `knn_query` and `image_knn_query` scores, but both come from ANN recall; not every document entering the rerank_window (160) is hit by both the text and image vector recall paths, so some documents score 0, destabilizing the fusion formula.
    - Simply enlarging the ANN k cannot guarantee that documents brought in by lexical recall also carry both vector scores; issuing a second query, or pulling vectors back for local computation, adds overhead and complexity.
    
     Solution
    Use the ES rescore mechanism: within the first search's `window_size`, run an exact vector script_score on each document and attach the scores as named queries in `matched_queries`, for the subsequent coarse/rerank stages to use preferentially.
    
    **Design decisions**:
    - **Backfill scores only, never reorder**: rescore uses `score_mode: total` with `rescore_query_weight: 0.0`, so the original `_score` is unchanged; this avoids disturbing the existing ranking logic and minimizes risk.
    - **Named exact scores**: `exact_text_knn_query` and `exact_image_knn_query`, easy for clients to identify and fall back from.
    - **Configurable**: controlled by the `exact_knn_rescore_enabled` switch and the `exact_knn_rescore_window` window size, default 160.
    
     Technical implementation
    
     1. Config extension (`config/config.yaml`, `config/loader.py`)
    ```yaml
    exact_knn_rescore_enabled: true
    exact_knn_rescore_window: 160
    ```
    The new config entries are injected into `RerankConfig`.
    
     2. Searcher builds the rescore query (`search/searcher.py`)
    - In `_build_es_search_request`, when `enable_rerank=True` and the switch is on, build the rescore object:
      - `window_size` = `exact_knn_rescore_window`
      - `query` is a `bool` query embedding two `script_score` sub-queries that compute the text and image dot-product similarity:
        ```painless
        // exact_text_knn_query
        (dotProduct(params.query_vector, 'title_embedding') + 1.0) / 2.0
        // exact_image_knn_query
        (dotProduct(params.image_query_vector, 'image_embedding.vector') + 1.0) / 2.0
        ```
      - Each `script_score` sets `_name` to the corresponding named query.
    - Note: the script scores in the current implementation are **not yet multiplied by knn_text_boost / knn_image_boost**; aligning them with the original ANN score scale is a follow-up item.
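    The resulting rescore clause has roughly the shape below. This is a
    sketch built from the description above, not the exact request body; in
    particular the inner `match_all` wrapper and the `_name` placement are
    assumptions about how the clause is assembled.

    ```python
    def build_exact_knn_rescore(query_vector, image_query_vector, window_size=160):
        """Sketch: rescore clause with two named script_score sub-queries.
        score_mode 'total' + rescore_query_weight 0.0 leaves _score untouched,
        so the exact scores only surface via matched_queries."""
        def script_clause(name, source, params):
            return {
                "script_score": {
                    "query": {"match_all": {}},
                    "script": {"source": source, "params": params},
                    "_name": name,
                }
            }

        return {
            "window_size": window_size,
            "query": {
                "score_mode": "total",
                "query_weight": 1.0,
                "rescore_query_weight": 0.0,  # backfill only, never reorder
                "rescore_query": {
                    "bool": {
                        "should": [
                            script_clause(
                                "exact_text_knn_query",
                                "(dotProduct(params.query_vector, 'title_embedding') + 1.0) / 2.0",
                                {"query_vector": query_vector},
                            ),
                            script_clause(
                                "exact_image_knn_query",
                                "(dotProduct(params.image_query_vector, 'image_embedding.vector') + 1.0) / 2.0",
                                {"image_query_vector": image_query_vector},
                            ),
                        ]
                    }
                },
            },
        }
    ```
    
    
    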
    
     3. RerankClient prefers the exact scores (`search/rerank_client.py`)
    - `_extract_coarse_signals` reads the `exact_text_knn_query` and `exact_image_knn_query` scores from each document's `matched_queries`.
    - If present and valid, they are used as `text_knn_score` / `image_knn_score`, and `text_knn_source='exact_text_knn_query'` is recorded.
    - If absent, it falls back to the original `knn_query` / `image_knn_query` (ANN scores).
    - The original ANN scores are also kept as `approx_text_knn_score` / `approx_image_knn_score` for debugging comparison.
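    The exact-first, ANN-fallback read can be sketched as follows, assuming
    `matched_queries` arrives as a name→score mapping (as with
    `include_named_queries_score`); the helper name is illustrative:

    ```python
    def extract_knn_scores(matched_queries):
        """Sketch: prefer exact rescore scores, fall back to ANN scores."""
        signals = {}
        for kind, exact_name, approx_name in (
            ("text", "exact_text_knn_query", "knn_query"),
            ("image", "exact_image_knn_query", "image_knn_query"),
        ):
            exact_score = matched_queries.get(exact_name)
            if exact_score is not None:
                signals[f"{kind}_knn_score"] = exact_score
                signals[f"{kind}_knn_source"] = exact_name
            else:
                signals[f"{kind}_knn_score"] = matched_queries.get(approx_name, 0.0)
                signals[f"{kind}_knn_source"] = approx_name
            # keep the ANN score around for debugging comparison
            signals[f"approx_{kind}_knn_score"] = matched_queries.get(approx_name, 0.0)
        return signals
    ```
    
    
    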
    
     4. Debug info
    - `debug_info.per_result[*].ranking_funnel.coarse_rank.signals` now carries the exact scores, fallback scores, and source markers, making coverage and score distributions observable online.
    
     Verification
    - Unit tests `tests/test_rerank_client.py` and `tests/test_search_rerank_window.py` pass, covering exact-score priority, config parsing, and the ES request body structure.
    - Sampling real online queries (6 queries, top160) shows:
      - **Exact coverage reaches 100%** (both text and image scored), fixing the partial gaps in the original ANN scores.
      - However, exact and ANN scores differ in magnitude (median ANN/exact ratio about 4.1x) because the exact script omits the boost factors.
    - Current ranking impact: coarse top10 overlap drops as low as 1/10, with a maximum rank drift above 100.
    
     Follow-ups
    1. Align the exact and ANN score scales: multiply `knn_text_boost` / `knn_image_boost` into the script_score, plus an extra 1.4x for long queries.
    2. Re-evaluate top10 overlap and drift; if they converge, the coarse fusion formula can migrate wholesale into the ES rescore stage.
    3. This version keeps the safe "backfill only, never reorder" policy and already fixes the core missing-score problem.
    
     Files touched
    - `config/config.yaml`
    - `config/loader.py`
    - `search/searcher.py`
    - `search/rerank_client.py`
    - `tests/test_rerank_client.py`
    - `tests/test_search_rerank_window.py`
    tangwang
     
  •  Changes
    
    1. **New config entries** (`config/config.yaml`)
       - `exact_knn_rescore_enabled`: enables exact vector rescoring, default true
       - `exact_knn_rescore_window`: rescore window size, default 160 (decoupled from rerank_window, independently configurable)
    
    2. **ES query layer** (`search/searcher.py`)
       - The first ES search injects a rescore phase over the window_size documents when enabled
       - The rescore_query contains two named script_score clauses:
         - `exact_text_knn_query`: exact dot product against the text vector
         - `exact_image_knn_query`: exact dot product against the image vector
       - Currently uses `score_mode=total` with `rescore_query_weight=0.0`: **backfill scores without reordering**; the exact scores appear only in `matched_queries`
    
    3. **Unified vector-score boost logic** (`search/es_query_builder.py`)
       - New `_get_knn_plan()` method centralizes the boost rules for text/image KNN
       - Long queries (token count above a threshold) get an extra 1.4x text boost
       - Exact rescore and ANN recall **share the same boost rules**, keeping score scales consistent
       - The existing ANN query construction migrated to this unified entry point
    
    4. **Fusion-stage score priority** (`search/rerank_client.py`)
       - `_build_hit_signal_bundle()` now reads vector scores in one place
       - It prefers `exact_text_knn_query` / `exact_image_knn_query` from `matched_queries`
       - If absent, it falls back to the original `knn_query` / `image_knn_query` (ANN scores)
       - Covers the coarse_rank, fine_rank, and rerank stages, avoiding duplicated patches
    
    5. **Test coverage**
       - `tests/test_es_query_builder.py`: verifies ANN and exact share the same boost rules
       - `tests/test_search_rerank_window.py`: verifies the rescore window and named queries are injected correctly
       - `tests/test_rerank_client.py`: verifies exact-first with ANN fallback
    
     Technical details
    
    - **Exact vector scoring script** (Painless)
      ```painless
      // text: (dotProduct + 1.0) / 2.0
      (dotProduct(params.query_vector, 'title_embedding') + 1.0) / 2.0
      // image is analogous, over the 'image_embedding.vector' field
      ```
      The result is multiplied by the unified boost (from the `knn_text_boost` / `knn_image_boost` config plus the long-query amplification factor).
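    A minimal sketch of the unified boost plan, using the values from the
    config example below; the function name follows `_get_knn_plan`, while
    the signature and return shape are assumptions:

    ```python
    def get_knn_plan(query_tokens,
                     knn_text_boost=4.0,
                     knn_image_boost=4.0,
                     long_query_token_threshold=8,
                     long_query_text_boost_factor=1.4):
        """Sketch: one boost plan shared by ANN recall and exact rescore."""
        text_boost = knn_text_boost
        if len(query_tokens) > long_query_token_threshold:
            # Long queries get extra text-side gain.
            text_boost *= long_query_text_boost_factor
        return {"text_boost": text_boost, "image_boost": knn_image_boost}
    ```

    Because both the ANN clauses and the exact rescore scripts multiply by
    the same plan, their scores stay on one scale.
    
    
    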
    
    - **Named query retention**
      - The main query already sets `include_named_queries_score: true`
      - Named script scores from the rescore phase merge into each hit's `matched_queries`
      - `_extract_named_score()` extracts them by name, identical to how the original ANN scores are read
    
    - **Performance impact** (top160, 6 real queries, average of 3 rounds after warm-up)
      - `elasticsearch_search_primary` latency: 124.71ms → 136.60ms (+11.89ms, +9.53%)
      - `total_search` is dominated by jitter from other components and is not a primary reference
      - The overhead is acceptable; no timeouts or resource bottlenecks observed
    
     Config example
    
    ```yaml
    search:
      exact_knn_rescore_enabled: true
      exact_knn_rescore_window: 160
      knn_text_boost: 4.0
      knn_image_boost: 4.0
      long_query_token_threshold: 8
      long_query_text_boost_factor: 1.4
    ```
    
     Known issues and follow-ups
    
    - Tuning experiments show that with exact rescore enabled, the primary metric on some queries (strong type constraints plus many similar styles/colors) drops about 0.031 versus the baseline (exact=false): 0.6009 → 0.5697
    - Root cause: exact turns KNN from a sparse auxiliary signal into a dense ranking factor, changing the coarse-stage ranking semantics; tuning the existing `knn_bias/exponent` alone cannot fully recover
    - Next iteration: **do not force exact in the coarse stage**, preferring exact only in fine/rerank; or have coarse use an "ANN first, exact only fills gaps" policy, then re-evaluate
    
     Related files
    
    - `config/config.yaml`
    - `search/searcher.py`
    - `search/es_query_builder.py`
    - `search/rerank_client.py`
    - `tests/test_es_query_builder.py`
    - `tests/test_search_rerank_window.py`
    - `tests/test_rerank_client.py`
    - `scripts/evaluation/exact_rescore_coarse_tuning_round2.json` (tuning experiment log)
    tangwang
     
Showing 137 changed files
... ... @@ -4,6 +4,7 @@
4 4 ES_HOST=http://localhost:9200
5 5 ES_USERNAME=saas
6 6 ES_PASSWORD=4hOaLaf41y2VuI8y
  7 +ES_AUTH="${ES_USERNAME}:${ES_PASSWORD}"
7 8  
8 9 # Redis Configuration (Optional) - AI 生产 10.200.16.14:6479
9 10 REDIS_HOST=10.200.16.14
... ...
.env.example
... ... @@ -8,6 +8,7 @@
8 8 ES_HOST=http://localhost:9200
9 9 ES_USERNAME=saas
10 10 ES_PASSWORD=
  11 +ES_AUTH="${ES_USERNAME}:${ES_PASSWORD}"
11 12  
12 13 # Redis (生产默认 10.200.16.14:6479,密码见 docs/QUICKSTART.md §1.6)
13 14 REDIS_HOST=10.200.16.14
... ...
CLAUDE.md
... ... @@ -77,9 +77,11 @@ source activate.sh
77 77 # Generate test data (Tenant1 Mock + Tenant2 CSV)
78 78 ./scripts/mock_data.sh
79 79  
80   -# Ingest data to Elasticsearch
81   -./scripts/ingest.sh <tenant_id> [recreate] # e.g., ./scripts/ingest.sh 1 true
82   -python main.py ingest data.csv --limit 1000 --batch-size 50
  80 +# Create tenant index structure
  81 +./scripts/create_tenant_index.sh <tenant_id>
  82 +
  83 +# Build / refresh suggestion index
  84 +./scripts/build_suggestions.sh <tenant_id> --mode incremental
83 85 ```
84 86  
85 87 ### Running Services
... ... @@ -100,10 +102,10 @@ python main.py serve --host 0.0.0.0 --port 6002 --reload
100 102 # Run all tests
101 103 pytest tests/
102 104  
103   -# Run specific test types
104   -pytest tests/unit/ # Unit tests
105   -pytest tests/integration/ # Integration tests
106   -pytest -m "api" # API tests only
  105 +# Run focused regression sets
  106 +python -m pytest tests/ci -q
  107 +pytest tests/test_rerank_client.py
  108 +pytest tests/test_query_parser_mixed_language.py
107 109  
108 110 # Test search from command line
109 111 python main.py search "query" --tenant-id 1 --size 10
... ... @@ -114,12 +116,8 @@ python main.py search "query" --tenant-id 1 --size 10
114 116 # Stop all services
115 117 ./scripts/stop.sh
116 118  
117   -# Test environment (for CI/development)
118   -./scripts/start_test_environment.sh
119   -./scripts/stop_test_environment.sh
120   -
121   -# Install server dependencies
122   -./scripts/install_server_deps.sh
  119 +# Run CI contract tests
  120 +./scripts/run_ci_tests.sh
123 121 ```
124 122  
125 123 ## Architecture Overview
... ... @@ -585,7 +583,7 @@ GET /admin/stats # Index statistics
585 583 ./scripts/start_frontend.sh # Frontend UI (port 6003)
586 584  
587 585 # Data Operations
588   -./scripts/ingest.sh <tenant_id> [recreate] # Index data
  586 +./scripts/create_tenant_index.sh <tenant_id> # Create tenant index
589 587 ./scripts/mock_data.sh # Generate test data
590 588  
591 589 # Testing
... ...
api/models.py
... ... @@ -154,7 +154,8 @@ class SearchRequest(BaseModel):
154 154 enable_rerank: Optional[bool] = Field(
155 155 None,
156 156 description=(
157   - "是否开启重排(调用外部重排服务对 ES 结果进行二次排序)。"
  157 + "是否开启最终重排(调用外部 rerank 服务改写上一阶段顺序)。"
  158 + "关闭时仍保留 coarse/fine 流程,仅在 rerank 阶段保序透传。"
158 159 "不传则使用服务端配置 rerank.enabled(默认开启)。"
159 160 )
160 161 )
... ...
api/routes/indexer.py
... ... @@ -7,7 +7,7 @@
7 7 import asyncio
8 8 import re
9 9 from fastapi import APIRouter, HTTPException
10   -from typing import Any, Dict, List, Optional
  10 +from typing import Any, Dict, List, Literal, Optional
11 11 from pydantic import BaseModel, Field
12 12 import logging
13 13 from sqlalchemy import text
... ... @@ -19,6 +19,11 @@ logger = logging.getLogger(__name__)
19 19  
20 20 router = APIRouter(prefix="/indexer", tags=["indexer"])
21 21  
  22 +SUPPORTED_CATEGORY_TAXONOMY_PROFILES = (
  23 + "apparel, 3c, bags, pet_supplies, electronics, outdoor, "
  24 + "home_appliances, home_living, wigs, beauty, accessories, toys, shoes, sports, others"
  25 +)
  26 +
22 27  
23 28 class ReindexRequest(BaseModel):
24 29 """全量重建索引请求"""
... ... @@ -88,11 +93,42 @@ class EnrichContentItem(BaseModel):
88 93  
89 94 class EnrichContentRequest(BaseModel):
90 95 """
91   - 内容理解字段生成请求:根据商品标题批量生成 qanchors、enriched_attributes、tags
  96 + 内容理解字段生成请求:根据商品标题批量生成通用增强字段与品类 taxonomy 字段
92 97 供外部 indexer 在自行组织 doc 时调用,与翻译、向量化等微服务并列。
93 98 """
94 99 tenant_id: str = Field(..., description="租户 ID,用于请求路由与结果归属,不参与缓存键")
95 100 items: List[EnrichContentItem] = Field(..., description="待分析的 SPU 列表(spu_id + title,可附带 brief/description/image_url)")
  101 + enrichment_scopes: Optional[List[Literal["generic", "category_taxonomy"]]] = Field(
  102 + default=None,
  103 + description=(
  104 + "要执行的增强范围。"
  105 + "`generic` 返回 qanchors/enriched_tags/enriched_attributes;"
  106 + "`category_taxonomy` 返回 enriched_taxonomy_attributes。"
  107 + "默认两者都执行。"
  108 + ),
  109 + )
  110 + category_taxonomy_profile: str = Field(
  111 + "apparel",
  112 + description=(
  113 + "品类 taxonomy profile。默认 `apparel`。"
  114 + f"当前支持:{SUPPORTED_CATEGORY_TAXONOMY_PROFILES}。"
  115 + "其中除 `apparel` 外,其余 profile 的 taxonomy 输出仅返回 `en`。"
  116 + ),
  117 + )
  118 + analysis_kinds: Optional[List[Literal["content", "taxonomy"]]] = Field(
  119 + default=None,
  120 + description="Deprecated alias of enrichment_scopes. `content` -> `generic`, `taxonomy` -> `category_taxonomy`.",
  121 + )
  122 +
  123 + def resolved_enrichment_scopes(self) -> List[str]:
  124 + if self.enrichment_scopes:
  125 + return list(self.enrichment_scopes)
  126 + if self.analysis_kinds:
  127 + mapped = []
  128 + for item in self.analysis_kinds:
  129 + mapped.append("generic" if item == "content" else "category_taxonomy")
  130 + return mapped
  131 + return ["generic", "category_taxonomy"]
96 132  
97 133  
98 134 @router.post("/reindex")
... ... @@ -440,20 +476,31 @@ async def build_docs_from_db(request: BuildDocsFromDbRequest):
440 476 raise HTTPException(status_code=500, detail=f"Internal server error: {str(e)}")
441 477  
442 478  
443   -def _run_enrich_content(tenant_id: str, items: List[Dict[str, str]]) -> List[Dict[str, Any]]:
  479 +def _run_enrich_content(
  480 + tenant_id: str,
  481 + items: List[Dict[str, str]],
  482 + enrichment_scopes: Optional[List[str]] = None,
  483 + category_taxonomy_profile: str = "apparel",
  484 +) -> List[Dict[str, Any]]:
444 485 """
445 486 同步执行内容理解,返回与 ES mapping 对齐的字段结构。
446 487 语言策略由 product_enrich 内部统一决定,路由层不参与。
447 488 """
448 489 from indexer.product_enrich import build_index_content_fields
449 490  
450   - results = build_index_content_fields(items=items, tenant_id=tenant_id)
  491 + results = build_index_content_fields(
  492 + items=items,
  493 + tenant_id=tenant_id,
  494 + enrichment_scopes=enrichment_scopes,
  495 + category_taxonomy_profile=category_taxonomy_profile,
  496 + )
451 497 return [
452 498 {
453 499 "spu_id": item["id"],
454 500 "qanchors": item["qanchors"],
455 501 "enriched_attributes": item["enriched_attributes"],
456 502 "enriched_tags": item["enriched_tags"],
  503 + "enriched_taxonomy_attributes": item["enriched_taxonomy_attributes"],
457 504 **({"error": item["error"]} if item.get("error") else {}),
458 505 }
459 506 for item in results
... ... @@ -463,15 +510,15 @@ def _run_enrich_content(tenant_id: str, items: List[Dict[str, str]]) -> List[Dic
463 510 @router.post("/enrich-content")
464 511 async def enrich_content(request: EnrichContentRequest):
465 512 """
466   - 内容理解字段生成接口:根据商品标题批量生成 qanchors、enriched_attributes、tags
  513 + 内容理解字段生成接口:根据商品标题批量生成通用增强字段与品类 taxonomy 字段
467 514  
468 515 使用场景:
469 516 - 外部 indexer 采用「微服务组合」方式自己组织 doc 时,可调用本接口获取 LLM 生成的
470 517 锚文本与语义属性,再与翻译、向量化结果合并写入 ES。
471 518 - 与 /indexer/build-docs 解耦,避免 build-docs 因 LLM 耗时过长而阻塞;调用方可
472   - 先拿不含 qanchors/enriched_tags 的 doc,再异步或离线补齐本接口结果后更新 ES。
  519 + 先拿不含 qanchors/enriched_tags/taxonomy attributes 的 doc,再异步或离线补齐本接口结果后更新 ES。
473 520  
474   - 实现逻辑与 indexer.product_enrich.analyze_products 一致,支持多语言与 Redis 缓存。
  521 + 实现逻辑与 indexer.product_enrich.build_index_content_fields 一致,支持多语言与 Redis 缓存。
475 522 """
476 523 try:
477 524 if not request.items:
... ... @@ -493,15 +540,20 @@ async def enrich_content(request: EnrichContentRequest):
493 540 for it in request.items
494 541 ]
495 542 loop = asyncio.get_event_loop()
  543 + enrichment_scopes = request.resolved_enrichment_scopes()
496 544 result = await loop.run_in_executor(
497 545 None,
498 546 lambda: _run_enrich_content(
499 547 tenant_id=request.tenant_id,
500   - items=items_payload
  548 + items=items_payload,
  549 + enrichment_scopes=enrichment_scopes,
  550 + category_taxonomy_profile=request.category_taxonomy_profile,
501 551 ),
502 552 )
503 553 return {
504 554 "tenant_id": request.tenant_id,
  555 + "enrichment_scopes": enrichment_scopes,
  556 + "category_taxonomy_profile": request.category_taxonomy_profile,
505 557 "results": result,
506 558 "total": len(result),
507 559 }
... ...
api/translator_app.py
... ... @@ -271,16 +271,20 @@ async def lifespan(_: FastAPI):
271 271 """Initialize all enabled translation backends on process startup."""
272 272 logger.info("Starting Translation Service API")
273 273 service = get_translation_service()
  274 + failed_models = list(getattr(service, "failed_models", []))
  275 + backend_errors = dict(getattr(service, "backend_errors", {}))
274 276 logger.info(
275   - "Translation service ready | default_model=%s default_scene=%s available_models=%s loaded_models=%s",
  277 + "Translation service ready | default_model=%s default_scene=%s available_models=%s loaded_models=%s failed_models=%s",
276 278 service.config["default_model"],
277 279 service.config["default_scene"],
278 280 service.available_models,
279 281 service.loaded_models,
  282 + failed_models,
280 283 )
281 284 logger.info(
282   - "Translation backends initialized on startup | models=%s",
  285 + "Translation backends initialized on startup | loaded=%s failed=%s",
283 286 service.loaded_models,
  287 + backend_errors,
284 288 )
285 289 verbose_logger.info(
286 290 "Translation startup detail | capabilities=%s cache_ttl_seconds=%s cache_sliding_expiration=%s",
... ... @@ -316,11 +320,14 @@ async def health_check():
316 320 """Health check endpoint."""
317 321 try:
318 322 service = get_translation_service()
  323 + failed_models = list(getattr(service, "failed_models", []))
  324 + backend_errors = dict(getattr(service, "backend_errors", {}))
319 325 logger.info(
320   - "Health check | default_model=%s default_scene=%s loaded_models=%s",
  326 + "Health check | default_model=%s default_scene=%s loaded_models=%s failed_models=%s",
321 327 service.config["default_model"],
322 328 service.config["default_scene"],
323 329 service.loaded_models,
  330 + failed_models,
324 331 )
325 332 return {
326 333 "status": "healthy",
... ... @@ -330,6 +337,8 @@ async def health_check():
330 337 "available_models": service.available_models,
331 338 "enabled_capabilities": get_enabled_translation_models(service.config),
332 339 "loaded_models": service.loaded_models,
  340 + "failed_models": failed_models,
  341 + "backend_errors": backend_errors,
333 342 }
334 343 except Exception as e:
335 344 logger.error(f"Health check failed: {e}")
... ... @@ -463,6 +472,10 @@ async def translate(request: TranslationRequest, http_request: Request):
463 472 latency_ms = (time.perf_counter() - request_started) * 1000
464 473 logger.warning("Translation validation error | error=%s latency_ms=%.2f", e, latency_ms)
465 474 raise HTTPException(status_code=400, detail=str(e)) from e
  475 + except RuntimeError as e:
  476 + latency_ms = (time.perf_counter() - request_started) * 1000
  477 + logger.warning("Translation backend unavailable | error=%s latency_ms=%.2f", e, latency_ms)
  478 + raise HTTPException(status_code=503, detail=str(e)) from e
466 479 except Exception as e:
467 480 latency_ms = (time.perf_counter() - request_started) * 1000
468 481 logger.error("Translation error | error=%s latency_ms=%.2f", e, latency_ms, exc_info=True)
... ...
benchmarks/README.md 0 → 100644
... ... @@ -0,0 +1,17 @@
  1 +# Benchmarks
  2 +
  3 +基准压测脚本统一放在 `benchmarks/`,不再和 `scripts/` 里的服务启动/运维脚本混放。
  4 +
  5 +目录约定:
  6 +
  7 +- `benchmarks/perf_api_benchmark.py`:通用 HTTP 接口压测入口
  8 +- `benchmarks/reranker/`:reranker 定向 benchmark、smoke、手工对比脚本
  9 +- `benchmarks/translation/`:translation 本地模型 benchmark
  10 +
  11 +这些脚本默认不是 CI 测试的一部分,因为它们通常具备以下特征:
  12 +
  13 +- 依赖真实服务、GPU、模型或特定数据集
  14 +- 结果受机器配置和运行时负载影响,不适合作为稳定回归门禁
  15 +- 更多用于容量评估、调参和问题复现,而不是功能正确性判定
  16 +
  17 +如果某个性能场景需要进入自动化回归,应新增到 `tests/` 下并明确收敛输入、环境和判定阈值,而不是直接复用这里的基准脚本。
... ...
scripts/perf_api_benchmark.py renamed to benchmarks/perf_api_benchmark.py
... ... @@ -11,13 +11,13 @@ Default scenarios (aligned with docs/搜索API对接指南 分册,如 -01 / -0
11 11 - rerank POST /rerank
12 12  
13 13 Examples:
14   - python scripts/perf_api_benchmark.py --scenario backend_search --duration 30 --concurrency 20 --tenant-id 162
15   - python scripts/perf_api_benchmark.py --scenario backend_suggest --duration 30 --concurrency 50 --tenant-id 162
16   - python scripts/perf_api_benchmark.py --scenario all --duration 60 --concurrency 80 --tenant-id 162
17   - python scripts/perf_api_benchmark.py --scenario all --cases-file scripts/perf_cases.json.example --output perf_result.json
  14 + python benchmarks/perf_api_benchmark.py --scenario backend_search --duration 30 --concurrency 20 --tenant-id 162
  15 + python benchmarks/perf_api_benchmark.py --scenario backend_suggest --duration 30 --concurrency 50 --tenant-id 162
  16 + python benchmarks/perf_api_benchmark.py --scenario all --duration 60 --concurrency 80 --tenant-id 162
  17 + python benchmarks/perf_api_benchmark.py --scenario all --cases-file benchmarks/perf_cases.json.example --output perf_result.json
18 18 # Embedding admission / priority (query param `priority`; same semantics as embedding service):
19   - python scripts/perf_api_benchmark.py --scenario embed_text --embed-text-priority 1 --duration 30 --concurrency 20
20   - python scripts/perf_api_benchmark.py --scenario embed_image --embed-image-priority 1 --duration 30 --concurrency 10
  19 + python benchmarks/perf_api_benchmark.py --scenario embed_text --embed-text-priority 1 --duration 30 --concurrency 20
  20 + python benchmarks/perf_api_benchmark.py --scenario embed_image --embed-image-priority 1 --duration 30 --concurrency 10
21 21 """
22 22  
23 23 from __future__ import annotations
... ... @@ -229,7 +229,7 @@ def apply_embed_priority_params(
229 229 ) -> None:
230 230 """
231 231 Merge default `priority` query param into embed templates when absent.
232   - `scripts/perf_cases.json` may set per-request `params.priority` to override.
  232 + `benchmarks/perf_cases.json` may set per-request `params.priority` to override.
233 233 """
234 234 mapping = {
235 235 "embed_text": max(0, int(embed_text_priority)),
... ...
scripts/perf_cases.json.example renamed to benchmarks/perf_cases.json.example
scripts/benchmark_reranker_1000docs.sh renamed to benchmarks/reranker/benchmark_reranker_1000docs.sh
... ... @@ -8,7 +8,7 @@
8 8 # Outputs JSON reports under perf_reports/<date>/reranker_1000docs/
9 9 #
10 10 # Usage:
11   -# ./scripts/benchmark_reranker_1000docs.sh
  11 +# ./benchmarks/reranker/benchmark_reranker_1000docs.sh
12 12 # Optional env:
13 13 # BATCH_SIZES="24 32 48 64"
14 14 # C1_REQUESTS=4
... ... @@ -85,7 +85,7 @@ run_bench() {
85 85 local c="$2"
86 86 local req="$3"
87 87 local out="${OUT_DIR}/rerank_bs${bs}_c${c}_r${req}.json"
88   - .venv/bin/python scripts/perf_api_benchmark.py \
  88 + .venv/bin/python benchmarks/perf_api_benchmark.py \
89 89 --scenario rerank \
90 90 --tenant-id "${TENANT_ID}" \
91 91 --reranker-base "${RERANK_BASE}" \
... ...
scripts/benchmark_reranker_gguf_local.py renamed to benchmarks/reranker/benchmark_reranker_gguf_local.py
... ... @@ -8,8 +8,8 @@ Runs the backend directly in a fresh process per config to measure:
8 8 - single-request rerank latency
9 9  
10 10 Example:
11   - ./.venv-reranker-gguf/bin/python scripts/benchmark_reranker_gguf_local.py
12   - ./.venv-reranker-gguf-06b/bin/python scripts/benchmark_reranker_gguf_local.py --backend-name qwen3_gguf_06b --docs 400
  11 + ./.venv-reranker-gguf/bin/python benchmarks/reranker/benchmark_reranker_gguf_local.py
  12 + ./.venv-reranker-gguf-06b/bin/python benchmarks/reranker/benchmark_reranker_gguf_local.py --backend-name qwen3_gguf_06b --docs 400
13 13 """
14 14  
15 15 from __future__ import annotations
... ...
scripts/benchmark_reranker_random_titles.py renamed to benchmarks/reranker/benchmark_reranker_random_titles.py
... ... @@ -10,10 +10,10 @@ Each invocation runs 3 warmup requests with n=400 first; those are not timed for
10 10  
11 11 Example:
12 12 source activate.sh
13   - python scripts/benchmark_reranker_random_titles.py 386
14   - python scripts/benchmark_reranker_random_titles.py 40,80,100
15   - python scripts/benchmark_reranker_random_titles.py 40,80,100 --repeat 3 --seed 42
16   - RERANK_BASE=http://127.0.0.1:6007 python scripts/benchmark_reranker_random_titles.py 200
  13 + python benchmarks/reranker/benchmark_reranker_random_titles.py 386
  14 + python benchmarks/reranker/benchmark_reranker_random_titles.py 40,80,100
  15 + python benchmarks/reranker/benchmark_reranker_random_titles.py 40,80,100 --repeat 3 --seed 42
  16 + RERANK_BASE=http://127.0.0.1:6007 python benchmarks/reranker/benchmark_reranker_random_titles.py 200
17 17 """
18 18  
19 19 from __future__ import annotations
... ...
tests/reranker_performance/curl1.sh renamed to benchmarks/reranker/manual/curl1.sh
tests/reranker_performance/curl1_simple.sh renamed to benchmarks/reranker/manual/curl1_simple.sh
tests/reranker_performance/curl2.sh renamed to benchmarks/reranker/manual/curl2.sh
tests/reranker_performance/rerank_performance_compare.sh renamed to benchmarks/reranker/manual/rerank_performance_compare.sh
scripts/patch_rerank_vllm_benchmark_config.py renamed to benchmarks/reranker/patch_rerank_vllm_benchmark_config.py
... ... @@ -73,7 +73,7 @@ def main() -&gt; int:
73 73 p.add_argument(
74 74 "--config",
75 75 type=Path,
76   - default=Path(__file__).resolve().parent.parent / "config" / "config.yaml",
  76 + default=Path(__file__).resolve().parents[2] / "config" / "config.yaml",
77 77 )
78 78 p.add_argument("--backend", choices=("qwen3_vllm", "qwen3_vllm_score"), required=True)
79 79 p.add_argument(
... ...
scripts/run_reranker_vllm_instruction_benchmark.sh renamed to benchmarks/reranker/run_reranker_vllm_instruction_benchmark.sh
... ... @@ -55,13 +55,13 @@ run_one() {
55 55 local jf="${OUT_DIR}/${backend}_${fmt}.json"
56 56  
57 57 echo "========== ${tag} =========="
58   - "$PYTHON" "${ROOT}/scripts/patch_rerank_vllm_benchmark_config.py" \
  58 + "$PYTHON" "${ROOT}/benchmarks/reranker/patch_rerank_vllm_benchmark_config.py" \
59 59 --backend "$backend" --instruction-format "$fmt"
60 60  
61 61 "${ROOT}/restart.sh" reranker
62 62 wait_health "$backend" "$fmt"
63 63  
64   - if ! "$PYTHON" "${ROOT}/scripts/benchmark_reranker_random_titles.py" \
  64 + if ! "$PYTHON" "${ROOT}/benchmarks/reranker/benchmark_reranker_random_titles.py" \
65 65 100,200,400,600,800,1000 \
66 66 --repeat 5 \
67 67 --seed 42 \
... ... @@ -82,7 +82,7 @@ run_one qwen3_vllm_score compact
82 82 run_one qwen3_vllm_score standard
83 83  
84 84 # Restore repo-default-style rerank settings (score + compact).
85   -"$PYTHON" "${ROOT}/scripts/patch_rerank_vllm_benchmark_config.py" \
  85 +"$PYTHON" "${ROOT}/benchmarks/reranker/patch_rerank_vllm_benchmark_config.py" \
86 86 --backend qwen3_vllm_score --instruction-format compact
87 87 "${ROOT}/restart.sh" reranker
88 88 wait_health qwen3_vllm_score compact
... ...
scripts/smoke_qwen3_vllm_score_backend.py renamed to benchmarks/reranker/smoke_qwen3_vllm_score_backend.py
... ... @@ -3,7 +3,7 @@
3 3 Smoke test: load Qwen3VLLMScoreRerankerBackend (must run as a file, not stdin — vLLM spawn).
4 4  
5 5 Usage (from repo root, score venv):
6   - PYTHONPATH=. ./.venv-reranker-score/bin/python scripts/smoke_qwen3_vllm_score_backend.py
  6 + PYTHONPATH=. ./.venv-reranker-score/bin/python benchmarks/reranker/smoke_qwen3_vllm_score_backend.py
7 7  
8 8 Same as production: vLLM child processes need the venv's ``bin`` on PATH (for pip's ``ninja`` when
9 9 vLLM auto-selects FLASHINFER on T4/Turing). ``start_reranker.sh`` exports that; this script prepends
... ... @@ -20,8 +20,8 @@ import sys
20 20 import sysconfig
21 21 from pathlib import Path
22 22  
23   -# Repo root on sys.path when run as scripts/smoke_*.py
24   -_ROOT = Path(__file__).resolve().parents[1]
  23 +# Repo root on sys.path when run from benchmarks/reranker/.
  24 +_ROOT = Path(__file__).resolve().parents[2]
25 25 if str(_ROOT) not in sys.path:
26 26 sys.path.insert(0, str(_ROOT))
27 27  
... ...
scripts/benchmark_nllb_t4_tuning.py renamed to benchmarks/translation/benchmark_nllb_t4_tuning.py
... ... @@ -11,12 +11,12 @@ from datetime import datetime
11 11 from pathlib import Path
12 12 from typing import Any, Dict, List, Tuple
13 13  
14   -PROJECT_ROOT = Path(__file__).resolve().parent.parent
  14 +PROJECT_ROOT = Path(__file__).resolve().parents[2]
15 15 if str(PROJECT_ROOT) not in sys.path:
16 16 sys.path.insert(0, str(PROJECT_ROOT))
17 17  
18 18 from config.services_config import get_translation_config
19   -from scripts.benchmark_translation_local_models import (
  19 +from benchmarks.translation.benchmark_translation_local_models import (
20 20 benchmark_concurrency_case,
21 21 benchmark_serial_case,
22 22 build_environment_info,
... ...
scripts/benchmark_translation_local_models.py renamed to benchmarks/translation/benchmark_translation_local_models.py
... ... @@ -22,7 +22,7 @@ from typing import Any, Dict, Iterable, List, Sequence
22 22 import torch
23 23 import transformers
24 24  
25   -PROJECT_ROOT = Path(__file__).resolve().parent.parent
  25 +PROJECT_ROOT = Path(__file__).resolve().parents[2]
26 26 if str(PROJECT_ROOT) not in sys.path:
27 27 sys.path.insert(0, str(PROJECT_ROOT))
28 28  
... ...
scripts/benchmark_translation_local_models_focus.py renamed to benchmarks/translation/benchmark_translation_local_models_focus.py
... ... @@ -11,12 +11,12 @@ from datetime import datetime
11 11 from pathlib import Path
12 12 from typing import Any, Dict, List
13 13  
14   -PROJECT_ROOT = Path(__file__).resolve().parent.parent
  14 +PROJECT_ROOT = Path(__file__).resolve().parents[2]
15 15 if str(PROJECT_ROOT) not in sys.path:
16 16 sys.path.insert(0, str(PROJECT_ROOT))
17 17  
18 18 from config.services_config import get_translation_config
19   -from scripts.benchmark_translation_local_models import (
  19 +from benchmarks.translation.benchmark_translation_local_models import (
20 20 SCENARIOS,
21 21 benchmark_concurrency_case,
22 22 benchmark_serial_case,
... ...
scripts/benchmark_translation_longtext_single.py renamed to benchmarks/translation/benchmark_translation_longtext_single.py
... ... @@ -13,7 +13,7 @@ from pathlib import Path
13 13  
14 14 import torch
15 15  
16   -PROJECT_ROOT = Path(__file__).resolve().parent.parent
  16 +PROJECT_ROOT = Path(__file__).resolve().parents[2]
17 17  
18 18 import sys
19 19  
... ...
config/config.yaml
1   -# Unified Configuration for Multi-Tenant Search Engine
2   -# 统一配置文件,所有租户共用一套配置
3   -# 注意:索引结构由 mappings/search_products.json 定义,此文件只配置搜索行为
4   -#
5   -# 约定:下列键为必填;进程环境变量可覆盖 infrastructure / runtime 中同名语义项
6   -#(如 ES_HOST、API_PORT 等),未设置环境变量时使用本文件中的值。
7   -
8   -# Process / bind addresses (环境变量 APP_ENV、RUNTIME_ENV、ES_INDEX_NAMESPACE 可覆盖前两者的语义)
9 1 runtime:
10 2 environment: prod
11 3 index_namespace: ''
... ... @@ -21,8 +13,6 @@ runtime:
21 13 translator_port: 6006
22 14 reranker_host: 0.0.0.0
23 15 reranker_port: 6007
24   -
25   -# 基础设施连接(敏感项优先读环境变量:ES_*、REDIS_*、DB_*、DASHSCOPE_API_KEY、DEEPL_AUTH_KEY)
26 16 infrastructure:
27 17 elasticsearch:
28 18 host: http://localhost:9200
... ... @@ -49,23 +39,12 @@ infrastructure:
49 39 secrets:
50 40 dashscope_api_key: null
51 41 deepl_auth_key: null
52   -
53   -# Elasticsearch Index
54 42 es_index_name: search_products
55   -
56   -# 检索域 / 索引列表(可为空列表;每项字段均需显式给出)
57 43 indexes: []
58   -
59   -# Config assets
60 44 assets:
61 45 query_rewrite_dictionary_path: config/dictionaries/query_rewrite.dict
62   -
63   -# Product content understanding (LLM enrich-content) configuration
64 46 product_enrich:
65 47 max_workers: 40
66   -
67   -# 离线 / Web 相关性评估(scripts/evaluation、eval-web)
68   -# CLI 未显式传参时使用此处默认值;search_base_url 未配置时自动为 http://127.0.0.1:{runtime.api_port}
69 48 search_evaluation:
70 49 artifact_root: artifacts/search_evaluation
71 50 queries_file: scripts/evaluation/queries/queries.txt
... ... @@ -74,10 +53,10 @@ search_evaluation:
74 53 search_base_url: ''
75 54 web_host: 0.0.0.0
76 55 web_port: 6010
77   - judge_model: qwen3.5-plus
  56 + judge_model: qwen3.6-plus
78 57 judge_enable_thinking: false
79 58 judge_dashscope_batch: false
80   - intent_model: qwen3-max
  59 + intent_model: qwen3.6-plus
81 60 intent_enable_thinking: true
82 61 judge_batch_completion_window: 24h
83 62 judge_batch_poll_interval_sec: 10.0
... ... @@ -98,20 +77,17 @@ search_evaluation:
98 77 rebuild_irrelevant_stop_ratio: 0.799
99 78 rebuild_irrel_low_combined_stop_ratio: 0.959
100 79 rebuild_irrelevant_stop_streak: 3
101   -
102   -# ES Index Settings (基础设置)
103 80 es_settings:
104 81 number_of_shards: 1
105 82 number_of_replicas: 0
106 83 refresh_interval: 30s
107 84  
108   -# 字段权重配置(用于搜索时的字段boost)
109   -# 统一按“字段基名”配置;查询时按实际检索语言动态拼接 .{lang}。
110   -# 若需要按某个语言单独调权,也可以加显式 key(例如 title.de: 3.2)。
  85 +# Configure by field base name; the active search language is appended as .{lang} at query time
111 86 field_boosts:
112 87 title: 3.0
113   - qanchors: 1.8
114   - enriched_tags: 1.8
  88 + # qanchors and enriched_tags also appear in enriched_attributes.value, so their effective weight is their own weight plus that of enriched_attributes.value
  89 + qanchors: 1.0
  90 + enriched_tags: 1.0
115 91 enriched_attributes.value: 1.5
116 92 category_name_text: 2.0
117 93 category_path: 2.0
... ... @@ -124,38 +100,25 @@ field_boosts:
124 100 description: 1.0
125 101 vendor: 1.0
126 102  
127   -# Query Configuration(查询配置)
128 103 query_config:
129   - # 支持的语言
130 104 supported_languages:
131 105 - zh
132 106 - en
133 107 default_language: en
134   -
135   - # 功能开关(翻译开关由tenant_config控制)
136 108 enable_text_embedding: true
137 109 enable_query_rewrite: true
138 110  
139   - # 查询翻译模型(须与 services.translation.capabilities 中某项一致)
140   - # 源语种在租户 index_languages 内:主召回可打在源语种字段,用下面三项。
141   - zh_to_en_model: nllb-200-distilled-600m # "opus-mt-zh-en"
142   - en_to_zh_model: nllb-200-distilled-600m # "opus-mt-en-zh"
143   - default_translation_model: nllb-200-distilled-600m
144   - # zh_to_en_model: deepl
145   - # en_to_zh_model: deepl
146   - # default_translation_model: deepl
147   - # 源语种不在 index_languages:翻译对可检索文本更关键,可单独指定(缺省则与上一组相同)
148   - zh_to_en_model__source_not_in_index: nllb-200-distilled-600m
149   - en_to_zh_model__source_not_in_index: nllb-200-distilled-600m
150   - default_translation_model__source_not_in_index: nllb-200-distilled-600m
151   - # zh_to_en_model__source_not_in_index: deepl
152   - # en_to_zh_model__source_not_in_index: deepl
153   - # default_translation_model__source_not_in_index: deepl
  111 + zh_to_en_model: deepl # nllb-200-distilled-600m
  112 + en_to_zh_model: deepl
  113 + default_translation_model: deepl
  114 + # When the source language is not in index_languages, translation quality matters more, so these are configured separately
  115 + zh_to_en_model__source_not_in_index: deepl
  116 + en_to_zh_model__source_not_in_index: deepl
  117 + default_translation_model__source_not_in_index: deepl
154 118  
155   - # 查询解析阶段:翻译与 query 向量并发执行,共用同一等待预算(毫秒)。
156   - # 检测语言已在租户 index_languages 内:较短;不在索引语言内:较长(翻译对召回更关键)。
157   - translation_embedding_wait_budget_ms_source_in_index: 300 # 80
158   - translation_embedding_wait_budget_ms_source_not_in_index: 400 # 200
  119 + # Query-parsing stage: translation and query embedding run concurrently and share one wait budget (ms)
  120 + translation_embedding_wait_budget_ms_source_in_index: 300
  121 + translation_embedding_wait_budget_ms_source_not_in_index: 400
159 122 style_intent:
160 123 enabled: true
161 124 selected_sku_boost: 1.2
... ... @@ -182,17 +145,15 @@ query_config:
182 145 product_title_exclusion:
183 146 enabled: true
184 147 dictionary_path: config/dictionaries/product_title_exclusion.tsv
185   -
186   - # 动态多语言检索字段配置
187   - # multilingual_fields 会被拼成 title.{lang}/brief.{lang}/... 形式;
188   - # shared_fields 为无语言后缀字段。
189 148 search_fields:
  149 + # Configure by field base name; the active search language is appended as .{lang} at query time
190 150 multilingual_fields:
191 151 - title
192 152 - keywords
193 153 - qanchors
194 154 - enriched_tags
195 155 - enriched_attributes.value
  156 + # - enriched_taxonomy_attributes.value
196 157 - option1_values
197 158 - option2_values
198 159 - option3_values
... ... @@ -202,13 +163,14 @@ query_config:
202 163 # - description
203 164 # - vendor
204 165 # shared_fields: 无语言后缀字段;示例: tags, option1_values, option2_values, option3_values
  166 +
205 167 shared_fields: null
206 168 core_multilingual_fields:
207 169 - title
208 170 - qanchors
209 171 - category_name_text
210 172  
211   - # 统一文本召回策略(主查询 + 翻译查询)
  173 + # Text recall (primary query + translated query)
212 174 text_query_strategy:
213 175 base_minimum_should_match: 60%
214 176 translation_minimum_should_match: 60%
... ... @@ -223,14 +185,10 @@ query_config:
223 185 title: 5.0
224 186 qanchors: 4.0
225 187 phrase_match_boost: 3.0
226   -
227   - # Embedding字段名称
228 188 text_embedding_field: title_embedding
229 189 image_embedding_field: image_embedding.vector
230 190  
231   - # 返回字段配置(_source includes)
232   - # null表示返回所有字段,[]表示不返回任何字段,列表表示只返回指定字段
233   - # 下列字段与 api/result_formatter.py(SpuResult 填充)及 search/searcher.py(SKU 排序/主图替换)一致
  191 + # null returns all fields; [] returns no fields
234 192 source_fields:
235 193 - spu_id
236 194 - handle
... ... @@ -251,6 +209,8 @@ query_config:
251 209 # - qanchors
252 210 # - enriched_tags
253 211 # - enriched_attributes
  212 + # - enriched_taxonomy_attributes.value
  213 +
254 214 - min_price
255 215 - compare_at_price
256 216 - image_url
... ... @@ -270,26 +230,21 @@ query_config:
270 230 # KNN:文本向量与多模态(图片)向量各自 boost 与召回(k / num_candidates)
271 231 knn_text_boost: 4
272 232 knn_image_boost: 4
273   -
274   - # knn_text_num_candidates = k * 3.4
275 233 knn_text_k: 160
276   - knn_text_num_candidates: 560
  234 + knn_text_num_candidates: 560 # = knn_text_k * 3.5
277 235 knn_text_k_long: 400
278 236 knn_text_num_candidates_long: 1200
279 237 knn_image_k: 400
280 238 knn_image_num_candidates: 1200
281 239  
282   -# Function Score配置(ES层打分规则)
283 240 function_score:
284 241 score_mode: sum
285 242 boost_mode: multiply
286 243 functions: []
287   -
288   -# 粗排配置(仅融合 ES 文本/向量信号,不调用模型)
289 244 coarse_rank:
290 245 enabled: true
291   - input_window: 700
292   - output_window: 240
  246 + input_window: 480
  247 + output_window: 160
293 248 fusion:
294 249 es_bias: 10.0
295 250 es_exponent: 0.05
... ... @@ -301,30 +256,29 @@ coarse_rank:
301 256 knn_text_weight: 1.0
302 257 knn_image_weight: 2.0
303 258 knn_tie_breaker: 0.3
304   - knn_bias: 0.6
305   - knn_exponent: 0.4
306   -
307   -# 精排配置(轻量 reranker)
  259 + knn_bias: 0.0
  260 + knn_exponent: 5.6
  261 + knn_text_exponent: 0.0
  262 + knn_image_exponent: 0.0
308 263 fine_rank:
309   - enabled: false
  264 + enabled: false # false = order-preserving pass-through
310 265 input_window: 160
311 266 output_window: 80
312 267 timeout_sec: 10.0
313 268 rerank_query_template: '{query}'
314 269 rerank_doc_template: '{title}'
315 270 service_profile: fine
316   -
317   -# 重排配置(provider/URL 在 services.rerank)
318 271 rerank:
319   - enabled: true
  272 + enabled: false # false = order-preserving pass-through
320 273 rerank_window: 160
  274 + exact_knn_rescore_enabled: true
  275 + exact_knn_rescore_window: 160
321 276 timeout_sec: 15.0
322 277 weight_es: 0.4
323 278 weight_ai: 0.6
324 279 rerank_query_template: '{query}'
325 280 rerank_doc_template: '{title}'
326 281 service_profile: default
327   -
328 282 # 乘法融合:fused = Π (max(score,0) + bias) ** exponent(es / rerank / fine / text / knn)
329 283 # 其中 knn_score 先做一层 dis_max:
330 284 # max(knn_text_weight * text_knn, knn_image_weight * image_knn)
... ... @@ -337,30 +291,28 @@ rerank:
337 291 fine_bias: 0.1
338 292 fine_exponent: 1.0
339 293 text_bias: 0.1
340   - text_exponent: 0.25
341 294 # base_query_trans_* 相对 base_query 的权重(见 search/rerank_client 中文本 dismax 融合)
  295 + text_exponent: 0.25
342 296 text_translation_weight: 0.8
343 297 knn_text_weight: 1.0
344 298 knn_image_weight: 2.0
345 299 knn_tie_breaker: 0.3
346   - knn_bias: 0.6
347   - knn_exponent: 0.4
  300 + knn_bias: 0.0
  301 + knn_exponent: 5.6
348 302  
349   -# 可扩展服务/provider 注册表(单一配置源)
350 303 services:
351 304 translation:
352 305 service_url: http://127.0.0.1:6006
353   - # default_model: nllb-200-distilled-600m
354 306 default_model: nllb-200-distilled-600m
355 307 default_scene: general
356 308 timeout_sec: 10.0
357 309 cache:
358 310 ttl_seconds: 62208000
359 311 sliding_expiration: true
360   - # When false, cache keys are exact-match per request model only (ignores model_quality_tiers for lookups).
361   - enable_model_quality_tier_cache: true
  312 + # When false, cache keys are exact-match per request model only (ignores model_quality_tiers for lookups)
362 313 # Higher tier = better quality. Multiple models may share one tier (同级).
363 314 # A request may reuse Redis keys from models with tier > A or tier == A (not from lower tiers).
  315 + enable_model_quality_tier_cache: true
364 316 model_quality_tiers:
365 317 deepl: 30
366 318 qwen-mt: 30
... ... @@ -454,13 +406,12 @@ services:
454 406 num_beams: 1
455 407 use_cache: true
456 408 embedding:
457   - provider: http # http
  409 + provider: http
458 410 providers:
459 411 http:
460 412 text_base_url: http://127.0.0.1:6005
461 413 image_base_url: http://127.0.0.1:6008
462   - # 服务内文本后端(embedding 进程启动时读取)
463   - backend: tei # tei | local_st
  414 + backend: tei
464 415 backends:
465 416 tei:
466 417 base_url: http://127.0.0.1:8080
... ... @@ -500,13 +451,13 @@ services:
500 451 request:
501 452 max_docs: 1000
502 453 normalize: true
503   - default_instance: default
504 454 # 命名实例:同一套 reranker 代码按实例名读取不同端口 / 后端 / runtime 目录。
  455 + default_instance: default
505 456 instances:
506 457 default:
507 458 host: 0.0.0.0
508 459 port: 6007
509   - backend: qwen3_vllm_score
  460 + backend: bge
510 461 runtime_dir: ./.runtime/reranker/default
511 462 fine:
512 463 host: 0.0.0.0
... ... @@ -543,6 +494,7 @@ services:
543 494 enforce_eager: false
544 495 infer_batch_size: 100
545 496 sort_by_doc_length: true
  497 +
546 498 # standard=_format_instruction__standard(固定 yes/no system);compact=_format_instruction(instruction 作 system 且 user 内重复 Instruct)
547 499 instruction_format: standard # compact standard
548 500 # instruction: "Given a query, score the product for relevance"
... ... @@ -556,6 +508,7 @@ services:
556 508 # instruction: "Rank products by query with category & style match prioritized"
557 509 # instruction: "Given a fashion shopping query, retrieve relevant products that answer the query"
558 510 instruction: rank products by given query
  511 +
559 512 # vLLM LLM.score()(跨编码打分)。独立高性能环境 .venv-reranker-score(vllm 0.18 固定版):./scripts/setup_reranker_venv.sh qwen3_vllm_score
560 513 # 与 qwen3_vllm 可共用同一 model_name / HF 缓存;venv 分离以便升级 vLLM 而不影响 generate 后端。
561 514 qwen3_vllm_score:
... ... @@ -583,15 +536,10 @@ services:
583 536 qwen3_transformers:
584 537 model_name: Qwen/Qwen3-Reranker-0.6B
585 538 instruction: rank products by given query
586   - # instruction: "Score the product’s relevance to the given query"
587 539 max_length: 8192
588 540 batch_size: 64
589 541 use_fp16: true
590   - # sdpa:默认无需 flash-attn;若已安装 flash_attn 可改为 flash_attention_2
591 542 attn_implementation: sdpa
592   - # Packed Transformers backend: shared query prefix + custom position_ids/attention_mask.
593   - # For 1 query + many short docs (for example 400 product titles), this usually reduces
594   - # repeated prefix work and padding waste compared with pairwise batching.
595 543 qwen3_transformers_packed:
596 544 model_name: Qwen/Qwen3-Reranker-0.6B
597 545 instruction: Rank products by query with category & style match prioritized
... ... @@ -600,8 +548,6 @@ services:
600 548 max_docs_per_pack: 0
601 549 use_fp16: true
602 550 sort_by_doc_length: true
603   - # Packed mode relies on a custom 4D attention mask. "eager" is the safest default.
604   - # If your torch/transformers stack validates it, you can benchmark "sdpa".
605 551 attn_implementation: eager
606 552 qwen3_gguf:
607 553 repo_id: DevQuasar/Qwen.Qwen3-Reranker-4B-GGUF
... ... @@ -609,7 +555,6 @@ services:
609 555 cache_dir: ./model_cache
610 556 local_dir: ./models/reranker/qwen3-reranker-4b-gguf
611 557 instruction: Rank products by query with category & style match prioritized
612   - # T4 16GB / 性能优先配置:全量层 offload,实测比保守配置明显更快
613 558 n_ctx: 512
614 559 n_batch: 512
615 560 n_ubatch: 512
... ... @@ -632,8 +577,6 @@ services:
632 577 cache_dir: ./model_cache
633 578 local_dir: ./models/reranker/qwen3-reranker-0.6b-q8_0-gguf
634 579 instruction: Rank products by query with category & style match prioritized
635   - # 0.6B GGUF / online rerank baseline:
636   - # 实测 400 titles 单请求约 265s,因此它更适合作为低显存功能后备,不适合在线低延迟主路由。
637 580 n_ctx: 256
638 581 n_batch: 256
639 582 n_ubatch: 256
... ... @@ -653,20 +596,15 @@ services:
653 596 verbose: false
654 597 dashscope_rerank:
655 598 model_name: qwen3-rerank
656   - # 按地域选择 endpoint:
657   - # 中国: https://dashscope.aliyuncs.com/compatible-api/v1/reranks
658   - # 新加坡: https://dashscope-intl.aliyuncs.com/compatible-api/v1/reranks
659   - # 美国: https://dashscope-us.aliyuncs.com/compatible-api/v1/reranks
660 599 endpoint: https://dashscope.aliyuncs.com/compatible-api/v1/reranks
661 600 api_key_env: RERANK_DASHSCOPE_API_KEY_CN
662 601 timeout_sec: 10.0
663   - top_n_cap: 0 # 0 表示 top_n=当前请求文档数;>0 则限制 top_n 上限
664   - batchsize: 64 # 0 关闭;>0 启用并发小包调度(top_n/top_n_cap 仍生效,分包后全局截断)
  602 + top_n_cap: 0 # 0 means top_n = the current request's document count
  603 + batchsize: 64 # 0 disables; >0 enables concurrent small-batch dispatch (top_n/top_n_cap still apply; global truncation after batching)
665 604 instruct: Given a shopping query, rank product titles by relevance
666 605 max_retries: 2
667 606 retry_backoff_sec: 0.2
668 607  
669   -# SPU配置(已启用,使用嵌套skus)
670 608 spu_config:
671 609 enabled: true
672 610 spu_field: spu_id
... ... @@ -678,7 +616,6 @@ spu_config:
678 616 - option2
679 617 - option3
680 618  
681   -# 租户配置(Tenant Configuration)
682 619 # 每个租户可配置主语言 primary_language 与索引语言 index_languages(主市场语言,商家可勾选)
683 620 # 默认 index_languages: [en, zh],可配置为任意 SOURCE_LANG_CODE_MAP.keys() 的子集
684 621 tenant_config:
... ...
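The multiplicative fusion that the `coarse_rank` / `rerank` config comments describe (`fused = Π (max(score, 0) + bias) ** exponent`, with a dis_max over the weighted text/image KNN signals) can be sketched as follows. The function name and defaults here are illustrative only; the repo's actual fusion lives in `search/rerank_client`:

```python
def fuse_scores(
    es_score: float,
    text_knn: float,
    image_knn: float,
    *,
    es_bias: float = 10.0,
    es_exponent: float = 0.05,
    knn_text_weight: float = 1.0,
    knn_image_weight: float = 2.0,
    knn_bias: float = 0.0,
    knn_exponent: float = 5.6,
) -> float:
    """fused = product of (max(score, 0) + bias) ** exponent terms."""
    # dis_max: keep only the stronger of the two weighted KNN signals.
    knn_score = max(knn_text_weight * text_knn, knn_image_weight * image_knn)
    es_term = (max(es_score, 0.0) + es_bias) ** es_exponent
    knn_term = (max(knn_score, 0.0) + knn_bias) ** knn_exponent
    return es_term * knn_term
```

Note that an exponent of `0.0` makes its term `(x + bias) ** 0 == 1.0`, so that signal drops out of the product, which is why the new per-signal `knn_text_exponent` / `knn_image_exponent` defaults of `0.0` are neutral.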
config/loader.py
... ... @@ -587,6 +587,14 @@ class AppConfigLoader:
587 587 knn_tie_breaker=float(coarse_fusion_raw.get("knn_tie_breaker", 0.0)),
588 588 knn_bias=float(coarse_fusion_raw.get("knn_bias", 0.6)),
589 589 knn_exponent=float(coarse_fusion_raw.get("knn_exponent", 0.2)),
  590 + knn_text_bias=float(
  591 + coarse_fusion_raw.get("knn_text_bias", coarse_fusion_raw.get("knn_bias", 0.6))
  592 + ),
  593 + knn_text_exponent=float(coarse_fusion_raw.get("knn_text_exponent", 0.0)),
  594 + knn_image_bias=float(
  595 + coarse_fusion_raw.get("knn_image_bias", coarse_fusion_raw.get("knn_bias", 0.6))
  596 + ),
  597 + knn_image_exponent=float(coarse_fusion_raw.get("knn_image_exponent", 0.0)),
590 598 text_translation_weight=float(
591 599 coarse_fusion_raw.get("text_translation_weight", 0.8)
592 600 ),
... ... @@ -608,6 +616,12 @@ class AppConfigLoader:
608 616 rerank=RerankConfig(
609 617 enabled=bool(rerank_cfg.get("enabled", True)),
610 618 rerank_window=int(rerank_cfg.get("rerank_window", 384)),
  619 + exact_knn_rescore_enabled=bool(
  620 + rerank_cfg.get("exact_knn_rescore_enabled", False)
  621 + ),
  622 + exact_knn_rescore_window=int(
  623 + rerank_cfg.get("exact_knn_rescore_window", 0)
  624 + ),
611 625 timeout_sec=float(rerank_cfg.get("timeout_sec", 15.0)),
612 626 weight_es=float(rerank_cfg.get("weight_es", 0.4)),
613 627 weight_ai=float(rerank_cfg.get("weight_ai", 0.6)),
... ... @@ -630,6 +644,14 @@ class AppConfigLoader:
630 644 knn_tie_breaker=float(fusion_raw.get("knn_tie_breaker", 0.0)),
631 645 knn_bias=float(fusion_raw.get("knn_bias", 0.6)),
632 646 knn_exponent=float(fusion_raw.get("knn_exponent", 0.2)),
  647 + knn_text_bias=float(
  648 + fusion_raw.get("knn_text_bias", fusion_raw.get("knn_bias", 0.6))
  649 + ),
  650 + knn_text_exponent=float(fusion_raw.get("knn_text_exponent", 0.0)),
  651 + knn_image_bias=float(
  652 + fusion_raw.get("knn_image_bias", fusion_raw.get("knn_bias", 0.6))
  653 + ),
  654 + knn_image_exponent=float(fusion_raw.get("knn_image_exponent", 0.0)),
633 655 fine_bias=float(fusion_raw.get("fine_bias", 0.00001)),
634 656 fine_exponent=float(fusion_raw.get("fine_exponent", 1.0)),
635 657 text_translation_weight=float(
... ... @@ -655,6 +677,14 @@ class AppConfigLoader:
655 677  
656 678 translation_raw = raw.get("translation") if isinstance(raw.get("translation"), dict) else {}
657 679 normalized_translation = build_translation_config(translation_raw)
  680 + local_translation_backends = {"local_nllb", "local_marian"}
  681 + for capability_name, capability_cfg in normalized_translation["capabilities"].items():
  682 + backend_name = str(capability_cfg.get("backend") or "").strip().lower()
  683 + if backend_name not in local_translation_backends:
  684 + continue
  685 + for path_key in ("model_dir", "ct2_model_dir"):
  686 + if capability_cfg.get(path_key) not in (None, ""):
  687 + capability_cfg[path_key] = str(self._resolve_project_path_value(capability_cfg[path_key]).resolve())
658 688 translation_config = TranslationServiceConfig(
659 689 endpoint=str(normalized_translation["service_url"]).rstrip("/"),
660 690 timeout_sec=float(normalized_translation["timeout_sec"]),
... ... @@ -749,7 +779,7 @@ class AppConfigLoader:
749 779 port=port,
750 780 backend=backend_name,
751 781 runtime_dir=(
752   - str(v)
  782 + str(self._resolve_project_path_value(v).resolve())
753 783 if (v := instance_raw.get("runtime_dir")) not in (None, "")
754 784 else None
755 785 ),
... ... @@ -787,6 +817,12 @@ class AppConfigLoader:
787 817 rerank=rerank_config,
788 818 )
789 819  
  820 + def _resolve_project_path_value(self, value: Any) -> Path:
  821 + candidate = Path(str(value)).expanduser()
  822 + if candidate.is_absolute():
  823 + return candidate
  824 + return self.project_root / candidate
  825 +
790 826 def _build_tenants_config(self, raw: Dict[str, Any]) -> TenantCatalogConfig:
791 827 if not isinstance(raw, dict):
792 828 raise ConfigurationError("tenant_config must be a mapping")
... ...
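The new `_resolve_project_path_value` helper anchors relative config paths (translation `model_dir` / `ct2_model_dir`, reranker `runtime_dir`) at the repo root instead of the process's working directory. A standalone sketch of the same idea, as a free function rather than the loader method:

```python
from pathlib import Path


def resolve_project_path(value: str, project_root: Path) -> Path:
    """Expand ~, pass absolute paths through, anchor relative ones at the project root."""
    candidate = Path(value).expanduser()
    if candidate.is_absolute():
        return candidate
    return project_root / candidate
```

With this, `./.runtime/reranker/default` resolves to the same directory no matter where the service is started from.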
config/schema.py
... ... @@ -119,6 +119,18 @@ class RerankFusionConfig:
119 119 knn_tie_breaker: float = 0.0
120 120 knn_bias: float = 0.6
121 121 knn_exponent: float = 0.2
  122 + #: Optional additive floor for the weighted text KNN term.
  123 + #: Falls back to knn_bias when omitted in config loading.
  124 + knn_text_bias: float = 0.6
  125 + #: Optional exponent for an extra multiplicative factor on the weighted text KNN term; 0.0 disables it.
  126 + #: Uses knn_text_bias as the additive floor.
  127 + knn_text_exponent: float = 0.0
  128 + #: Optional additive floor for the weighted image KNN term.
  129 + #: Falls back to knn_bias when omitted in config loading.
  130 + knn_image_bias: float = 0.6
  131 + #: Optional exponent for an extra multiplicative factor on the weighted image KNN term; 0.0 disables it.
  132 + #: Uses knn_image_bias as the additive floor.
  133 + knn_image_exponent: float = 0.0
122 134 fine_bias: float = 0.00001
123 135 fine_exponent: float = 1.0
124 136 #: 翻译子句 named query 分数相对原文 base_query 的权重(加权后再与原文做 dismax 融合)
... ... @@ -143,6 +155,18 @@ class CoarseRankFusionConfig:
143 155 knn_tie_breaker: float = 0.0
144 156 knn_bias: float = 0.6
145 157 knn_exponent: float = 0.2
  158 + #: Optional additive floor for the weighted text KNN term.
  159 + #: Falls back to knn_bias when omitted in config loading.
  160 + knn_text_bias: float = 0.6
  161 + #: Optional exponent for an extra multiplicative factor on the weighted text KNN term; 0.0 disables it.
  162 + #: Uses knn_text_bias as the additive floor.
  163 + knn_text_exponent: float = 0.0
  164 + #: Optional additive floor for the weighted image KNN term.
  165 + #: Falls back to knn_bias when omitted in config loading.
  166 + knn_image_bias: float = 0.6
  167 + #: Optional exponent for an extra multiplicative factor on the weighted image KNN term; 0.0 disables it.
  168 + #: Uses knn_image_bias as the additive floor.
  169 + knn_image_exponent: float = 0.0
146 170 #: 翻译子句 named query 分数相对原文 base_query 的权重(加权后再与原文做 dismax 融合)
147 171 text_translation_weight: float = 0.8
148 172  
... ... @@ -176,6 +200,9 @@ class RerankConfig:
176 200  
177 201 enabled: bool = True
178 202 rerank_window: int = 384
  203 + exact_knn_rescore_enabled: bool = False
  204 + #: topN exact vector scoring window; <=0 means "follow rerank_window"
  205 + exact_knn_rescore_window: int = 0
179 206 timeout_sec: float = 15.0
180 207 weight_es: float = 0.4
181 208 weight_ai: float = 0.6
... ...
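Per the schema comment, an `exact_knn_rescore_window` of zero or below means "follow rerank_window". A minimal sketch of that resolution (the helper name is illustrative, not from the repo):

```python
def effective_exact_knn_window(exact_knn_rescore_window: int, rerank_window: int) -> int:
    # <= 0 falls back to the rerank window, per the schema docstring.
    return exact_knn_rescore_window if exact_knn_rescore_window > 0 else rerank_window
```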
docs/DEVELOPER_GUIDE.md
... ... @@ -389,7 +389,7 @@ services:
389 389 - **位置**:`tests/`,可按 `unit/`、`integration/` 或按模块划分子目录;公共 fixture 在 `conftest.py`。
390 390 - **标记**:使用 `@pytest.mark.unit`、`@pytest.mark.integration`、`@pytest.mark.api` 等区分用例类型,便于按需运行。
391 391 - **依赖**:单元测试通过 mock(如 `mock_es_client`、`sample_search_config`)不依赖真实 ES/DB;集成测试需在说明中注明依赖服务。
392   -- **运行**:`python -m pytest tests/`;仅单元:`python -m pytest tests/unit/` 或 `-m unit`
  392 +- **运行**:`python -m pytest tests/`;推荐最小回归:`python -m pytest tests/ci -q`;按模块聚焦可直接指定具体测试文件
393 393 - **原则**:新增逻辑应有对应测试;修改协议或配置契约时更新相关测试与 fixture。
394 394  
395 395 ### 8.3 配置与环境
... ...
docs/QUICKSTART.md
... ... @@ -69,7 +69,7 @@ source activate.sh
69 69 ./run.sh all
70 70 # 仅为薄封装:等价于 ./scripts/service_ctl.sh up all
71 71 # 说明:
72   -# - all = tei cnclip embedding embedding-image translator reranker reranker-fine backend indexer frontend eval-web
  72 +# - all = tei cnclip embedding embedding-image translator reranker backend indexer frontend eval-web
73 73 # - up 会同时启动 monitor daemon(运行期连续失败自动重启)
74 74 # - reranker 为 GPU 强制模式(资源不足会直接启动失败)
75 75 # - TEI 默认使用 GPU;当 TEI_DEVICE=cuda 且 GPU 不可用时会直接失败(不会自动降级到 CPU)
... ... @@ -166,7 +166,7 @@ curl -X POST http://localhost:6008/embed/image \
166 166  
167 167 ```bash
168 168 ./scripts/setup_translator_venv.sh
169   -./.venv-translator/bin/python scripts/download_translation_models.py --all-local # 如需本地模型
  169 +./.venv-translator/bin/python scripts/translation/download_translation_models.py --all-local # if local models are needed
170 170 ./scripts/start_translator.sh
171 171  
172 172 curl -X POST http://localhost:6006/translate \
... ...
docs/Usage-Guide.md
... ... @@ -126,7 +126,7 @@ cd /data/saas-search
126 126  
127 127 这个脚本会自动:
128 128 1. 创建日志目录
129   -2. 按目标启动服务(`all`:`tei cnclip embedding embedding-image translator reranker reranker-fine backend indexer frontend eval-web`)
  129 +2. 按目标启动服务(`all`:`tei cnclip embedding embedding-image translator reranker backend indexer frontend eval-web`)
130 130 3. 写入 PID 到 `logs/*.pid`
131 131 4. 执行健康检查
132 132 5. 启动 monitor daemon(运行期连续失败自动重启)
... ... @@ -202,7 +202,7 @@ python -m pytest -q tests/test_rerank_client.py tests/test_es_query_builder.py t
202 202 ./scripts/service_ctl.sh restart backend
203 203 sleep 3
204 204 ./scripts/service_ctl.sh status backend
205   -./scripts/evaluation/start_eval.sh.sh batch
  205 +./scripts/evaluation/start_eval.sh batch
206 206 ```
207 207  
208 208 离线批量评估会把标注与报表写到 `artifacts/search_evaluation/`(SQLite、`batch_reports/` 下的 JSON/Markdown 等)。说明与命令见 [scripts/evaluation/README.md](../scripts/evaluation/README.md)。
... ...
docs/caches-inventory.md 0 → 100644
... ... @@ -0,0 +1,133 @@
  1 +# 本项目缓存一览
  2 +
  3 +本文档梳理仓库内**与业务相关的各类缓存**:说明用途、键与过期策略,并汇总运维脚本。按「分布式(Redis)→ 进程内 → 磁盘/模型 → 第三方」组织。
  4 +
  5 +---
  6 +
  7 +## 一、Redis 集中式缓存(生产主路径)
  8 +
  9 +所有下列缓存默认连接 **`infrastructure.redis`**(`config/config.yaml` 与 `REDIS_*` 环境变量),**数据库编号一般为 `db=0`**(脚本可通过参数覆盖)。`snapshot_db` 仅在配置中存在,供快照/运维场景选用,应用代码未按该字段切换业务缓存的 DB。
  10 +
  11 +### 1. 文本 / 图像向量缓存(Embedding)
  12 +
  13 +- **作用**:缓存 BGE/TEI 文本向量与 CN-CLIP 图像向量、CLIP 文本塔向量,避免重复推理。
  14 +- **实现**:`embeddings/redis_embedding_cache.py` 的 `RedisEmbeddingCache`;键构造见 `embeddings/cache_keys.py`。
  15 +- **Key 形态**(最终 Redis 键 = `前缀` + `可选 namespace` + `逻辑键`):
  16 + - **前缀**:`infrastructure.redis.embedding_cache_prefix`(默认 `embedding`,可用 `REDIS_EMBEDDING_CACHE_PREFIX` 覆盖)。
  17 + - **命名空间**:`embeddings/server.py` 与客户端中分为:
  18 + - 文本:`namespace=""` → `{prefix}:{embed:norm0|1:...}`
  19 + - 图像:`namespace="image"` → `{prefix}:image:{embed:模型名:txt:norm0|1:...}`
  20 + - CLIP 文本:`namespace="clip_text"` → `{prefix}:clip_text:{embed:模型名:img:norm0|1:...}`
  21 + - 逻辑键段含 `embed:`、`norm0/1`、模型名(多模态)、过长文本/URL 时用 `h:sha256:...` 摘要(见 `cache_keys.py` 注释)。
  22 +- **值格式**:BF16 压缩后的字节(`embeddings/bf16.py`),非 JSON。
  23 +- **TTL**:`infrastructure.redis.cache_expire_days`(默认 **720 天**,`REDIS_CACHE_EXPIRE_DAYS`)。写入用 `SETEX`;**命中时滑动续期**(`EXPIRE` 刷新为同一时长)。
  24 +- **Redis 客户端**:`decode_responses=False`(二进制)。
  25 +
  26 +**主要代码**:`embeddings/server.py`、`embeddings/text_encoder.py`、`embeddings/image_encoder.py`。
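键构造可以用一个小脚本直观感受(仅为示意:`norm` 段的位置与 64 字符的摘要阈值均为本例假设,实际字段顺序与阈值以 `embeddings/cache_keys.py` 为准):

```bash
# 向量缓存键构造示意(假设:文本 lane、namespace 为空、norm=1)
build_embed_key() {
  local prefix="$1" text="$2" logical
  if [ "${#text}" -le 64 ]; then
    logical="embed:norm1:${text}"
  else
    # 过长文本用 sha256 摘要代替原文,避免超长 key
    logical="embed:norm1:h:sha256:$(printf '%s' "$text" | sha256sum | cut -d' ' -f1)"
  fi
  echo "${prefix}:${logical}"
}
build_embed_key embedding "遥控车"
```

图像 / CLIP 文本 lane 只是在 prefix 与逻辑键之间再插入 `image` / `clip_text` 命名空间段。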
  27 +
  28 +---
  29 +
  30 +### 2. 翻译结果缓存(Translation)
  31 +
  32 +- **作用**:按「翻译模型 + 目标语言 + 原文」缓存译文;支持**模型质量分层探测**(高 tier 模型写入的缓存可被同 tier 或更高 tier 的请求命中,见 `translation/settings.py` 中 `translation_cache_probe_models`)。
  33 +- **Key 形态**:`trans:{model}:{target_lang}:{text前4字符}{sha256全文}`(`translation/cache.py` 的 `build_key`)。
  34 +- **值格式**:UTF-8 译文字符串。
  35 +- **TTL**:`services.translation.cache.ttl_seconds`(默认 **62208000 秒 = 720 天**)。若 `sliding_expiration: true`,命中时刷新 TTL。
  36 +- **能力级开关**:各 `capabilities.*.use_cache` 为 `false` 时该后端不落 Redis。
  37 +- **Redis 客户端**:`decode_responses=True`。
  38 +
  39 +**主要代码**:`translation/cache.py`、`translation/service.py`;翻译 HTTP 服务:`api/translator_app.py`(`get_translation_service()` 使用 `lru_cache` 单例,见下文进程内缓存)。
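键与 TTL 可以用下面的草图核对(`{text前4字符}{sha256全文}` 的拼接照抄自上文;模型名、语言码仅为示例,字符截取与编码细节以 `translation/cache.py` 的 `build_key` 为准):

```bash
# 翻译缓存键构造示意:trans:{model}:{target_lang}:{text前4字符}{sha256全文}
build_trans_key() {
  local model="$1" lang="$2" text="$3" digest
  digest=$(printf '%s' "$text" | sha256sum | cut -d' ' -f1)
  echo "trans:${model}:${lang}:${text:0:4}${digest}"
}
build_trans_key nllb200 en "遥控喷雾翻滚多功能车"

# TTL 自检:默认 62208000 秒确实等于 720 天
echo $((720 * 24 * 3600))
```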
  40 +
  41 +---
  42 +
  43 +### 3. 商品内容理解 / Anchors 与语义分析缓存(Indexer)
  44 +
  45 +- **作用**:缓存 LLM 对商品标题等拼出的 **prompt 输入** 所做的分析结果(anchors、语义属性等),避免重复调用大模型。键与 `analysis_kind`、`prompt` 契约版本、`target_lang` 及输入摘要相关。
  46 +- **Key 形态**:`{anchor_cache_prefix}:{analysis_kind}:{prompt_contract_hash[:12]}:{target_lang}:{prompt_input[:4]}{md5}`(`indexer/product_enrich.py` 中 `_make_analysis_cache_key`)。
  47 +- **前缀**:`infrastructure.redis.anchor_cache_prefix`(默认 `product_anchors`,`REDIS_ANCHOR_CACHE_PREFIX`)。
  48 +- **值格式**:JSON 字符串(规范化后的分析结果)。
  49 +- **TTL**:`anchor_cache_expire_days`(默认 **30 天**),以秒写入 `SETEX`(**非滑动**,与向量/翻译不同)。
  50 +- **读逻辑**:无 TTL 刷新;仅校验内容是否「有意义」再返回。
  51 +
  52 +**主要代码**:`indexer/product_enrich.py`;与 HTTP 侧对齐说明见 `api/routes/indexer.py` 注释。
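同样可以草拟键的拼接(契约 hash 截 12 位、输入取前 4 字符加 md5,与上文格式对应;契约 hash 的具体算法此处假设为 sha256,规范化逻辑以 `_make_analysis_cache_key` 为准):

```bash
# anchors 分析缓存键示意:{prefix}:{kind}:{contract_hash[:12]}:{lang}:{input前4字符}{md5}
build_anchor_key() {
  local prefix="$1" kind="$2" contract="$3" lang="$4" input="$5" chash imd5
  chash=$(printf '%s' "$contract" | sha256sum | cut -c1-12)   # 假设:契约 hash 为 sha256 截断
  imd5=$(printf '%s' "$input" | md5sum | cut -d' ' -f1)
  echo "${prefix}:${kind}:${chash}:${lang}:${input:0:4}${imd5}"
}
build_anchor_key product_anchors anchors v3-prompt zh "遥控车玩具"
```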
  53 +
  54 +---
  55 +
  56 +## 二、进程内缓存(非共享、随进程重启失效)
  57 +
  58 +| 名称 | 用途 | 范围/生命周期 |
  59 +|------|------|----------------|
  60 +| **`get_app_config()`** | 解析并缓存全局 `AppConfig` | `config/loader.py`:`@lru_cache(maxsize=1)`;`reload_app_config()` 可 `cache_clear()` |
  61 +| **`TranslationService` 单例** | 翻译服务进程内复用后端与 Redis 客户端 | `api/translator_app.py`:`get_translation_service()` |
  62 +| **`_nllb_tokenizer_code_by_normalized_key`** | NLLB tokenizer 语言码映射 | `translation/languages.py`:`@lru_cache(maxsize=1)` |
  63 +| **`QueryTextAnalysisCache`** | 单次查询解析内复用分词、tokenizer 结果 | `query/tokenization.py`,随 `QueryParser` 一次 parse |
  64 +| **`_SelectionContext`(SKU 意图)** | 归一化文本、分词、匹配布尔等小字典 | `search/sku_intent_selector.py`,单次选择流程 |
  65 +| **`incremental_service` transformer 缓存** | 按 `tenant_id` 缓存文档转换器 | `indexer/incremental_service.py`,**无界**、多租户进程长期存活时需注意内存 |
  66 +| **NLLB batch 内 `token_count_cache`** | 同一 batch 内避免重复计 token | `translation/backends/local_ctranslate2.py` |
  67 +| **CLIP 分词器 `@lru_cache`**(第三方) | 简单 tokenizer 缓存 | `third-party/clip-as-service/.../simple_tokenizer.py` |
  68 +
  69 +**说明**:`utils/cache.py` 中的 **`DictCache`**(文件 JSON:默认 `.cache/dict_cache.json`)已导出,但仓库内**无直接 `DictCache(` 调用**,视为可复用工具/预留,非当前主路径。
  70 +
  71 +---
  72 +
  73 +## 三、磁盘与模型相关「缓存」(非 Redis)
  74 +
  75 +| 名称 | 用途 | 配置/位置 |
  76 +|------|------|-----------|
  77 +| **Hugging Face / 本地模型目录** | 重排器、翻译本地模型等权重下载与缓存 | `services.rerank.backends.*.cache_dir` 等,常见默认 **`./model_cache`**(`config/config.yaml`) |
  78 +| **vLLM `enable_prefix_caching`** | 重排服务内 **Prefix KV 缓存**(加速同前缀批推理) | `services.rerank.backends.qwen3_vllm*`、`reranker/backends/qwen3_vllm*.py` |
  79 +| **运行时目录** | 重排服务状态/引擎文件 | `services.rerank.instances.*.runtime_dir`(如 `./.runtime/reranker/...`) |
  80 +
  81 +翻译能力里的 **`use_cache: true`**(如 NLLB、Marian)在多数后端指 **推理时的 KV cache(Transformer)**,与 Redis 译文缓存是不同层次;Redis 译文缓存仍由 `TranslationCache` 控制。
  82 +
  83 +---
  84 +
  85 +## 四、Elasticsearch 内部缓存
  86 +
  87 +索引设置中的 `refresh_interval` 等影响近实时可见性,但**不属于应用层键值缓存**。若需调优 ES 查询缓存、节点堆等,见运维文档与集群配置,此处不展开。
  88 +
  89 +---
  90 +
  91 +## 五、运维与巡检脚本(Redis)
  92 +
  93 +| 脚本 | 作用 |
  94 +|------|------|
  95 +| `scripts/redis/redis_cache_health_check.py` | 按 **embedding / translation / anchors** 三类前缀巡检:key 数量估算、TTL 采样、`IDLETIME` 等 |
  96 +| `scripts/redis/redis_cache_prefix_stats.py` | 按前缀统计 key 数量与 **MEMORY USAGE**(可多 DB) |
  97 +| `scripts/redis/redis_memory_heavy_keys.py` | 扫描占用内存最大的 key,辅助排查「统计与总内存不一致」 |
  98 +| `scripts/redis/monitor_eviction.py` | 实时监控 **eviction** 相关事件,用于容量与驱逐策略排查 |
  99 +
  100 +使用前需加载项目配置(如 `source activate.sh`)以保证 `REDIS_CONFIG` 与生产一致。脚本注释中给出了 **`redis-cli` 手工统计**示例(按前缀 `wc -l`、`MEMORY STATS` 等)。
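注释中提到的按前缀手工统计,核心就是对 key 列表按首段分组计数;下面用内联样本代替 `redis-cli --scan` 的输出做个示意(真实使用时把样本替换为 `redis-cli --scan --pattern '*'` 的输出):

```bash
# 按 key 首段(第一个冒号之前)分组计数
count_by_prefix() {
  awk -F: '{c[$1]++} END {for (p in c) printf "%s %d\n", p, c[p]}' | sort
}

# 内联样本,模拟 redis-cli --scan 的输出
sample_keys='embedding:embed:norm1:abc
embedding:image:embed:m:img:norm1:xyz
trans:nllb200:en:demo1234
product_anchors:anchors:hash:zh:abcd'

printf '%s\n' "$sample_keys" | count_by_prefix
```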
  101 +
  102 +---
  103 +
  104 +## 六、总表(Redis 与各层缓存)
  105 +
  106 +| 缓存名称 | 业务模块 | 存储 | Key 前缀 / 命名模式 | 过期时间 | 过期策略 | 值摘要 | 配置键 / 环境变量 |
  107 +|----------|----------|------|---------------------|----------|----------|--------|-------------------|
  108 +| 文本向量 | 检索 / 索引 / Embedding 服务 | Redis db≈0 | `{embedding_cache_prefix}:*`(逻辑键以 `embed:norm…` 开头) | `cache_expire_days`(默认 720 天) | 写入 TTL + 命中滑动续期 | BF16 字节向量 | `infrastructure.redis.*`;`REDIS_EMBEDDING_CACHE_PREFIX`、`REDIS_CACHE_EXPIRE_DAYS` |
  109 +| 图像向量(CLIP 图) | 图搜 / 多模态 | 同上 | `{prefix}:image:*` | 同上 | 同上 | BF16 字节 | 同上 |
  110 +| CLIP 文本塔向量 | 图搜文本侧 | 同上 | `{prefix}:clip_text:*` | 同上 | 同上 | BF16 字节 | 同上 |
  111 +| 翻译译文 | 查询翻译、翻译服务 | 同上 | `trans:{model}:{lang}:*` | `services.translation.cache.ttl_seconds`(默认 720 天) | 可配置滑动(`sliding_expiration`) | UTF-8 字符串 | `services.translation.cache.*`;各能力 `use_cache` |
  112 +| 商品分析 / Anchors | 索引富化、LLM 内容理解 | 同上 | `{anchor_cache_prefix}:{kind}:{hash}:{lang}:*` | `anchor_cache_expire_days`(默认 30 天) | 固定 TTL,不滑动 | JSON 字符串 | `anchor_cache_prefix`、`anchor_cache_expire_days`;`REDIS_ANCHOR_*` |
  113 +| 应用配置 | 全栈 | 进程内存 | N/A(单例) | 进程生命周期 | `reload_app_config` 清除 | `AppConfig` 对象 | `config/loader.py` |
  114 +| 翻译服务实例 | 翻译 API | 进程内存 | N/A | 进程生命周期 | 单例 | `TranslationService` | `api/translator_app.py` |
  115 +| 查询分词缓存 | 查询解析 | 单次请求内 | N/A | 单次 parse | — | 分词与中间结果 | `query/tokenization.py` |
  116 +| SKU 意图辅助字典 | 搜索排序辅助 | 单次请求内 | N/A | 单次选择 | — | 小 dict | `search/sku_intent_selector.py` |
  117 +| 增量索引 Transformer | 索引管道 | 进程内存 | `tenant_id` 字符串键 | 长期(无界) | 无自动淘汰 | Transformer 元组 | `indexer/incremental_service.py` |
  118 +| 重排 / 翻译模型权重 | 推理服务 | 本地磁盘 | 目录路径 | 无自动删除(人工清理) | — | 模型文件 | `cache_dir: ./model_cache` 等 |
  119 +| vLLM Prefix 缓存 | 重排(Qwen3 等) | GPU/引擎内 | 引擎内部 | 引擎管理 | — | KV Cache | `enable_prefix_caching` |
  120 +| 文件 Dict 缓存(可选) | 通用 | `.cache/dict_cache.json` | 分类 + 自定义 key | 持久直至删除 | — | JSON 可序列化值 | `utils/cache.py`(当前无调用方) |
  121 +
  122 +---
  123 +
  124 +## 七、维护建议(简要)
  125 +
  126 +1. **容量**:三类 Redis 缓存(embedding / trans / anchors)可共用同一实例;大租户或图搜多时 **embedding** 与 **trans** 往往占主要内存,可用 `redis_cache_prefix_stats.py` 分前缀观察。
  127 +2. **键迁移**:变更 `embedding_cache_prefix`、CLIP `model_name` 或 prompt 契约会自然**隔离新键空间**;旧键依赖 TTL 或人工批量删除。
  128 +3. **一致性**:向量缓存对异常向量会 **delete key**(`RedisEmbeddingCache.get`);anchors 依赖 `cache_version` 与契约 hash 防止错误复用。
  129 +4. **监控**:除脚本外,Embedding HTTP 服务健康检查会报告各 lane 的 **`cache_enabled`**(`embeddings/server.py`)。
  130 +
  131 +---
  132 +
  133 +*文档随代码扫描生成;若新增 Redis 用途,请同步更新本文件与 `scripts/redis/redis_cache_health_check.py` 中的 `_load_known_cache_types()`。*
... ...
docs/issues/issue-2026-04-08-eval框架主指标ERR的问题以及bm25调参-done-0408.md 0 → 100644
... ... @@ -0,0 +1,120 @@
  1 +1. 目前检索系统评测的主要指标是这几个
  2 + "NDCG@20, NDCG@50, ERR@10, Strong_Precision@10, Strong_Precision@20, "
  3 +参考_err_at_k,计算逻辑好像没问题
  4 +现在的问题是,ERR 指标跟其他几个指标好像经常有相反的趋势。请再分析它是否适合作为主指标之一,目前有什么问题。
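先把 ERR@k 的级联模型算一遍有助于讨论。下面是 shell/awk 草图,按标准公式 R_i = (2^g - 1) / 2^gmax、gmax=3;与仓库 `_err_at_k` 的实现细节可能略有出入:

```bash
err_at_k() {
  # 入参:空格分隔的相关性等级序列(0-3),gmax=3
  echo "$1" | awk '{
    p = 1.0; err = 0.0
    for (i = 1; i <= NF; i++) {
      r = (2 ^ $i - 1) / 8       # R_i = (2^g - 1) / 2^gmax
      err += p * r / i           # 以 1/i 加权的"首次满足"概率
      p *= (1 - r)               # 级联:前面越相关,后面权重衰减越快
    }
    printf "%.6f\n", err
  }'
}
err_at_k "3 0 0"   # 单条满分、其余无关
err_at_k "2 2 2"   # 三条次优
```

可以看到 `3 0 0` 的 ERR(0.875)远高于 `2 2 2`(约 0.541):ERR 高度头部敏感,首位之后的改动对它影响极小,这正是它容易与 NDCG@20/50、Precision 这类覆盖型指标走势相反的机制来源。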
  5 +
  6 +2. 目前bm25参数是:
  7 +"b": 0.1,
  8 +"k1": 0.3
  9 +对应的基线是 /data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260408T055948Z_00b6a8aa3d.md(Primary_Metric_Score: 0.604555)
  10 +
  11 +(比之前 b 和 k1 都设置为 0 好了很多;都设置为 0 时的基线:/data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260407T150946Z_00b6a8aa3d.md,Primary_Metric_Score: 0.602598)
  17 +
  18 +这两个参数从0改为0.1/0.3的背景是:
  19 +This change adjusts the BM25 parameters used by the combined query.
  20 +
  21 +Previously, both `b` and `k1` were set to `0.0`. The original intention was to avoid two common issues in e-commerce search relevance:
  22 +
  23 +1. Over-penalizing longer product titles
  24 + In product search, a shorter title should not automatically rank higher just because BM25 favors shorter fields. For example, for a query like “遥控车”, a product whose title is simply “遥控车” is not necessarily a better candidate than a product with a slightly longer but more descriptive title. In practice, extremely short titles may even indicate lower-quality catalog data.
  25 +
  26 +2. Over-rewarding repeated occurrences of the same term
  27 + For longer queries such as “遥控喷雾翻滚多功能车玩具车”, the default BM25 behavior may give too much weight to a term that appears multiple times (for example “遥控”), even when other important query terms such as “喷雾” or “翻滚” are missing. This can cause products with repeated partial matches to outrank products that actually cover more of the user intent.
  28 +
  29 +Setting both parameters to zero was an intentional way to suppress length normalization and term-frequency amplification. However, after introducing a `combined_fields` query, this configuration becomes too aggressive. Since `combined_fields` scores multiple fields as a unified relevance signal, completely disabling both effects may also remove useful ranking information, especially when we still want documents matching more query terms across fields to be distinguishable from weaker matches.
  30 +
  31 +This update therefore relaxes the previous setting and reintroduces a controlled amount of BM25 normalization/scoring behavior. The goal is to keep the original intent — avoiding short-title bias and excessive repeated-term gain — while allowing the combined query to better preserve meaningful relevance differences across candidates.
  32 +
  33 +Expected effect:
  34 +- reduce the bias toward unnaturally short product titles
  35 +- limit score inflation caused by repeated occurrences of the same term
  36 +- improve ranking stability for `combined_fields` queries
  37 +- better reward candidates that cover more of the overall query intent, instead of those that only repeat a subset of terms
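上面 b/k1 的取舍可以用单词项 BM25 得分直接验证;以下 awk 草图省略 idf 因子,只看 tf 饱和与长度归一(参数组合仅为示意):

```bash
bm25_term() {
  # 入参:tf dl avgdl k1 b;输出单词项 BM25 得分(不含 idf)
  awk -v tf="$1" -v dl="$2" -v avgdl="$3" -v k1="$4" -v b="$5" 'BEGIN {
    norm = 1 - b + b * dl / avgdl                 # 长度归一因子
    printf "%.4f\n", tf * (k1 + 1) / (tf + k1 * norm)
  }'
}
bm25_term 1 20 20 0   0     # k1=0:恒为 1,tf 完全不起作用
bm25_term 5 20 20 0   0     # 仍为 1
bm25_term 5 20 20 0.3 0.1   # k1=0.3:tf=5 只比 tf=1 高约 23%(饱和)
bm25_term 5 20 20 1.0 0.5   # k1=1.0:重复词项奖励更明显,但上界仍是 k1+1
```

k1=0 正对应旧配置"完全压制词频";放大 k1 后重复词项得分提升但始终有上界 k1+1,这就是本次想要的"受控的词频奖励"。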
  38 +
  39 +
  40 +因为实验有效,因此帮我继续进行实验
  41 +
  42 +请帮我再进行这四轮实验,对比效果,优化bm25参数:
  43 +{ "b": 0.10, "k1": 0.30 }
  44 +{ "b": 0.20, "k1": 0.60 }
  45 +{ "b": 0.50, "k1": 1.0 }
  46 +{ "b": 0.10, "k1": 0.75 }
  47 +
  48 +参考修改索引级设置的方法:( BM25 `similarity.default`)
  49 +
  50 +`mappings/search_products.json` 里的 `settings.similarity` 只在**创建新索引**时生效;**已有索引**需先关闭索引,再 `PUT _settings`,最后重新打开。
  51 +
  52 +**适用场景**:调整默认 BM25 的 `b`、`k1`(例如与仓库映射对齐:`b: 0.1`、`k1: 0.3`)。
  53 +
  54 +```bash
  55 +# 按需替换:索引名、账号密码、ES 地址
  56 +INDEX="search_products_tenant_163"
  57 +AUTH='saas:4hOaLaf41y2VuI8y'
  58 +ES="http://localhost:9200"
  59 +
  60 +# 1) 关闭索引(写入类请求会失败,注意维护窗口)
  61 +curl -s -u "$AUTH" -X POST "$ES/${INDEX}/_close"
  62 +
  63 +# 2) 更新设置(仅示例:与 mappings 中 default 一致时可照抄)
  64 +curl -s -u "$AUTH" -X PUT "$ES/${INDEX}/_settings" \
  65 + -H 'Content-Type: application/json' \
  66 + -d '{
  67 + "index": {
  68 + "similarity": {
  69 + "default": {
  70 + "type": "BM25",
  71 + "b": 0.1,
  72 + "k1": 0.3
  73 + }
  74 + }
  75 + }
  76 +}'
  77 +
  78 +# 3) 重新打开索引
  79 +curl -s -u "$AUTH" -X POST "$ES/${INDEX}/_open"
  80 +```
  81 +
  82 +**检查是否生效**:
  83 +
  84 +```bash
  85 +curl -s -u "$AUTH" -X GET "$ES/${INDEX}/_settings?filter_path=**.similarity&pretty"
  86 +```
  87 +
  88 +期望在响应中看到 `similarity.default` 的 `type`、`b`、`k1`(API 可能将数值以字符串形式返回,属正常)。
  89 +
  90 +**多租户批量**:先列出索引,再对每个 `search_products_tenant_*` 重复上述 close → settings → open。
  91 +
  92 +```bash
  93 +curl -s -u "$AUTH" -X GET "$ES/_cat/indices/search_products_tenant_*?h=index&v"
  94 +```
  95 +
  96 +
  97 +
  98 +实验结果:
  99 +已完成四轮实验,并把 `search_products_tenant_163` 当前索引参数留在了最优组:`b=0.5, k1=1.0`。刚刚已校验生效。
  100 +
  101 +对比结果如下:
  102 +
  103 +- 基线当前线上组 `b=0.10, k1=0.30`:`0.604555`
  104 + 报告:[batch_20260408T055948Z_00b6a8aa3d.md](/data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260408T055948Z_00b6a8aa3d.md)
  105 +- 复跑 `b=0.10, k1=0.30`:`0.604298`
  106 + 报告:[batch_20260408T064011Z_00b6a8aa3d.json](/data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260408T064011Z_00b6a8aa3d.json)
  107 +- 实验 `b=0.20, k1=0.60`:`0.604274`
  108 + 报告:[batch_20260408T064123Z_00b6a8aa3d.json](/data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260408T064123Z_00b6a8aa3d.json)
  109 +- 实验 `b=0.50, k1=1.0`:`0.606041`
  110 + 报告:[batch_20260408T064216Z_00b6a8aa3d.json](/data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260408T064216Z_00b6a8aa3d.json)
  111 +- 实验 `b=0.10, k1=0.75`:`0.603438`
  112 + 报告:[batch_20260408T064304Z_00b6a8aa3d.json](/data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260408T064304Z_00b6a8aa3d.json)
  113 +- 老基线 `b=0.0, k1=0.0`:`0.602598`
  114 + 报告:[batch_20260407T150946Z_00b6a8aa3d.md](/data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260407T150946Z_00b6a8aa3d.md)
  115 +
  116 +结论是这四组里 `b=0.50, k1=1.0` 最好,相比你给的当前基线 `0.604555` 提升了 `+0.001486`,相比 `0/0` 提升了 `+0.003443`。`0.20/0.60` 基本和 `0.10/0.30` 持平,`0.10/0.75` 反而更差,说明这轮里单独加大 `k1` 不如同时把 `b` 和 `k1` 一起放宽。
  117 +
  118 +补充一点:我只改了现有索引 `search_products_tenant_163` 的 index-level setting,没有改仓库里的默认映射。[search_products.json](/data/saas-search/mappings/search_products.json) 现在仍是 `0.1/0.3`。如果你要,我下一步可以直接帮你把映射默认值也改成 `0.5/1.0`,并顺手批量同步其他 `search_products_tenant_*` 索引。
  119 +
  120 +
... ...
docs/issues/issue-2026-04-12-test-env.md 0 → 100644
... ... @@ -0,0 +1,43 @@
  1 +120.76.41.98 端口22 用户名和密码:
  2 +tw twtw@123 (有sudo权限)
  3 +这台机器上的目录/home/tw/saas-search 已经部署了本项目
  4 +请帮我运行项目
  5 +1. 帮我checkout一个test环境的分支,这个分支,把重排、翻译模型 都关闭掉,因为这台机gpu显存较小(embedding模型可以保留)
  6 +2. 在这个分支,把服务都启动起来
  7 +3. 使用docker,安装一个ES,参考本项目的文档 ES9*.md。因为这台机器已经有一个系统的elasticsearch,为了不相互干扰,将本项目依赖的es9安装到docker,并且在测试环境配置的es地址做适配的工作
  8 +
  9 +
  10 +1. 不是要禁用6005,而是6005端口已经有对应的文本服务了,直接用就行
  11 +2. 6005其实就是本项目的一个历史早期版本启动起来的,在另外一个目录:/home/tw/SearchEngine,请看他的启动配置
  12 +nohup bash scripts/start_embedding_service.sh > log.start_embedding_service.0412 2>&1 &
  13 +是这样启动起来的
  14 +看它配置的文本服务用的是哪套方案、哪个模型,跟它对齐(我指的是当前的测试分支)
  15 +
  22 +我在这个机器上部署了一个测试环境:
  23 +120.76.41.98 端口22 用户名和密码:
  24 +tw twtw@123 (有sudo权限)
  25 +cd /home/tw/saas-search
  26 +$ git branch
  27 +  master
  28 +* test/small-gpu-es9
  29 +
  30 +我希望差异只是:
  31 +1. es 配置不同(测试环境要连接到那台机器上 docker 里的 es,端口 19200)、redis 配置不同
  32 +2. reranker关闭、不要启动reranker服务
  33 +
  34 +其余没什么不同。
  35 +
  36 +但是启动有问题,现在翻译报错。
  37 +这体现了当前项目移植性比较差。我希望你检查一下失败原因,先在本地(本机当前目录的 master 分支)优化、提升移植性,然后同步到那边,保持测试分支跟 master 只有少量的、配置层面的不同;之后到测试机器把翻译启动起来,最后把整个服务都启动起来。
  38 +
... ...
docs/issues/issue-2026-04-14-粗排流程放入ES-TODO-env 0 → 100644
... ... @@ -0,0 +1,25 @@
  1 +需求:
  2 +目前 160 条结果(rerank_window: 160)会进入重排;重排中,文本和图片向量的相关性都会作为融合公式的因子之一(粗排和 reranker 都有):
  3 +knn_score
  4 +text_knn
  5 +image_knn
  6 +text_factor
  7 +knn_factor
  8 +但是文本向量召回和图片向量召回走的是 KNN 索引召回,并不是所有结果都带这两个得分,两项得分都存在为 0 的情况。
  9 +为了解决这个问题,一个思路是:对最终能进入重排的 160 条,找出其中分别缺失文本/图片向量召回得分的文档,再通过某种方式让 ES 去算;或者从 ES 把向量拉回来自己算;或者在召回请求 ES 时通过某种设定,确保前面的若干条都带上这两个分数。不知道还有哪些方法,我感觉这些方法都不太好,请你思考一下。
  10 +
  11 +考虑的一个方案:
  12 +想在“第一次 ES 搜索”里,只对 topN 补向量精算,考虑 rescore 或 retriever.rescorer的方案(官方明确支持多段 rescore/支持 score_mode: multiply,甚至示例里就有 function_score/script_score 放进 rescore 的写法。)
  13 +这意味着你完全可以:
  14 +初检仍然用现在的 lexical + text knn + image knn 召回候选
  15 +对 window_size=160 做 rescore
  16 +用 exact script_score 给 top160 补 text/image vector 分
  17 +顺手把你现在本地 coarse 融合迁回 ES
  18 +
  19 +export ES_AUTH="saas:4hOaLaf41y2VuI8y"
  20 +export ES="http://127.0.0.1:9200"
  21 +"index":"search_products_tenant_163"
  22 +
  23 +有个细节暴露出来了:dotProduct() 这类向量函数在 script_score 评分上下文能用,但在 script_fields 取字段上下文里不认。所以如果我们要把 exact 分顺手回传给 rerank,用 script_fields 的话得自己写数组循环,不能直接调向量内建函数。
  24 +
  25 +重排打分公式需要的base_query base_query_trans_zh knn_query image_knn_query还能不能拿到?请你考虑,尽量想想如何得到这些打分,如果实在拿不到去想替代的办法比如简化打分公式。
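按上面的思路,一个可能的请求形态如下(仅为草图:字段名 `text_vector`、查询词与占位向量 `params.qv` 都是假设,实际字段名与融合公式的迁移需另行对齐):

```bash
# 首轮检索 + rescore:对 window_size=160 用 script_score 补精确向量分
RESCORE_BODY='{
  "size": 160,
  "query": { "match": { "title": "遥控车" } },
  "rescore": {
    "window_size": 160,
    "query": {
      "score_mode": "total",
      "query_weight": 1.0,
      "rescore_query_weight": 1.0,
      "rescore_query": {
        "script_score": {
          "query": { "match_all": {} },
          "script": {
            "source": "dotProduct(params.qv, '\''text_vector'\'') + 1.0",
            "params": { "qv": [0.1, 0.2, 0.3] }
          }
        }
      }
    }
  }
}'
# 结构自检;真实调用示例:
#   curl -s -u "$ES_AUTH" "$ES/search_products_tenant_163/_search" \
#     -H 'Content-Type: application/json' -d "$RESCORE_BODY"
printf '%s' "$RESCORE_BODY" | python3 -m json.tool >/dev/null && echo "json ok"
```

`score_mode: total` 配合 `query_weight`/`rescore_query_weight` 可近似加权求和;若要乘法融合可改为 `multiply`。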
... ...
docs/工作总结-微服务性能优化与架构.md
... ... @@ -98,7 +98,7 @@ instruction: &quot;Given a shopping query, rank product titles by relevance&quot;
98 98 **能力**:支持根据商品标题批量生成 **qanchors**(锚文本)、**enriched_attributes**、**tags**,供索引与 suggest 使用。
99 99  
100 100 **具体内容**:
101   -- **接口**:`POST /indexer/enrich-content`(Indexer 服务端口 **6004**)。请求体为 `items` 数组,每项含 `spu_id`、`title`(必填)及可选多语言标题等;单次请求最多 **50 条**,建议批量调用。响应 `results` 与 `items` 一一对应,每项含 `spu_id`、`qanchors`(按语言键,如 `qanchors.zh`、`qanchors.en`,逗号分隔短语)、`enriched_attributes`、`tags`。
  101 +- **接口**:`POST /indexer/enrich-content`(FacetAwareMatching 服务端口 **6001**)。请求体为 `items` 数组,每项含 `spu_id`、`title`(必填)及可选多语言标题等;单次请求最多 **50 条**,建议批量调用。响应 `results` 与 `items` 一一对应,每项含 `spu_id`、`qanchors`(按语言键,如 `qanchors.zh`、`qanchors.en`,逗号分隔短语)、`enriched_attributes`、`tags`。
102 102 - **索引侧**:微服务组合方式下,调用方先拿不含 qanchors/tags 的 doc,再调用本接口补齐后写入 ES 的 `qanchors.{lang}` 等字段;索引 transformer(`indexer/document_transformer.py`、`indexer/product_enrich.py`)内也可在构建 doc 时调用内容理解逻辑,写入 `qanchors.{lang}`。
103 103 - **Suggest 侧**:`suggestion/builder.py` 从 ES 商品索引读取 `_source: ["id", "spu_id", "title", "qanchors"]`,对 `qanchors.{lang}` 用 `_split_qanchors` 拆成词条,以 `source="qanchor"` 加入候选,排序时 `qanchor` 权重大于纯 title(`add_product("qanchor", ...)`);suggest 配置中 `sources: ["query_log", "qanchor"]` 表示候选来源包含 qanchor。
104 104 - **实现与依赖**:内容理解内部使用大模型(需 `DASHSCOPE_API_KEY`),支持多语言与 Redis 缓存(如 `product_anchors`);逻辑与 `indexer/product_enrich` 一致。
... ... @@ -129,12 +129,12 @@ instruction: &quot;Given a shopping query, rank product titles by relevance&quot;
129 129 - 可选:embedding(text) **6005**、embedding-image **6008**、translator **6006**、reranker **6007**、tei **8080**、cnclip **51000**。
130 130 - 端口可由环境变量覆盖:`API_PORT`、`INDEXER_PORT`、`FRONTEND_PORT`、`EVAL_WEB_PORT`、`EMBEDDING_TEXT_PORT`、`EMBEDDING_IMAGE_PORT`、`TRANSLATION_PORT`、`RERANKER_PORT`、`TEI_PORT`、`CNCLIP_PORT`。
131 131 - **命令**:
132   - - `./scripts/service_ctl.sh start [service...]` 或 `up all` / `start all`(all 含 tei、cnclip、embedding、embedding-image、translator、reranker、reranker-fine、backend、indexer、frontend、eval-web,按依赖顺序);`stop`、`restart`、`down` 同参数;`status` 默认列出所有服务。
  132 + - `./scripts/service_ctl.sh start [service...]` 或 `up all` / `start all`(all 含 tei、cnclip、embedding、embedding-image、translator、reranker、backend、indexer、frontend、eval-web,按依赖顺序);`stop`、`restart`、`down` 同参数;`status` 默认列出所有服务。
133 133 - 启动时:backend/indexer/frontend/embedding/translator/reranker 会写 pid 到 `logs/<service>.pid`,并执行 `wait_for_health`(GET `http://127.0.0.1:<port>/health`);reranker 健康重试 90 次,其余 30 次;TEI 校验 Docker 容器存在且 `/health` 成功;cnclip 无 HTTP 健康则仅校验进程/端口。
134 134 - **监控常驻**:
135 135 - `./scripts/service_ctl.sh monitor-start <targets>` 启动后台监控进程,将 targets 写入 `logs/service-monitor.targets`,pid 写入 `logs/service-monitor.pid`,日志追加到 `logs/service-monitor.log`。
136   - - 轮询间隔 `MONITOR_INTERVAL_SEC` 默认 **10** 秒;连续 **3** 次(`MONITOR_FAIL_THRESHOLD`)健康失败则触发重启;重启冷却 `MONITOR_RESTART_COOLDOWN_SEC` 默认 **30** 秒;每小时最多重启 `MONITOR_MAX_RESTARTS_PER_HOUR` 默认 **6** 次;超限时调用 `scripts/wechat_alert.py` 告警(若存在)。
137   -- **日志**:各服务按日滚动到 `logs/<service>-<date>.log`,通过 `scripts/daily_log_router.sh` 与 `LOG_RETENTION_DAYS`(默认 30)控制保留。
  136 + - 轮询间隔 `MONITOR_INTERVAL_SEC` 默认 **10** 秒;连续 **3** 次(`MONITOR_FAIL_THRESHOLD`)健康失败则触发重启;重启冷却 `MONITOR_RESTART_COOLDOWN_SEC` 默认 **30** 秒;每小时最多重启 `MONITOR_MAX_RESTARTS_PER_HOUR` 默认 **6** 次;超限时调用 `scripts/ops/wechat_alert.py` 告警(若存在)。
  137 +- **日志**:各服务按日滚动到 `logs/<service>-<date>.log`,通过 `scripts/ops/daily_log_router.sh` 与 `LOG_RETENTION_DAYS`(默认 30)控制保留。
138 138  
139 139 详见:`scripts/service_ctl.sh` 内注释及 `docs/Usage-Guide.md`。
140 140  
... ... @@ -153,12 +153,12 @@ instruction: &quot;Given a shopping query, rank product titles by relevance&quot;
153 153  
154 154 ## 三、性能测试报告摘要
155 155  
156   -以下数据来自 `docs/性能测试报告.md`,测试时间 **2026-03-12**,环境:**8 vCPU**(Intel Xeon Platinum 8255C @ 2.50GHz)、**约 15Gi 可用内存**;租户 **162** 文档数约 **53**(search/search/suggestions/rerank 与文档规模相关)。压测工具:`scripts/perf_api_benchmark.py`,场景×并发矩阵,每档 **20s**。
  156 +以下数据来自 `docs/性能测试报告.md`,测试时间 **2026-03-12**,环境:**8 vCPU**(Intel Xeon Platinum 8255C @ 2.50GHz)、**约 15Gi 可用内存**;租户 **162** 文档数约 **53**(search/search/suggestions/rerank 与文档规模相关)。压测工具:`benchmarks/perf_api_benchmark.py`,场景×并发矩阵,每档 **20s**。
157 157  
158 158 **复现命令(四场景×四并发)**:
159 159 ```bash
160 160 cd /data/saas-search
161   -.venv/bin/python scripts/perf_api_benchmark.py \
  161 +.venv/bin/python benchmarks/perf_api_benchmark.py \
162 162 --scenario backend_search,backend_suggest,embed_text,rerank \
163 163 --concurrency-list 1,5,10,20 \
164 164 --duration 20 \
... ... @@ -188,7 +188,7 @@ cd /data/saas-search
188 188  
189 189 口径:query 固定 `wireless mouse`,每次请求 **386 docs**,句长 15–40 词随机(从 1000 词池采样);配置 `rerank_window=384`。复现命令:
190 190 ```bash
191   -.venv/bin/python scripts/perf_api_benchmark.py \
  191 +.venv/bin/python benchmarks/perf_api_benchmark.py \
192 192 --scenario rerank --duration 20 --concurrency-list 1,5,10,20 --timeout 60 \
193 193 --rerank-dynamic-docs --rerank-doc-count 386 --rerank-vocab-size 1000 \
194 194 --rerank-sentence-min-words 15 --rerank-sentence-max-words 40 \
... ... @@ -217,7 +217,7 @@ cd /data/saas-search
217 217 | 10 | 181 | 100% | 8.78 | 1129.23| 1295.88| 1330.96|
218 218 | 20 | 161 | 100% | 7.63 | 2594.00| 4706.44| 4783.05|
219 219  
220   -**结论**:吞吐约 **8 rps** 平台化,延迟随并发上升明显,符合“检索 + 向量 + 重排”重链路特征。多租户补测(文档数 500–10000,见报告 §12)表明:文档数越大,RPS 下降、延迟升高;tenant 0(10000 doc)在并发 20 出现部分 ReadTimeout(成功率 59.02%),需注意 timeout 与容量规划;补测命令示例:`for t in 0 1 2 3 4; do .venv/bin/python scripts/perf_api_benchmark.py --scenario backend_search --concurrency-list 1,5,10,20 --duration 20 --tenant-id $t --output perf_reports/2026-03-12/search_tenant_matrix/tenant_${t}.json; done`。
  220 +**结论**:吞吐约 **8 rps** 平台化,延迟随并发上升明显,符合“检索 + 向量 + 重排”重链路特征。多租户补测(文档数 500–10000,见报告 §12)表明:文档数越大,RPS 下降、延迟升高;tenant 0(10000 doc)在并发 20 出现部分 ReadTimeout(成功率 59.02%),需注意 timeout 与容量规划;补测命令示例:`for t in 0 1 2 3 4; do .venv/bin/python benchmarks/perf_api_benchmark.py --scenario backend_search --concurrency-list 1,5,10,20 --duration 20 --tenant-id $t --output perf_reports/2026-03-12/search_tenant_matrix/tenant_${t}.json; done`。
221 221  
222 222 ---
223 223  
... ... @@ -247,5 +247,5 @@ cd /data/saas-search
247 247  
248 248 **关键文件与复现**:
249 249 - 配置:`config/config.yaml`(services、rerank、query_config)、`.env`(端口与 API Key)。
250   -- 脚本:`scripts/service_ctl.sh`(启停与监控)、`scripts/perf_api_benchmark.py`(压测)、`scripts/build_suggestions.sh`(suggest 构建)。
  250 +- 脚本:`scripts/service_ctl.sh`(启停与监控)、`benchmarks/perf_api_benchmark.py`(压测)、`scripts/build_suggestions.sh`(suggest 构建)。
251 251 - 完整步骤与多租户/rerank 对比见:`docs/性能测试报告.md`。
... ...
docs/常用查询 - ES.md
1   -
2   -
3 1 ## Elasticsearch 排查流程
4 2  
  3 +使用前加载环境变量:
  4 +```bash
  5 +set -a; source .env; set +a
  6 +# 或直接 export
  7 +export ES_AUTH="saas:4hOaLaf41y2VuI8y"
  8 +export ES="http://127.0.0.1:9200"
  9 +```
  10 +
5 11 ### 1. 集群健康状态
6 12  
7 13 ```bash
8 14 # 集群整体健康(green / yellow / red)
9   -curl -s -u 'saas:4hOaLaf41y2VuI8y' 'http://127.0.0.1:9200/_cluster/health?pretty'
  15 +curl -s -u "$ES_AUTH" 'http://127.0.0.1:9200/_cluster/health?pretty'
10 16 ```
11 17  
12 18 ### 2. 索引概览
13 19  
14 20 ```bash
15 21 # 查看所有租户索引状态与体积
16   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/_cat/indices/search_products_tenant_*?v'
  22 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/_cat/indices/search_products_tenant_*?v'
17 23  
18 24 # 或查看全部索引
19   -curl -s -u 'saas:4hOaLaf41y2VuI8y' 'http://127.0.0.1:9200/_cat/indices?v'
  25 +curl -s -u "$ES_AUTH" 'http://127.0.0.1:9200/_cat/indices?v'
20 26 ```
21 27  
22 28 ### 3. 分片分布
23 29  
24 30 ```bash
25 31 # 查看分片在各节点的分布情况
26   -curl -s -u 'saas:4hOaLaf41y2VuI8y' 'http://127.0.0.1:9200/_cat/shards?v'
  32 +curl -s -u "$ES_AUTH" 'http://127.0.0.1:9200/_cat/shards?v'
27 33 ```
28 34  
29 35 ### 4. 分配诊断(如有异常)
30 36  
31 37 ```bash
32 38 # 当 health 非 green 或 shards 状态异常时,定位具体原因
33   -curl -s -u 'saas:4hOaLaf41y2VuI8y' -X POST 'http://127.0.0.1:9200/_cluster/allocation/explain?pretty' \
  39 +curl -s -u "$ES_AUTH" -X POST 'http://127.0.0.1:9200/_cluster/allocation/explain?pretty' \
34 40 -H 'Content-Type: application/json' \
35 41 -d '{"index":"search_products_tenant_163","shard":0,"primary":true}'
36 42 ```
... ... @@ -60,6 +66,54 @@ cat /etc/elasticsearch/elasticsearch.yml
60 66 journalctl -u elasticsearch -f
61 67 ```
62 68  
  69 +### 7. 修改索引级设置(如 BM25 `similarity.default`)
  70 +
  71 +`mappings/search_products.json` 里的 `settings.similarity` 只在**创建新索引**时生效;**已有索引**需先关闭索引,再 `PUT _settings`,最后重新打开。
  72 +
  73 +**适用场景**:调整默认 BM25 的 `b`、`k1`(例如与仓库映射对齐:`b: 0.1`、`k1: 0.3`)。
  74 +
  75 +```bash
  76 +# 按需替换:索引名、账号密码、ES 地址
  77 +INDEX="search_products_tenant_163"
  78 +AUTH="$ES_AUTH"
  79 +ES="http://localhost:9200"
  80 +
  81 +# 1) 关闭索引(写入类请求会失败,注意维护窗口)
  82 +curl -s -u "$AUTH" -X POST "$ES/${INDEX}/_close"
  83 +
  84 +# 2) 更新设置(仅示例:与 mappings 中 default 一致时可照抄)
  85 +curl -s -u "$AUTH" -X PUT "$ES/${INDEX}/_settings" \
  86 + -H 'Content-Type: application/json' \
  87 + -d '{
  88 + "index": {
  89 + "similarity": {
  90 + "default": {
  91 + "type": "BM25",
  92 + "b": 0.1,
  93 + "k1": 0.3
  94 + }
  95 + }
  96 + }
  97 +}'
  98 +
  99 +# 3) 重新打开索引
  100 +curl -s -u "$AUTH" -X POST "$ES/${INDEX}/_open"
  101 +```
  102 +
  103 +**检查是否生效**:
  104 +
  105 +```bash
  106 +curl -s -u "$AUTH" -X GET "$ES/${INDEX}/_settings?filter_path=**.similarity&pretty"
  107 +```
  108 +
  109 +期望在响应中看到 `similarity.default` 的 `type`、`b`、`k1`(API 可能将数值以字符串形式返回,属正常)。
  110 +
  111 +**多租户批量**:先列出索引,再对每个 `search_products_tenant_*` 重复上述 close → settings → open。
  112 +
  113 +```bash
  114 +curl -s -u "$AUTH" -X GET "$ES/_cat/indices/search_products_tenant_*?h=index&v"
  115 +```
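列出索引后,close → settings → open 可以用一个循环批量执行;以下为示意脚本(索引列表此处内联为样例;`DRY_RUN=1` 时只打印将执行的命令,核对无误后再置 0 真正调用 ES):

```bash
# 批量调整 search_products_tenant_* 的 BM25 参数(示意)
AUTH="${ES_AUTH:-saas:4hOaLaf41y2VuI8y}"
ES="${ES:-http://localhost:9200}"
DRY_RUN="${DRY_RUN:-1}"
SETTINGS='{"index":{"similarity":{"default":{"type":"BM25","b":0.1,"k1":0.3}}}}'

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "DRY: curl $*"; else curl -s -u "$AUTH" "$@"; fi
}

# 实际使用时可改为:INDICES=$(curl -s -u "$AUTH" "$ES/_cat/indices/search_products_tenant_*?h=index")
INDICES="search_products_tenant_163 search_products_tenant_170"

for idx in $INDICES; do
  run -X POST "$ES/$idx/_close"
  run -X PUT "$ES/$idx/_settings" -H 'Content-Type: application/json' -d "$SETTINGS"
  run -X POST "$ES/$idx/_open"
done
```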
  116 +
63 117 ---
64 118  
65 119 ### 快速排查路径
... ... @@ -93,7 +147,7 @@ systemctl / df / 日志 → 系统层验证
93 147  
94 148 #### 查询指定 spu_id 的商品(返回 title)
95 149 ```bash
96   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
  150 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
97 151 "size": 11,
98 152 "_source": ["title"],
99 153 "query": {
... ... @@ -108,7 +162,7 @@ curl -u &#39;saas:4hOaLaf41y2VuI8y&#39; -X GET &#39;http://localhost:9200/search_products_te
108 162  
109 163 #### 查询所有商品(返回 title)
110 164 ```bash
111   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
  165 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
112 166 "size": 100,
113 167 "_source": ["title"],
114 168 "query": {
... ... @@ -119,7 +173,7 @@ curl -u &#39;saas:4hOaLaf41y2VuI8y&#39; -X GET &#39;http://localhost:9200/search_products_te
119 173  
120 174 #### 查询指定 spu_id 的商品(返回 title、keywords、tags)
121 175 ```bash
122   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
  176 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
123 177 "size": 5,
124 178 "_source": ["title", "keywords", "tags"],
125 179 "query": {
... ... @@ -134,7 +188,7 @@ curl -u &#39;saas:4hOaLaf41y2VuI8y&#39; -X GET &#39;http://localhost:9200/search_products_te
134 188  
135 189 #### 组合查询:匹配标题 + 过滤标签
136 190 ```bash
137   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
  191 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
138 192 "size": 1,
139 193 "_source": ["title", "keywords", "tags"],
140 194 "query": {
... ... @@ -158,7 +212,7 @@ curl -u &#39;saas:4hOaLaf41y2VuI8y&#39; -X GET &#39;http://localhost:9200/search_products_te
158 212  
159 213 #### 组合查询:匹配标题 + 过滤租户(冗余示例)
160 214 ```bash
161   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
  215 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
162 216 "size": 1,
163 217 "_source": ["title"],
164 218 "query": {
... ... @@ -186,7 +240,7 @@ curl -u &#39;saas:4hOaLaf41y2VuI8y&#39; -X GET &#39;http://localhost:9200/search_products_te
186 240  
187 241 #### 测试 index_ik 分析器
188 242 ```bash
189   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_analyze' -H 'Content-Type: application/json' -d '{
  243 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_analyze' -H 'Content-Type: application/json' -d '{
190 244 "analyzer": "index_ik",
191 245 "text": "14寸第4代-眼珠实身冰雪公仔带手动大推车,搪胶雪宝宝"
192 246 }'
... ... @@ -194,7 +248,7 @@ curl -u &#39;saas:4hOaLaf41y2VuI8y&#39; -X GET &#39;http://localhost:9200/search_products_te
194 248  
195 249 #### 测试 query_ik 分析器
196 250 ```bash
197   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_analyze' -H 'Content-Type: application/json' -d '{
  251 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_analyze' -H 'Content-Type: application/json' -d '{
198 252 "analyzer": "query_ik",
199 253 "text": "14寸第4代-眼珠实身冰雪公仔带手动大推车,搪胶雪宝宝"
200 254 }'
... ... @@ -206,7 +260,7 @@ curl -u &#39;saas:4hOaLaf41y2VuI8y&#39; -X GET &#39;http://localhost:9200/search_products_te
206 260  
207 261 #### 多字段匹配 + 聚合(category1、color、size、material)
208 262 ```bash
209   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
  263 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
210 264 "size": 1,
211 265 "from": 0,
212 266 "query": {
... ... @@ -316,7 +370,7 @@ GET /search_products_tenant_2/_search
316 370  
317 371 #### 按 spu_id 查询(通用索引)
318 372 ```bash
319   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products/_search?pretty' -H 'Content-Type: application/json' -d '{
  373 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products/_search?pretty' -H 'Content-Type: application/json' -d '{
320 374 "size": 5,
321 375 "query": {
322 376 "bool": {
... ... @@ -333,7 +387,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products/_s
333 387 ### 5. 统计租户总文档数
334 388  
335 389 ```bash
336   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_count?pretty' -H 'Content-Type: application/json' -d '{
  390 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_count?pretty' -H 'Content-Type: application/json' -d '{
337 391 "query": {
338 392 "match_all": {}
339 393 }
... ... @@ -348,7 +402,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
348 402  
349 403 #### 1.1 查询特定租户的商品,显示分面相关字段
350 404 ```bash
351   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  405 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
352 406 "query": {
353 407 "term": { "tenant_id": "162" }
354 408 },
... ... @@ -363,7 +417,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
363 417  
364 418 #### 1.2 验证 category1_name 字段是否有数据
365 419 ```bash
366   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  420 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
367 421 "query": {
368 422 "bool": {
369 423 "filter": [
... ... @@ -378,7 +432,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
378 432  
379 433 #### 1.3 验证 specifications 字段是否有数据
380 434 ```bash
381   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  435 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
382 436 "query": {
383 437 "bool": {
384 438 "filter": [
... ... @@ -397,7 +451,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
397 451  
398 452 #### 2.1 category1_name 分面聚合
399 453 ```bash
400   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  454 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
401 455 "query": { "match_all": {} },
402 456 "size": 0,
403 457 "aggs": {
... ... @@ -410,7 +464,7 @@ curl -u &#39;saas:4hOaLaf41y2VuI8y&#39; -X GET &#39;http://localhost:9200/search_products_te
410 464  
411 465 #### 2.2 specifications.color 分面聚合
412 466 ```bash
413   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  467 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
414 468 "query": { "match_all": {} },
415 469 "size": 0,
416 470 "aggs": {
... ... @@ -431,7 +485,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
431 485  
432 486 #### 2.3 specifications.size 分面聚合
433 487 ```bash
434   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  488 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
435 489 "query": { "match_all": {} },
436 490 "size": 0,
437 491 "aggs": {
... ... @@ -452,7 +506,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
452 506  
453 507 #### 2.4 specifications.material 分面聚合
454 508 ```bash
455   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  509 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
456 510 "query": { "match_all": {} },
457 511 "size": 0,
458 512 "aggs": {
... ... @@ -473,7 +527,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
473 527  
474 528 #### 2.5 综合分面聚合(category + color + size + material)
475 529 ```bash
476   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  530 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
477 531 "query": { "match_all": {} },
478 532 "size": 0,
479 533 "aggs": {
... ... @@ -515,7 +569,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
515 569  
516 570 #### 3.1 查看 specifications 的 name 字段有哪些值
517 571 ```bash
518   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products/_search?pretty' -H 'Content-Type: application/json' -d '{
  572 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products/_search?pretty' -H 'Content-Type: application/json' -d '{
519 573 "query": { "term": { "tenant_id": "162" } },
520 574 "size": 0,
521 575 "aggs": {
... ... @@ -531,7 +585,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products/_s
531 585  
532 586 #### 3.2 查看某个商品的完整 specifications 数据
533 587 ```bash
534   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products/_search?pretty' -H 'Content-Type: application/json' -d '{
  588 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products/_search?pretty' -H 'Content-Type: application/json' -d '{
535 589 "query": {
536 590 "bool": {
537 591 "filter": [
... ... @@ -552,7 +606,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products/_s
552 606 **keyword 精确匹配**(示例词:中文 `法式风格`,英文 `long skirt`)
553 607  
554 608 ```bash
555   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  609 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
556 610 "size": 1,
557 611 "_source": ["spu_id", "title", "enriched_attributes"],
558 612 "query": {
... ... @@ -575,7 +629,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
575 629 **text 全文匹配**(经 `index_ik` / `english` 分词;可与上式对照)
576 630  
577 631 ```bash
578   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  632 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
579 633 "size": 1,
580 634 "_source": ["spu_id", "title", "enriched_attributes"],
581 635 "query": {
... ... @@ -602,7 +656,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
602 656 **keyword 精确匹配**
603 657  
604 658 ```bash
605   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  659 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
606 660 "size": 1,
607 661 "_source": ["spu_id", "title", "option1_values"],
608 662 "query": {
... ... @@ -620,7 +674,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
620 674 **text 全文匹配**
621 675  
622 676 ```bash
623   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  677 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
624 678 "size": 1,
625 679 "_source": ["spu_id", "title", "option1_values"],
626 680 "query": {
... ... @@ -640,7 +694,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
640 694 **keyword 精确匹配**
641 695  
642 696 ```bash
643   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  697 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
644 698 "size": 1,
645 699 "_source": ["spu_id", "title", "enriched_tags"],
646 700 "query": {
... ... @@ -658,7 +712,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
658 712 **text 全文匹配**
659 713  
660 714 ```bash
661   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  715 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
662 716 "size": 1,
663 717 "_source": ["spu_id", "title", "enriched_tags"],
664 718 "query": {
... ... @@ -678,7 +732,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
678 732 > `specifications` 为 **nested**,`value_keyword` 为整词匹配;`value_text.*` 可同时 `term` 子字段或 `match` 主 text。
679 733  
680 734 ```bash
681   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  735 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
682 736 "size": 1,
683 737 "_source": ["spu_id", "title", "specifications"],
684 738 "query": {
... ... @@ -710,7 +764,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
710 764  
711 765 #### 4.1 统计有 category1_name 的文档数量
712 766 ```bash
713   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_count?pretty' -H 'Content-Type: application/json' -d '{
  767 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_count?pretty' -H 'Content-Type: application/json' -d '{
714 768 "query": {
715 769 "bool": {
716 770 "filter": [
... ... @@ -723,7 +777,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
723 777  
724 778 #### 4.2 统计有 specifications 的文档数量
725 779 ```bash
726   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_count?pretty' -H 'Content-Type: application/json' -d '{
  780 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_count?pretty' -H 'Content-Type: application/json' -d '{
727 781 "query": {
728 782 "bool": {
729 783 "filter": [
... ... @@ -740,7 +794,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
740 794  
741 795 #### 5.1 查找没有 category1_name 但有 category 的文档(MySQL 有数据但 ES 没有)
742 796 ```bash
743   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  797 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
744 798 "query": {
745 799 "bool": {
746 800 "filter": [
... ... @@ -758,7 +812,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
758 812  
759 813 #### 5.2 查找有 option 但没有 specifications 的文档(数据转换问题)
760 814 ```bash
761   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  815 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
762 816 "query": {
763 817 "bool": {
764 818 "filter": [
... ... @@ -814,7 +868,7 @@ GET search_products_tenant_163/_mapping
814 868 GET search_products_tenant_163/_field_caps?fields=*
815 869  
816 870 ```bash
817   -curl -u 'saas:4hOaLaf41y2VuI8y' -X POST \
  871 +curl -u "$ES_AUTH" -X POST \
818 872 'http://localhost:9200/search_products_tenant_163/_count' \
819 873 -H 'Content-Type: application/json' \
820 874 -d '{
... ... @@ -827,7 +881,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X POST \
827 881 }
828 882 }'
829 883  
830   -curl -u 'saas:4hOaLaf41y2VuI8y' -X POST \
  884 +curl -u "$ES_AUTH" -X POST \
831 885 'http://localhost:9200/search_products_tenant_163/_count' \
832 886 -H 'Content-Type: application/json' \
833 887 -d '{
... ...
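The hunks above replace the hard-coded `saas:4hOaLaf41y2VuI8y` credential with an `$ES_AUTH` environment variable in the `user:password` shape that `curl -u` expects. A minimal Python sketch of the same convention — the `ES_AUTH` name comes from the diff, while the `saas:changeme` fallback is an illustrative placeholder, not a real secret:

```python
import os

# Read the Basic-Auth credential in the same `user:password` shape that
# `curl -u "$ES_AUTH"` consumes; the fallback here is a placeholder only.
raw = os.environ.get("ES_AUTH", "saas:changeme")
user, _, password = raw.partition(":")

# (user, password) is the tuple shape accepted by requests' `auth=` parameter,
# e.g. requests.get("http://localhost:9200/_cluster/health", auth=auth).
auth = (user, password)
print(auth)
```

This keeps the credential out of shell history and docs; rotating the password then only requires updating the environment, not every curl example.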
docs/性能测试报告.md
... ... @@ -18,13 +18,13 @@
18 18  
19 19 执行方式:
20 20 - 每组压测持续 `20s`
21   -- 使用统一脚本 `scripts/perf_api_benchmark.py`
  21 +- 使用统一脚本 `benchmarks/perf_api_benchmark.py`
22 22 - 通过 `--scenario` 多值 + `--concurrency-list` 一次性跑完 `场景 x 并发`
23 23  
24 24 ## 3. 压测工具优化说明(复用现有脚本)
25 25  
26 26 为了解决原脚本“一次只能跑一个场景+一个并发”的可用性问题,本次直接扩展现有脚本:
27   -- `scripts/perf_api_benchmark.py`
  27 +- `benchmarks/perf_api_benchmark.py`
28 28  
29 29 能力:
30 30 - 一条命令执行 `场景列表 x 并发列表` 全矩阵
... ... @@ -33,7 +33,7 @@
33 33 示例:
34 34  
35 35 ```bash
36   -.venv/bin/python scripts/perf_api_benchmark.py \
  36 +.venv/bin/python benchmarks/perf_api_benchmark.py \
37 37 --scenario backend_search,backend_suggest,embed_text,rerank \
38 38 --concurrency-list 1,5,10,20 \
39 39 --duration 20 \
... ... @@ -106,7 +106,7 @@ curl -sS http://127.0.0.1:6007/health
106 106  
107 107 ```bash
108 108 cd /data/saas-search
109   -.venv/bin/python scripts/perf_api_benchmark.py \
  109 +.venv/bin/python benchmarks/perf_api_benchmark.py \
110 110 --scenario backend_search,backend_suggest,embed_text,rerank \
111 111 --concurrency-list 1,5,10,20 \
112 112 --duration 20 \
... ... @@ -164,7 +164,7 @@ cd /data/saas-search
164 164 复现命令:
165 165  
166 166 ```bash
167   -.venv/bin/python scripts/perf_api_benchmark.py \
  167 +.venv/bin/python benchmarks/perf_api_benchmark.py \
168 168 --scenario rerank \
169 169 --duration 20 \
170 170 --concurrency-list 1,5,10,20 \
... ... @@ -237,7 +237,7 @@ cd /data/saas-search
237 237 - 使用项目虚拟环境执行:
238 238  
239 239 ```bash
240   -.venv/bin/python scripts/perf_api_benchmark.py -h
  240 +.venv/bin/python benchmarks/perf_api_benchmark.py -h
241 241 ```
242 242  
243 243 ### 10.3 某场景成功率下降
... ... @@ -249,7 +249,7 @@ cd /data/saas-search
249 249  
250 250 ## 11. 关联文件
251 251  
252   -- 压测脚本:`scripts/perf_api_benchmark.py`
  252 +- 压测脚本:`benchmarks/perf_api_benchmark.py`
253 253 - 本次结果:`perf_reports/2026-03-12/perf_matrix_report.json`
254 254 - Search 多租户补测:`perf_reports/2026-03-12/search_tenant_matrix/`
255 255 - Reranker 386 docs 口径补测:`perf_reports/2026-03-12/rerank_realistic/rerank_386docs.json`
... ... @@ -280,7 +280,7 @@ cd /data/saas-search
280 280 cd /data/saas-search
281 281 mkdir -p perf_reports/2026-03-12/search_tenant_matrix
282 282 for t in 0 1 2 3 4; do
283   - .venv/bin/python scripts/perf_api_benchmark.py \
  283 + .venv/bin/python benchmarks/perf_api_benchmark.py \
284 284 --scenario backend_search \
285 285 --concurrency-list 1,5,10,20 \
286 286 --duration 20 \
... ...
docs/搜索API对接指南-00-总览与快速开始.md
... ... @@ -90,7 +90,7 @@ curl -X POST "http://43.166.252.75:6002/search/" \
90 90 | 查询文档 | POST | `/indexer/documents` | 查询SPU文档数据(不写入ES) |
91 91 | 构建ES文档(正式对接) | POST | `/indexer/build-docs` | 基于上游提供的 MySQL 行数据构建 ES doc,不写入 ES,供 Java 等调用后自行写入 |
92 92 | 构建ES文档(测试用) | POST | `/indexer/build-docs-from-db` | 仅在测试/调试时使用,根据 `tenant_id + spu_ids` 内部查库并构建 ES doc |
93   -| 内容理解字段生成 | POST | `/indexer/enrich-content` | 根据商品标题批量生成 qanchors、enriched_attributes、tags,供微服务组合方式使用 |
  93 +| 内容理解字段生成 | POST | `/indexer/enrich-content` | 根据商品标题批量生成 qanchors、enriched_attributes、tags,供微服务组合方式使用(独立服务端口 6001) |
94 94 | 索引健康检查 | GET | `/indexer/health` | 检查索引服务状态 |
95 95 | 健康检查 | GET | `/admin/health` | 服务健康检查 |
96 96 | 获取配置 | GET | `/admin/config` | 获取租户配置 |
... ... @@ -104,7 +104,6 @@ curl -X POST "http://43.166.252.75:6002/search/" \
104 104 | 向量服务(图片) | 6008 | `POST /embed/image` | 图片向量化 |
105 105 | 翻译服务 | 6006 | `POST /translate` | 文本翻译(支持 qwen-mt / llm / deepl / 本地模型) |
106 106 | 重排服务 | 6007 | `POST /rerank` | 检索结果重排 |
107   -| 内容理解(Indexer 内) | 6004 | `POST /indexer/enrich-content` | 根据商品标题生成 qanchors、tags 等,供 indexer 微服务组合方式使用 |
  107 +| 内容理解(独立服务) | 6001 | `POST /indexer/enrich-content` | 根据商品标题生成 qanchors、tags 等,供 indexer 微服务组合方式使用 |
108 108  
109 109 ---
110   -
... ...
docs/搜索API对接指南-05-索引接口(Indexer).md
... ... @@ -13,7 +13,7 @@
13 13 | 查询文档 | POST | `/indexer/documents` | 按 SPU ID 列表查询 ES 文档,不写入 ES |
14 14 | 构建 ES 文档(正式) | POST | `/indexer/build-docs` | 由上游提供 MySQL 行数据,返回 ES-ready 文档,不写 ES |
15 15 | 构建 ES 文档(测试) | POST | `/indexer/build-docs-from-db` | 由本服务查库并构建文档,仅测试/调试用 |
16   -| 内容理解字段生成 | POST | `/indexer/enrich-content` | 根据商品标题批量生成 qanchors、enriched_attributes、tags(供微服务组合方式使用) |
  16 +| 内容理解字段生成 | POST | `/indexer/enrich-content` | 根据商品标题批量生成 qanchors、enriched_attributes、tags(供微服务组合方式使用;独立服务端口 6001) |
17 17 | 索引健康检查 | GET | `/indexer/health` | 检查索引服务与数据库连接状态 |
18 18  
19 19 #### 5.0 支撑外部 indexer 的三种方式
... ... @@ -23,7 +23,7 @@
23 23 | 方式 | 说明 | 适用场景 |
24 24 |------|------|----------|
25 25 | **1)doc 填充接口** | 调用 `POST /indexer/build-docs` 或 `POST /indexer/build-docs-from-db`,由本服务基于 MySQL 行数据构建完整 ES 文档(含多语言、向量、规格等),**不写入 ES**,由调用方自行写入。 | 希望一站式拿到 ES-ready doc,由己方控制写 ES 的时机与索引名。 |
26   -| **2)微服务组合** | 单独调用**翻译**、**向量化**、**内容理解字段生成**等接口,由 indexer 程序自己组装 doc 并写入 ES。翻译与向量化为独立微服务(见第 7 节);内容理解为 Indexer 服务内接口 `POST /indexer/enrich-content`。 | 需要灵活编排、或希望将 LLM/向量等耗时步骤与主链路解耦(如异步补齐 qanchors/tags)。 |
  26 +| **2)微服务组合** | 单独调用**翻译**、**向量化**、**内容理解字段生成**等接口,由 indexer 程序自己组装 doc 并写入 ES。翻译与向量化为独立微服务(见第 7 节);内容理解为 FacetAwareMatching 独立服务接口 `POST /indexer/enrich-content`(端口 6001)。 | 需要灵活编排、或希望将 LLM/向量等耗时步骤与主链路解耦(如异步补齐 qanchors/tags)。 |
27 27 | **3)本服务直接写 ES** | 调用全量索引 `POST /indexer/reindex`、增量索引 `POST /indexer/index`(指定 SPU ID 列表),由本服务从 MySQL 拉数并直接写入 ES。 | 自建运维、联调或不需要由 Java 写 ES 的场景。 |
28 28  
29 29 - **方式 1** 与 **方式 2** 下,ES 的写入方均为外部 indexer(或 Java),职责清晰。
... ... @@ -498,7 +498,7 @@ curl -X GET "http://localhost:6004/indexer/health"
498 498  
499 499 #### 请求示例(完整 curl)
500 500  
501   -> 完整请求体参考 `scripts/test_build_docs_api.py` 中的 `build_sample_request()`。
  501 +> 完整请求体参考 `tests/manual/test_build_docs_api.py` 中的 `build_sample_request()`。
502 502  
503 503 ```bash
504 504 # 单条 SPU 示例(含 spu、skus、options)
... ... @@ -648,13 +648,38 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \
648 648 ### 5.8 内容理解字段生成接口
649 649  
650 650 - **端点**: `POST /indexer/enrich-content`
651   -- **描述**: 根据商品内容信息批量生成 **qanchors**(锚文本)、**enriched_attributes**(语义属性)、**enriched_tags**(细分标签),供外部 indexer 在「微服务组合」方式下自行拼装 doc 时使用。请求以 `items[]` 传入商品内容字段(必填/可选见下表)。接口只暴露商品内容输入,语言选择、分析维度与最终字段结构统一由 `indexer.product_enrich` 内部决定;当前返回结果与 `search_products` mapping 保持一致。单次请求在线程池中执行,避免阻塞其他接口。
  651 +- **服务**: FacetAwareMatching 独立服务(默认端口 **6001**;由 `/data/FacetAwareMatching/scripts/service_ctl.sh` 管理)
  652 +- **描述**: 根据商品内容信息批量生成 **qanchors**(锚文本)、**enriched_attributes**(通用语义属性)、**enriched_tags**(细分标签)、**enriched_taxonomy_attributes**(taxonomy 结构化属性),供外部 indexer 在「微服务组合」方式下自行拼装 doc 时使用。请求以 `items[]` 传入商品内容字段(必填/可选见下表)。接口只暴露商品内容输入,语言选择、分析维度与最终字段结构统一由 FacetAwareMatching 的 `product_enrich` 内部决定;当前返回结果与 `search_products` mapping 保持一致。单次请求在线程池中执行,避免阻塞其他接口。
  653 +
  654 +当前支持的 `category_taxonomy_profile`:
  655 +- `apparel`
  656 +- `3c`
  657 +- `bags`
  658 +- `pet_supplies`
  659 +- `electronics`
  660 +- `outdoor`
  661 +- `home_appliances`
  662 +- `home_living`
  663 +- `wigs`
  664 +- `beauty`
  665 +- `accessories`
  666 +- `toys`
  667 +- `shoes`
  668 +- `sports`
  669 +- `others`
  670 +
  671 +说明:
  672 +- 所有 profile 的 `enriched_taxonomy_attributes.value` 都统一返回 `zh` + `en`。
  673 +- 外部调用 `/indexer/enrich-content` 时,以请求中的 `category_taxonomy_profile` 为准。
  674 +- 若 indexer 内部仍接入内容理解能力,taxonomy profile 请在调用侧显式传入(建议仍以租户行业配置为准)。
652 675  
653 676 #### 请求参数
654 677  
655 678 ```json
656 679 {
657 680 "tenant_id": "170",
  681 + "enrichment_scopes": ["generic", "category_taxonomy"],
  682 + "category_taxonomy_profile": "apparel",
658 683 "items": [
659 684 {
660 685 "spu_id": "223167",
... ... @@ -675,6 +700,8 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \
675 700 | 参数 | 类型 | 必填 | 默认值 | 说明 |
676 701 |------|------|------|--------|------|
677 702 | `tenant_id` | string | Y | - | 租户 ID。目前仅用于记录日志,不产生实际作用|
  703 +| `enrichment_scopes` | array[string] | N | `["generic", "category_taxonomy"]` | 选择要执行的增强范围。`generic` 生成 `qanchors`/`enriched_tags`/`enriched_attributes`,`category_taxonomy` 生成 `enriched_taxonomy_attributes` |
  704 +| `category_taxonomy_profile` | string | N | `apparel` | 品类 taxonomy profile。支持:`apparel`、`3c`、`bags`、`pet_supplies`、`electronics`、`outdoor`、`home_appliances`、`home_living`、`wigs`、`beauty`、`accessories`、`toys`、`shoes`、`sports`、`others` |
678 705 | `items` | array | Y | - | 待分析列表;**单次最多 50 条** |
679 706  
680 707 `items[]` 字段说明:
... ... @@ -683,21 +710,24 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \
683 710 |------|------|------|------|
684 711 | `spu_id` | string | Y | SPU ID,用于回填结果;目前仅用于记录日志,不产生实际作用|
685 712 | `title` | string | Y | 商品标题 |
686   -| `image_url` | string | N | 商品主图 URL;当前会参与内容缓存键,后续可用于图像/多模态内容理解 |
687   -| `brief` | string | N | 商品简介/短描述;当前会参与内容缓存键 |
688   -| `description` | string | N | 商品详情/长描述;当前会参与内容缓存键 |
  713 +| `image_url` | string | N | 商品主图 URL;当前仅透传,暂未参与 prompt 与缓存键,后续可用于图像/多模态内容理解 |
  714 +| `brief` | string | N | 商品简介/短描述;当前会参与 prompt 与缓存键 |
  715 +| `description` | string | N | 商品详情/长描述;当前会参与 prompt 与缓存键 |
689 716  
690 717 缓存说明:
691 718  
692   -- 内容缓存键仅由 `target_lang + items[]` 中会影响内容理解结果的输入文本构成,目前包括:`title`、`brief`、`description`、`image_url` 的规范化内容 hash。
  719 +- 内容缓存按 **增强范围 + taxonomy profile** 拆分;`generic` 与 `category_taxonomy:apparel` 等使用不同缓存命名空间,互不污染、可独立演进。
  720 +- 缓存键由 `analysis_kind + target_lang + prompt/schema 版本指纹 + prompt 输入文本 hash` 构成;对 category taxonomy 来说,profile 会进入 schema 标识与版本指纹。
  721 +- 当前真正参与 prompt 输入的字段是:`title`、`brief`、`description`;这些字段任一变化,都会落到新的缓存 key。
  722 +- `prompt/schema 版本指纹` 会综合 system prompt、shared instruction、localized table headers、result fields、user instruction template 等信息生成;因此只要提示词或输出契约变化,旧缓存会自然失效。
693 723 - `tenant_id`、`spu_id` 只用于请求归属与结果回填,不参与缓存键。
694   -- 因此,输入内容不变时可跨请求直接命中缓存;任一输入字段变化时,会自然落到新的缓存 key。
  724 +- 因此,输入内容与 prompt 契约都不变时可跨请求直接命中缓存;任意一侧变化,都会自然落到新的缓存 key。
695 725  
696 726 语言说明:
697 727  
698 728 - 接口不接受语言控制参数。
699 729 - 返回哪些语言、返回哪些语义维度,统一由 `indexer.product_enrich` 内部逻辑决定。
700   -- 当前为了与 `search_products` mapping 对齐,返回结果只包含核心索引语言 `zh`、`en`。
  730 +- 当前为了与 `search_products` mapping 对齐,通用增强字段与 taxonomy 字段都统一只返回核心索引语言 `zh`、`en`。
701 731  
702 732 批量请求建议:
703 733 - **全量**:强烈建议 尽可能 **20 个 SPU/doc** 攒成一个批次后再请求一次。
... ... @@ -709,6 +739,8 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \
709 739 ```json
710 740 {
711 741 "tenant_id": "170",
  742 + "enrichment_scopes": ["generic", "category_taxonomy"],
  743 + "category_taxonomy_profile": "apparel",
712 744 "total": 2,
713 745 "results": [
714 746 {
... ... @@ -725,6 +757,11 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \
725 757 { "name": "enriched_tags", "value": { "zh": "纯棉" } },
726 758 { "name": "usage_scene", "value": { "zh": "日常" } },
727 759 { "name": "enriched_tags", "value": { "en": "cotton" } }
  760 + ],
  761 + "enriched_taxonomy_attributes": [
  762 + { "name": "Product Type", "value": { "zh": ["T恤"], "en": ["t-shirt"] } },
  763 + { "name": "Target Gender", "value": { "zh": ["男"], "en": ["men"] } },
  764 + { "name": "Season", "value": { "zh": ["夏季"], "en": ["summer"] } }
728 765 ]
729 766 },
730 767 {
... ... @@ -735,7 +772,8 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \
735 772 "enriched_tags": {
736 773 "en": ["dolls", "toys"]
737 774 },
738   - "enriched_attributes": []
  775 + "enriched_attributes": [],
  776 + "enriched_taxonomy_attributes": []
739 777 }
740 778 ]
741 779 }
... ... @@ -743,10 +781,13 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \
743 781  
744 782 | 字段 | 类型 | 说明 |
745 783 |------|------|------|
746   -| `results` | array | 与请求 `items` 一一对应,每项含 `spu_id`、`qanchors`、`enriched_attributes`、`enriched_tags` |
  784 +| `enrichment_scopes` | array | 实际执行的增强范围列表 |
  785 +| `category_taxonomy_profile` | string | 实际使用的品类 taxonomy profile |
  786 +| `results` | array | 与请求 `items` 一一对应,每项含 `spu_id`、`qanchors`、`enriched_attributes`、`enriched_tags`、`enriched_taxonomy_attributes` |
747 787 | `results[].qanchors` | object | 与 ES `qanchors` 字段同结构,按语言键返回短语数组 |
748 788 | `results[].enriched_tags` | object | 与 ES `enriched_tags` 字段同结构,按语言键返回标签数组 |
749 789 | `results[].enriched_attributes` | array | 与 ES `enriched_attributes` nested 字段同结构,每项为 `{ "name", "value": { "zh"?: "...", "en"?: "..." } }` |
  790 +| `results[].enriched_taxonomy_attributes` | array | 与 ES `enriched_taxonomy_attributes` nested 字段同结构。每项通常为 `{ "name", "value": { "zh"?: [...], "en"?: [...] } }` |
750 791 | `results[].error` | string | 若该条处理失败(如 LLM 异常),会在此字段返回错误信息 |
751 792  
752 793 **错误响应**:
... ... @@ -756,10 +797,12 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \
756 797 #### 请求示例
757 798  
758 799 ```bash
759   -curl -X POST "http://localhost:6004/indexer/enrich-content" \
  800 +curl -X POST "http://localhost:6001/indexer/enrich-content" \
760 801 -H "Content-Type: application/json" \
761 802 -d '{
762   - "tenant_id": "170",
  803 + "tenant_id": "163",
  804 + "enrichment_scopes": ["generic", "category_taxonomy"],
  805 + "category_taxonomy_profile": "apparel",
763 806 "items": [
764 807 {
765 808 "spu_id": "223167",
... ... @@ -773,4 +816,3 @@ curl -X POST "http://localhost:6004/indexer/enrich-content" \
773 816 ```
774 817  
775 818 ---
776   -
... ...
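The request contract for `POST /indexer/enrich-content` documented in the hunks above can be sketched as a small client-side payload builder. `build_enrich_request` and its error messages are illustrative names; only the field names, the 50-item cap, and the taxonomy profile list come from the doc itself:

```python
# Profiles listed in the updated parameter table for /indexer/enrich-content.
ALLOWED_PROFILES = {
    "apparel", "3c", "bags", "pet_supplies", "electronics", "outdoor",
    "home_appliances", "home_living", "wigs", "beauty", "accessories",
    "toys", "shoes", "sports", "others",
}

def build_enrich_request(tenant_id, items, profile="apparel",
                         scopes=("generic", "category_taxonomy")):
    """Build a request body matching the documented contract (sketch)."""
    if profile not in ALLOWED_PROFILES:
        raise ValueError(f"unsupported category_taxonomy_profile: {profile}")
    if not 1 <= len(items) <= 50:  # doc: 单次最多 50 条
        raise ValueError("items must contain 1..50 entries")
    for item in items:
        # spu_id and title are the only required item fields per the table.
        if not item.get("spu_id") or not item.get("title"):
            raise ValueError("each item needs spu_id and title")
    return {
        "tenant_id": tenant_id,
        "enrichment_scopes": list(scopes),
        "category_taxonomy_profile": profile,
        "items": items,
    }

payload = build_enrich_request("163", [{"spu_id": "223167", "title": "纯棉T恤 男 夏季"}])
```

Sending it is then a single `requests.post("http://localhost:6001/indexer/enrich-content", json=payload)`; validating client-side first avoids burning an LLM round-trip on a request the service would reject anyway.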
docs/搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md
... ... @@ -444,7 +444,7 @@ curl "http://localhost:6006/health"
444 444  
445 445 - **Base URL**: Indexer 服务地址,如 `http://localhost:6004`
446 446 - **路径**: `POST /indexer/enrich-content`
447   -- **说明**: 根据商品标题批量生成 `qanchors`、`enriched_attributes`、`tags`,用于拼装 ES 文档。内部使用大模型(需配置 `DASHSCOPE_API_KEY`),支持多语言与 Redis 缓存;单次最多 50 条,建议批量调用以提升效率。
  447 +- **说明**: 根据商品标题批量生成 `qanchors`、`enriched_attributes`、`enriched_tags`、`enriched_taxonomy_attributes`,用于拼装 ES 文档。支持通过 `enrichment_scopes` 选择执行 `generic` / `category_taxonomy`,并通过 `category_taxonomy_profile` 选择对应大类的 taxonomy prompt/profile;默认执行 `generic + category_taxonomy(apparel)`。当前支持的 taxonomy profile 包括 `apparel`、`3c`、`bags`、`pet_supplies`、`electronics`、`outdoor`、`home_appliances`、`home_living`、`wigs`、`beauty`、`accessories`、`toys`、`shoes`、`sports`、`others`。所有 profile 的 taxonomy 输出都统一返回 `zh` + `en`,`category_taxonomy_profile` 只决定字段集合。内部使用大模型(需配置 `DASHSCOPE_API_KEY`),支持多语言与 Redis 缓存;单次最多 50 条,建议批量调用以提升效率。
448 448  
449 449 请求/响应格式、示例及错误码见 [-05-索引接口(Indexer)](./搜索API对接指南-05-索引接口(Indexer).md#58-内容理解字段生成接口)。
450 450  
... ...
docs/搜索API对接指南-10-接口级压测脚本.md
... ... @@ -4,7 +4,7 @@
4 4  
5 5 ## 10. 接口级压测脚本
6 6  
7   -仓库提供统一压测脚本:`scripts/perf_api_benchmark.py`,用于对以下接口做并发压测:
  7 +仓库提供统一压测脚本:`benchmarks/perf_api_benchmark.py`,用于对以下接口做并发压测:
8 8  
9 9 - 后端搜索:`POST /search/`
10 10 - 搜索建议:`GET /search/suggestions`
... ... @@ -18,21 +18,21 @@
18 18  
19 19 ```bash
20 20 # suggest 压测(tenant 162)
21   -python scripts/perf_api_benchmark.py \
  21 +python benchmarks/perf_api_benchmark.py \
22 22 --scenario backend_suggest \
23 23 --tenant-id 162 \
24 24 --duration 30 \
25 25 --concurrency 50
26 26  
27 27 # search 压测
28   -python scripts/perf_api_benchmark.py \
  28 +python benchmarks/perf_api_benchmark.py \
29 29 --scenario backend_search \
30 30 --tenant-id 162 \
31 31 --duration 30 \
32 32 --concurrency 20
33 33  
34 34 # 全链路压测(search + suggest + embedding + translate + rerank)
35   -python scripts/perf_api_benchmark.py \
  35 +python benchmarks/perf_api_benchmark.py \
36 36 --scenario all \
37 37 --tenant-id 162 \
38 38 --duration 60 \
... ... @@ -45,17 +45,16 @@ python scripts/perf_api_benchmark.py \
45 45 可通过 `--cases-file` 覆盖默认请求模板。示例文件:
46 46  
47 47 ```bash
48   -scripts/perf_cases.json.example
  48 +benchmarks/perf_cases.json.example
49 49 ```
50 50  
51 51 执行示例:
52 52  
53 53 ```bash
54   -python scripts/perf_api_benchmark.py \
  54 +python benchmarks/perf_api_benchmark.py \
55 55 --scenario all \
56 56 --tenant-id 162 \
57   - --cases-file scripts/perf_cases.json.example \
  57 + --cases-file benchmarks/perf_cases.json.example \
58 58 --duration 60 \
59 59 --concurrency 40
60 60 ```
61   -
... ...
docs/相关性检索优化说明.md
... ... @@ -330,7 +330,7 @@ python -m pytest -q tests/test_rerank_client.py tests/test_es_query_builder.py t
330 330 ./scripts/service_ctl.sh restart backend
331 331 sleep 3
332 332 ./scripts/service_ctl.sh status backend
333   -./scripts/evaluation/start_eval.sh.sh batch
  333 +./scripts/evaluation/start_eval.sh batch
334 334 ```
335 335  
336 336 评估产物在 `artifacts/search_evaluation/`(如 `search_eval.sqlite3`、`batch_reports/` 下的 JSON/Markdown)。流程与参数说明见 [scripts/evaluation/README.md](../scripts/evaluation/README.md)。
... ... @@ -895,4 +895,3 @@ rerank_score:0.4784
895 895 rerank_score:0.5849
896 896 "zh": "新款女士修身仿旧牛仔短裤 – 休闲性感磨边水洗牛仔短裤,时尚舒",
897 897 "en": "New Women's Slim-fit Vintage Washed Denim Shorts – Casual Sexy Frayed Hem, Fashionable & Comfortable"
898   -
... ...
docs/缓存与Redis使用说明.md
... ... @@ -196,18 +196,25 @@ services:
196 196 - 配置项:
197 197 - `ANCHOR_CACHE_PREFIX = REDIS_CONFIG.get("anchor_cache_prefix", "product_anchors")`
198 198 - `ANCHOR_CACHE_EXPIRE_DAYS = int(REDIS_CONFIG.get("anchor_cache_expire_days", 30))`
199   -- Key 构造函数:`_make_anchor_cache_key(title, target_lang, tenant_id)`
  199 +- Key 构造函数:`_make_analysis_cache_key(product, target_lang, analysis_kind)`
200 200 - 模板:
201 201  
202 202 ```text
203   -{ANCHOR_CACHE_PREFIX}:{tenant_or_global}:{target_lang}:{md5(title)}
  203 +{ANCHOR_CACHE_PREFIX}:{analysis_kind}:{prompt_contract_hash}:{target_lang}:{prompt_input_prefix}{md5(prompt_input)}
204 204 ```
205 205  
206 206 - 字段说明:
207 207 - `ANCHOR_CACHE_PREFIX`:默认 `"product_anchors"`,可通过 `.env` 中的 `REDIS_ANCHOR_CACHE_PREFIX`(若存在)间接配置到 `REDIS_CONFIG`;
208   - - `tenant_or_global`:`tenant_id` 去空白后的字符串,若为空则使用 `"global"`;
  208 + - `analysis_kind`:分析族,目前至少包括 `content` 与 `taxonomy`,两者缓存隔离;
  209 + - `prompt_contract_hash`:基于 system prompt、shared instruction、localized headers、result fields、user instruction template、schema cache version 等生成的短 hash;只要提示词或输出契约变化,缓存会自动失效;
209 210 - `target_lang`:内容理解输出语言,例如 `zh`;
210   - - `md5(title)`:对原始商品标题(UTF-8)做 MD5。
  211 + - `prompt_input_prefix + md5(prompt_input)`:对真正送入 prompt 的商品文本做前缀 + MD5;当前 prompt 输入来自 `title`、`brief`、`description` 的规范化拼接结果。
  212 +
  213 +设计原则:
  214 +
  215 +- 只让**实际影响 LLM 输出**的输入参与 key;
  216 +- 不让 `tenant_id`、`spu_id` 这类“结果归属信息”污染缓存;
  217 +- prompt 或 schema 变更时,不依赖人工清理 Redis,也能自然切换到新 key。
211 218  
212 219 ### 4.2 Value 与类型
213 220  
... ... @@ -229,6 +236,7 @@ services:
229 236 ```
230 237  
231 238 - 读取时通过 `json.loads(raw)` 还原为 `Dict[str, Any]`。
  239 +- `content` 与 `taxonomy` 的 value 结构会随各自 schema 不同而不同,但都会先通过统一的 normalize 逻辑再写缓存。
232 240  
233 241 ### 4.3 Expiry policy
234 242  
... ...
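The key scheme documented above can be sketched in a few lines. This is an illustrative reconstruction, not the exact implementation: the names `_PREFIX` and `make_cache_key`, and the contract dict contents, are assumptions for demonstration.

```python
import hashlib
import json

# Illustrative sketch of the documented cache-key scheme (names are
# assumptions, not the real implementation in product_enrich.py).
_PREFIX = "product_anchors"

def make_cache_key(analysis_kind: str, target_lang: str,
                   prompt_input: str, contract: dict) -> str:
    # Hashing the full prompt/output contract means any prompt or schema
    # change rotates the key space automatically, with no manual Redis cleanup.
    contract_hash = hashlib.md5(
        json.dumps(contract, ensure_ascii=False, sort_keys=True).encode("utf-8")
    ).hexdigest()[:12]
    # Only inputs that actually influence the LLM output participate;
    # tenant_id / spu_id deliberately stay out of the key.
    input_hash = hashlib.md5(prompt_input.encode("utf-8")).hexdigest()
    return (f"{_PREFIX}:{analysis_kind}:{contract_hash}:"
            f"{target_lang}:{prompt_input[:4]}{input_hash}")
```

Note how two calls with the same prompt input but different contracts land on different keys, which is exactly the "no manual cleanup" property described above.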
embeddings/README.md
... ... @@ -98,10 +98,10 @@
98 98  
99 99 ### Performance and load testing (reusing repo scripts)
100 100  
101   -- API-level load testing (same methodology as `perf_reports/2026-03-12/matrix_report/` and similar): `scripts/perf_api_benchmark.py`
102   - - Example: `python scripts/perf_api_benchmark.py --scenario embed_text --duration 30 --concurrency 20`
  101 +- API-level load testing (same methodology as `perf_reports/2026-03-12/matrix_report/` and similar): `benchmarks/perf_api_benchmark.py`
  102 + - Example: `python benchmarks/perf_api_benchmark.py --scenario embed_text --duration 30 --concurrency 20`
103 103 - Text/image embedding requests may carry a `priority` (same semantics as online admission control): `--embed-text-priority 1`, `--embed-image-priority 1`
104   - - Custom request template: `--cases-file scripts/perf_cases.json.example`
  104 + - Custom request template: `--cases-file benchmarks/perf_cases.json.example`
105 105 - Historical matrix results and notes: `perf_reports/2026-03-12/matrix_report/summary.md`.
106 106  
107 107 ### Starting the service
... ...
frontend/static/js/app.js
... ... @@ -316,7 +316,10 @@ async function performSearch(page = 1) {
316 316 document.getElementById('productGrid').innerHTML = '';
317 317  
318 318 try {
319   - const response = await fetch(`${API_BASE_URL}/search/`, {
  319 + const searchUrl = new URL(`${API_BASE_URL}/search/`, window.location.origin);
  320 + searchUrl.searchParams.set('tenant_id', tenantId);
  321 +
  322 + const response = await fetch(searchUrl.toString(), {
320 323 method: 'POST',
321 324 headers: {
322 325 'Content-Type': 'application/json',
... ...
indexer/README.md
... ... @@ -8,7 +8,7 @@
8 8  
9 9 ### 1.1 System roles
10 10  
11   -- **Java indexing program (/home/tw/saas-server)**
  11 +- **Java indexing program**
12 12 - Decides **when to index, and which SPUs to index** (scheduling & triggering).
13 13 - Handles **base-data sync for products / shops / categories** (writes MySQL).
14 14 - Handles **full/incremental index scheduling in a multi-tenant environment**, but no longer cares about concrete doc field details.
... ...
indexer/Untitled 0 → 100644
... ... @@ -0,0 +1 @@
  1 +taxonomy
0 2 \ No newline at end of file
... ...
indexer/document_transformer.py
... ... @@ -242,6 +242,7 @@ class SPUDocumentTransformer:
242 242 - qanchors.{lang}
243 243 - enriched_tags.{lang}
244 244 - enriched_attributes[].value.{lang}
  245 + - enriched_taxonomy_attributes[].value.{lang}
245 246  
246 247 Design goals:
247 248 - Batch LLM calls as much as possible;
... ... @@ -273,7 +274,12 @@ class SPUDocumentTransformer:
273 274  
274 275 tenant_id = str(docs[0].get("tenant_id") or "").strip() or None
275 276 try:
276   - results = build_index_content_fields(items=items, tenant_id=tenant_id)
  277 + # TODO: read this tenant's real industry from the database and use it to replace the current default apparel profile.
  278 + results = build_index_content_fields(
  279 + items=items,
  280 + tenant_id=tenant_id,
  281 + category_taxonomy_profile="apparel",
  282 + )
277 283 except Exception as e:
278 284 logger.warning("LLM batch attribute fill failed: %s", e)
279 285 return
... ... @@ -296,6 +302,8 @@ class SPUDocumentTransformer:
296 302 doc["enriched_tags"] = enrichment["enriched_tags"]
297 303 if enrichment.get("enriched_attributes"):
298 304 doc["enriched_attributes"] = enrichment["enriched_attributes"]
  305 + if enrichment.get("enriched_taxonomy_attributes"):
  306 + doc["enriched_taxonomy_attributes"] = enrichment["enriched_taxonomy_attributes"]
299 307 except Exception as e:
300 308 logger.warning("Failed to apply enrichment to doc (spu_id=%s): %s", doc.get("spu_id"), e)
301 309  
... ... @@ -666,6 +674,7 @@ class SPUDocumentTransformer:
666 674  
667 675 tenant_id = doc.get("tenant_id")
668 676 try:
  677 + # TODO: read this tenant's real industry from the database and use it to replace the current default apparel profile.
669 678 results = build_index_content_fields(
670 679 items=[
671 680 {
... ... @@ -677,6 +686,7 @@ class SPUDocumentTransformer:
677 686 }
678 687 ],
679 688 tenant_id=str(tenant_id),
  689 + category_taxonomy_profile="apparel",
680 690 )
681 691 except Exception as e:
682 692 logger.warning("LLM attribute fill failed for SPU %s: %s", spu_id, e)
... ...
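The conditional assignments in the transformer above boil down to "copy only non-empty enrichment fields onto the doc". A minimal sketch under that assumption (`apply_enrichment` is a hypothetical helper name, not a function in the codebase):

```python
from typing import Any, Dict

def apply_enrichment(doc: Dict[str, Any], enrichment: Dict[str, Any]) -> None:
    # Hypothetical helper mirroring the transformer's per-field pattern:
    # a field is written only when the enrichment value is truthy, so empty
    # dicts/lists from a failed or partial analysis never overwrite the doc.
    for field_name in ("qanchors", "enriched_tags", "enriched_attributes",
                       "enriched_taxonomy_attributes"):
        if enrichment.get(field_name):
            doc[field_name] = enrichment[field_name]
```

This is why `enriched_taxonomy_attributes` only appears on docs whose taxonomy analysis actually produced values.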
indexer/product_enrich.py
... ... @@ -14,10 +14,11 @@ import time
14 14 import hashlib
15 15 import uuid
16 16 import threading
  17 +from dataclasses import dataclass, field
17 18 from collections import OrderedDict
18 19 from datetime import datetime
19 20 from concurrent.futures import ThreadPoolExecutor
20   -from typing import List, Dict, Tuple, Any, Optional
  21 +from typing import List, Dict, Tuple, Any, Optional, FrozenSet
21 22  
22 23 import redis
23 24 import requests
... ... @@ -30,6 +31,7 @@ from indexer.product_enrich_prompts import (
30 31 USER_INSTRUCTION_TEMPLATE,
31 32 LANGUAGE_MARKDOWN_TABLE_HEADERS,
32 33 SHARED_ANALYSIS_INSTRUCTION,
  34 + CATEGORY_TAXONOMY_PROFILES,
33 35 )
34 36  
35 37 # Configuration
... ... @@ -144,10 +146,26 @@ if _missing_prompt_langs:
144 146 )
145 147  
146 148  
147   -# Multi-value field separators: ASCII comma, full-width comma, the enumeration comma (、), plus the legacy ; | / and whitespace
  149 +# Multi-value field separators
148 150 _MULTI_VALUE_FIELD_SPLIT_RE = re.compile(r"[,、,;|/\n\t]+")
  151 +# Placeholder tokens treated as "no content" in table cells
  152 +_MARKDOWN_EMPTY_CELL_LITERALS: Tuple[str, ...] = ("-", "–", "—", "none", "null", "n/a", "无")
  153 +_MARKDOWN_EMPTY_CELL_TOKENS_CF: FrozenSet[str] = frozenset(
  154 + lit.casefold() for lit in _MARKDOWN_EMPTY_CELL_LITERALS
  155 +)
  156 +
  157 +def _normalize_markdown_table_cell(raw: Optional[str]) -> str:
  158 + """Strip, and treat placeholder tokens uniformly as the empty string."""
  159 + s = str(raw or "").strip()
  160 + if not s:
  161 + return ""
  162 + if s.casefold() in _MARKDOWN_EMPTY_CELL_TOKENS_CF:
  163 + return ""
  164 + return s
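Restated standalone for illustration, the placeholder handling above strips the cell, then collapses dash and none/null-style tokens (case-insensitively, via `casefold`) to the empty string:

```python
# Standalone restatement of the cell normalizer above, for illustration only.
_EMPTY_TOKENS = frozenset(
    s.casefold() for s in ("-", "–", "—", "none", "null", "n/a", "无")
)

def normalize_cell(raw) -> str:
    # Strip first, then map known placeholders to "".
    s = str(raw or "").strip()
    if not s or s.casefold() in _EMPTY_TOKENS:
        return ""
    return s
```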
149 165 _CORE_INDEX_LANGUAGES = ("zh", "en")
150   -_ANALYSIS_ATTRIBUTE_FIELD_MAP = (
  166 +_DEFAULT_ENRICHMENT_SCOPES = ("generic", "category_taxonomy")
  167 +_DEFAULT_CATEGORY_TAXONOMY_PROFILE = "apparel"
  168 +_CONTENT_ANALYSIS_ATTRIBUTE_FIELD_MAP = (
151 169 ("tags", "enriched_tags"),
152 170 ("target_audience", "target_audience"),
153 171 ("usage_scene", "usage_scene"),
... ... @@ -156,7 +174,7 @@ _ANALYSIS_ATTRIBUTE_FIELD_MAP = (
156 174 ("material", "material"),
157 175 ("features", "features"),
158 176 )
159   -_ANALYSIS_RESULT_FIELDS = (
  177 +_CONTENT_ANALYSIS_RESULT_FIELDS = (
160 178 "title",
161 179 "category_path",
162 180 "tags",
... ... @@ -168,7 +186,7 @@ _ANALYSIS_RESULT_FIELDS = (
168 186 "features",
169 187 "anchor_text",
170 188 )
171   -_ANALYSIS_MEANINGFUL_FIELDS = (
  189 +_CONTENT_ANALYSIS_MEANINGFUL_FIELDS = (
172 190 "tags",
173 191 "target_audience",
174 192 "usage_scene",
... ... @@ -178,9 +196,111 @@ _ANALYSIS_MEANINGFUL_FIELDS = (
178 196 "features",
179 197 "anchor_text",
180 198 )
181   -_ANALYSIS_FIELD_ALIASES = {
  199 +_CONTENT_ANALYSIS_FIELD_ALIASES = {
182 200 "tags": ("tags", "enriched_tags"),
183 201 }
  202 +_CONTENT_ANALYSIS_QUALITY_FIELDS = ("title", "category_path", "anchor_text")
  203 +
  204 +
  205 +@dataclass(frozen=True)
  206 +class AnalysisSchema:
  207 + name: str
  208 + shared_instruction: str
  209 + markdown_table_headers: Dict[str, List[str]]
  210 + result_fields: Tuple[str, ...]
  211 + meaningful_fields: Tuple[str, ...]
  212 + cache_version: str = "v1"
  213 + field_aliases: Dict[str, Tuple[str, ...]] = field(default_factory=dict)
  214 + quality_fields: Tuple[str, ...] = ()
  215 +
  216 + def get_headers(self, target_lang: str) -> Optional[List[str]]:
  217 + return self.markdown_table_headers.get(target_lang)
  218 +
  219 +
  220 +_ANALYSIS_SCHEMAS: Dict[str, AnalysisSchema] = {
  221 + "content": AnalysisSchema(
  222 + name="content",
  223 + shared_instruction=SHARED_ANALYSIS_INSTRUCTION,
  224 + markdown_table_headers=LANGUAGE_MARKDOWN_TABLE_HEADERS,
  225 + result_fields=_CONTENT_ANALYSIS_RESULT_FIELDS,
  226 + meaningful_fields=_CONTENT_ANALYSIS_MEANINGFUL_FIELDS,
  227 + cache_version="v2",
  228 + field_aliases=_CONTENT_ANALYSIS_FIELD_ALIASES,
  229 + quality_fields=_CONTENT_ANALYSIS_QUALITY_FIELDS,
  230 + ),
  231 +}
  232 +
  233 +def _build_taxonomy_profile_schema(profile: str, config: Dict[str, Any]) -> AnalysisSchema:
  234 + return AnalysisSchema(
  235 + name=f"taxonomy:{profile}",
  236 + shared_instruction=config["shared_instruction"],
  237 + markdown_table_headers=config["markdown_table_headers"],
  238 + result_fields=tuple(field_spec["key"] for field_spec in config["fields"]),
  239 + meaningful_fields=tuple(field_spec["key"] for field_spec in config["fields"]),
  240 + cache_version="v1",
  241 + )
  242 +
  243 +
  244 +_CATEGORY_TAXONOMY_PROFILE_SCHEMAS: Dict[str, AnalysisSchema] = {
  245 + profile: _build_taxonomy_profile_schema(profile, config)
  246 + for profile, config in CATEGORY_TAXONOMY_PROFILES.items()
  247 +}
  248 +
  249 +_CATEGORY_TAXONOMY_PROFILE_ATTRIBUTE_FIELD_MAPS: Dict[str, Tuple[Tuple[str, str], ...]] = {
  250 + profile: tuple((field_spec["key"], field_spec["label"]) for field_spec in config["fields"])
  251 + for profile, config in CATEGORY_TAXONOMY_PROFILES.items()
  252 +}
  253 +
  254 +
  255 +def get_supported_category_taxonomy_profiles() -> Tuple[str, ...]:
  256 + return tuple(_CATEGORY_TAXONOMY_PROFILE_SCHEMAS.keys())
  257 +
  258 +
  259 +def _normalize_category_taxonomy_profile(category_taxonomy_profile: Optional[str] = None) -> str:
  260 + profile = str(category_taxonomy_profile or _DEFAULT_CATEGORY_TAXONOMY_PROFILE).strip()
  261 + if profile not in _CATEGORY_TAXONOMY_PROFILE_SCHEMAS:
  262 + supported = ", ".join(get_supported_category_taxonomy_profiles())
  263 + raise ValueError(
  264 + f"Unsupported category_taxonomy_profile: {profile}. Supported profiles: {supported}"
  265 + )
  266 + return profile
  267 +
  268 +
  269 +def _get_analysis_schema(
  270 + analysis_kind: str,
  271 + *,
  272 + category_taxonomy_profile: Optional[str] = None,
  273 +) -> AnalysisSchema:
  274 + if analysis_kind == "content":
  275 + return _ANALYSIS_SCHEMAS["content"]
  276 + if analysis_kind == "taxonomy":
  277 + profile = _normalize_category_taxonomy_profile(category_taxonomy_profile)
  278 + return _CATEGORY_TAXONOMY_PROFILE_SCHEMAS[profile]
  279 + raise ValueError(f"Unsupported analysis_kind: {analysis_kind}")
  280 +
  281 +
  282 +def _get_taxonomy_attribute_field_map(
  283 + category_taxonomy_profile: Optional[str] = None,
  284 +) -> Tuple[Tuple[str, str], ...]:
  285 + profile = _normalize_category_taxonomy_profile(category_taxonomy_profile)
  286 + return _CATEGORY_TAXONOMY_PROFILE_ATTRIBUTE_FIELD_MAPS[profile]
  287 +
  288 +
  289 +def _normalize_enrichment_scopes(
  290 + enrichment_scopes: Optional[List[str]] = None,
  291 +) -> Tuple[str, ...]:
  292 + requested = _DEFAULT_ENRICHMENT_SCOPES if not enrichment_scopes else tuple(enrichment_scopes)
  293 + normalized: List[str] = []
  294 + seen = set()
  295 + for enrichment_scope in requested:
  296 + scope = str(enrichment_scope).strip()
  297 + if scope not in {"generic", "category_taxonomy"}:
  298 + raise ValueError(f"Unsupported enrichment_scope: {scope}")
  299 + if scope in seen:
  300 + continue
  301 + seen.add(scope)
  302 + normalized.append(scope)
  303 + return tuple(normalized)
184 304  
185 305  
186 306 def split_multi_value_field(text: Optional[str]) -> List[str]:
... ... @@ -235,12 +355,12 @@ def _get_product_id(product: Dict[str, Any]) -&gt; str:
235 355 return str(product.get("id") or product.get("spu_id") or "").strip()
236 356  
237 357  
238   -def _get_analysis_field_aliases(field_name: str) -> Tuple[str, ...]:
239   - return _ANALYSIS_FIELD_ALIASES.get(field_name, (field_name,))
  358 +def _get_analysis_field_aliases(field_name: str, schema: AnalysisSchema) -> Tuple[str, ...]:
  359 + return schema.field_aliases.get(field_name, (field_name,))
240 360  
241 361  
242   -def _get_analysis_field_value(row: Dict[str, Any], field_name: str) -> Any:
243   - for alias in _get_analysis_field_aliases(field_name):
  362 +def _get_analysis_field_value(row: Dict[str, Any], field_name: str, schema: AnalysisSchema) -> Any:
  363 + for alias in _get_analysis_field_aliases(field_name, schema):
244 364 if alias in row:
245 365 return row.get(alias)
246 366 return None
... ... @@ -261,6 +381,7 @@ def _has_meaningful_value(value: Any) -&gt; bool:
261 381 def _make_empty_analysis_result(
262 382 product: Dict[str, Any],
263 383 target_lang: str,
  384 + schema: AnalysisSchema,
264 385 error: Optional[str] = None,
265 386 ) -> Dict[str, Any]:
266 387 result = {
... ... @@ -268,7 +389,7 @@ def _make_empty_analysis_result(
268 389 "lang": target_lang,
269 390 "title_input": str(product.get("title") or "").strip(),
270 391 }
271   - for field in _ANALYSIS_RESULT_FIELDS:
  392 + for field in schema.result_fields:
272 393 result[field] = ""
273 394 if error:
274 395 result["error"] = error
... ... @@ -279,42 +400,59 @@ def _normalize_analysis_result(
279 400 result: Dict[str, Any],
280 401 product: Dict[str, Any],
281 402 target_lang: str,
  403 + schema: AnalysisSchema,
282 404 ) -> Dict[str, Any]:
283   - normalized = _make_empty_analysis_result(product, target_lang)
  405 + normalized = _make_empty_analysis_result(product, target_lang, schema)
284 406 if not isinstance(result, dict):
285 407 return normalized
286 408  
287 409 normalized["lang"] = str(result.get("lang") or target_lang).strip() or target_lang
288   - normalized["title"] = str(result.get("title") or "").strip()
289   - normalized["category_path"] = str(result.get("category_path") or "").strip()
290 410 normalized["title_input"] = str(
291 411 product.get("title") or result.get("title_input") or ""
292 412 ).strip()
293 413  
294   - for field in _ANALYSIS_RESULT_FIELDS:
295   - if field in {"title", "category_path"}:
296   - continue
297   - normalized[field] = str(_get_analysis_field_value(result, field) or "").strip()
  414 + for field in schema.result_fields:
  415 + normalized[field] = str(_get_analysis_field_value(result, field, schema) or "").strip()
298 416  
299 417 if result.get("error"):
300 418 normalized["error"] = str(result.get("error"))
301 419 return normalized
302 420  
303 421  
304   -def _has_meaningful_analysis_content(result: Dict[str, Any]) -> bool:
305   - return any(_has_meaningful_value(result.get(field)) for field in _ANALYSIS_MEANINGFUL_FIELDS)
  422 +def _has_meaningful_analysis_content(result: Dict[str, Any], schema: AnalysisSchema) -> bool:
  423 + return any(_has_meaningful_value(result.get(field)) for field in schema.meaningful_fields)
  424 +
  425 +
  426 +def _append_analysis_attributes(
  427 + target: List[Dict[str, Any]],
  428 + row: Dict[str, Any],
  429 + lang: str,
  430 + schema: AnalysisSchema,
  431 + field_map: Tuple[Tuple[str, str], ...],
  432 +) -> None:
  433 + for source_name, output_name in field_map:
  434 + raw = _get_analysis_field_value(row, source_name, schema)
  435 + if not raw:
  436 + continue
  437 + _append_named_lang_phrase_map(
  438 + target,
  439 + name=output_name,
  440 + lang=lang,
  441 + raw_value=raw,
  442 + )
306 443  
307 444  
308 445 def _apply_index_content_row(result: Dict[str, Any], row: Dict[str, Any], lang: str) -> None:
309 446 if not row or row.get("error"):
310 447 return
311 448  
312   - anchor_text = str(_get_analysis_field_value(row, "anchor_text") or "").strip()
  449 + content_schema = _get_analysis_schema("content")
  450 + anchor_text = str(_get_analysis_field_value(row, "anchor_text", content_schema) or "").strip()
313 451 if anchor_text:
314 452 _append_lang_phrase_map(result["qanchors"], lang=lang, raw_value=anchor_text)
315 453  
316   - for source_name, output_name in _ANALYSIS_ATTRIBUTE_FIELD_MAP:
317   - raw = _get_analysis_field_value(row, source_name)
  454 + for source_name, output_name in _CONTENT_ANALYSIS_ATTRIBUTE_FIELD_MAP:
  455 + raw = _get_analysis_field_value(row, source_name, content_schema)
318 456 if not raw:
319 457 continue
320 458 _append_named_lang_phrase_map(
... ... @@ -327,6 +465,28 @@ def _apply_index_content_row(result: Dict[str, Any], row: Dict[str, Any], lang:
327 465 _append_lang_phrase_map(result["enriched_tags"], lang=lang, raw_value=raw)
328 466  
329 467  
  468 +def _apply_index_taxonomy_row(
  469 + result: Dict[str, Any],
  470 + row: Dict[str, Any],
  471 + lang: str,
  472 + *,
  473 + category_taxonomy_profile: Optional[str] = None,
  474 +) -> None:
  475 + if not row or row.get("error"):
  476 + return
  477 +
  478 + _append_analysis_attributes(
  479 + result["enriched_taxonomy_attributes"],
  480 + row=row,
  481 + lang=lang,
  482 + schema=_get_analysis_schema(
  483 + "taxonomy",
  484 + category_taxonomy_profile=category_taxonomy_profile,
  485 + ),
  486 + field_map=_get_taxonomy_attribute_field_map(category_taxonomy_profile),
  487 + )
  488 +
  489 +
330 490 def _normalize_index_content_item(item: Dict[str, Any]) -> Dict[str, str]:
331 491 item_id = _get_product_id(item)
332 492 return {
... ... @@ -341,6 +501,8 @@ def _normalize_index_content_item(item: Dict[str, Any]) -&gt; Dict[str, str]:
341 501 def build_index_content_fields(
342 502 items: List[Dict[str, Any]],
343 503 tenant_id: Optional[str] = None,
  504 + enrichment_scopes: Optional[List[str]] = None,
  505 + category_taxonomy_profile: Optional[str] = None,
344 506 ) -> List[Dict[str, Any]]:
345 507 """
346 508 High-level entry point: generates content-understanding fields aligned with the ES mapping.
... ... @@ -349,18 +511,23 @@ def build_index_content_fields(
349 511 - `id` or `spu_id`
350 512 - `title`
351 513 - Optional `brief` / `description` / `image_url`
  514 + - Optional `enrichment_scopes`; by default both `generic` and `category_taxonomy` run
  515 + - Optional `category_taxonomy_profile`, defaulting to `apparel`
352 516  
353 517 Returned item structure:
354 518 - `id`
355 519 - `qanchors`
356 520 - `enriched_tags`
357 521 - `enriched_attributes`
  522 + - `enriched_taxonomy_attributes`
358 523 - 可选 `error`
359 524  
360 525 Where:
361 526 - `qanchors.{lang}` is an array of phrases
362 527 - `enriched_tags.{lang}` is an array of tags
363 528 """
  529 + requested_enrichment_scopes = _normalize_enrichment_scopes(enrichment_scopes)
  530 + normalized_taxonomy_profile = _normalize_category_taxonomy_profile(category_taxonomy_profile)
364 531 normalized_items = [_normalize_index_content_item(item) for item in items]
365 532 if not normalized_items:
366 533 return []
... ... @@ -371,32 +538,72 @@ def build_index_content_fields(
371 538 "qanchors": {},
372 539 "enriched_tags": {},
373 540 "enriched_attributes": [],
  541 + "enriched_taxonomy_attributes": [],
374 542 }
375 543 for item in normalized_items
376 544 }
377 545  
378 546 for lang in _CORE_INDEX_LANGUAGES:
379   - try:
380   - rows = analyze_products(
381   - products=normalized_items,
382   - target_lang=lang,
383   - batch_size=BATCH_SIZE,
384   - tenant_id=tenant_id,
385   - )
386   - except Exception as e:
387   - logger.warning("build_index_content_fields failed for lang=%s: %s", lang, e)
388   - for item in normalized_items:
389   - results_by_id[item["id"]].setdefault("error", str(e))
390   - continue
391   -
392   - for row in rows or []:
393   - item_id = str(row.get("id") or "").strip()
394   - if not item_id or item_id not in results_by_id:
  547 + if "generic" in requested_enrichment_scopes:
  548 + try:
  549 + rows = analyze_products(
  550 + products=normalized_items,
  551 + target_lang=lang,
  552 + batch_size=BATCH_SIZE,
  553 + tenant_id=tenant_id,
  554 + analysis_kind="content",
  555 + category_taxonomy_profile=normalized_taxonomy_profile,
  556 + )
  557 + except Exception as e:
  558 + logger.warning("build_index_content_fields content enrichment failed for lang=%s: %s", lang, e)
  559 + for item in normalized_items:
  560 + results_by_id[item["id"]].setdefault("error", str(e))
395 561 continue
396   - if row.get("error"):
397   - results_by_id[item_id].setdefault("error", row["error"])
  562 +
  563 + for row in rows or []:
  564 + item_id = str(row.get("id") or "").strip()
  565 + if not item_id or item_id not in results_by_id:
  566 + continue
  567 + if row.get("error"):
  568 + results_by_id[item_id].setdefault("error", row["error"])
  569 + continue
  570 + _apply_index_content_row(results_by_id[item_id], row=row, lang=lang)
  571 +
  572 + if "category_taxonomy" in requested_enrichment_scopes:
  573 + for lang in _CORE_INDEX_LANGUAGES:
  574 + try:
  575 + taxonomy_rows = analyze_products(
  576 + products=normalized_items,
  577 + target_lang=lang,
  578 + batch_size=BATCH_SIZE,
  579 + tenant_id=tenant_id,
  580 + analysis_kind="taxonomy",
  581 + category_taxonomy_profile=normalized_taxonomy_profile,
  582 + )
  583 + except Exception as e:
  584 + logger.warning(
  585 + "build_index_content_fields taxonomy enrichment failed for profile=%s lang=%s: %s",
  586 + normalized_taxonomy_profile,
  587 + lang,
  588 + e,
  589 + )
  590 + for item in normalized_items:
  591 + results_by_id[item["id"]].setdefault("error", str(e))
398 592 continue
399   - _apply_index_content_row(results_by_id[item_id], row=row, lang=lang)
  593 +
  594 + for row in taxonomy_rows or []:
  595 + item_id = str(row.get("id") or "").strip()
  596 + if not item_id or item_id not in results_by_id:
  597 + continue
  598 + if row.get("error"):
  599 + results_by_id[item_id].setdefault("error", row["error"])
  600 + continue
  601 + _apply_index_taxonomy_row(
  602 + results_by_id[item_id],
  603 + row=row,
  604 + lang=lang,
  605 + category_taxonomy_profile=normalized_taxonomy_profile,
  606 + )
400 607  
401 608 return [results_by_id[item["id"]] for item in normalized_items]
402 609  
... ... @@ -463,52 +670,129 @@ def _build_prompt_input_text(product: Dict[str, Any]) -&gt; str:
463 670 return _truncate_by_words(candidate, PROMPT_INPUT_MAX_WORDS)
464 671  
465 672  
466   -def _make_anchor_cache_key(
  673 +def _make_analysis_cache_key(
467 674 product: Dict[str, Any],
468 675 target_lang: str,
  676 + analysis_kind: str,
  677 + category_taxonomy_profile: Optional[str] = None,
469 678 ) -> str:
470   - """Builds the cache key, determined solely by the actual prompt input text plus the target language."""
  679 + """Builds the cache key, determined solely by the analysis kind, the actual prompt input text, and the target language."""
  680 + schema = _get_analysis_schema(
  681 + analysis_kind,
  682 + category_taxonomy_profile=category_taxonomy_profile,
  683 + )
471 684 prompt_input = _build_prompt_input_text(product)
472 685 h = hashlib.md5(prompt_input.encode("utf-8")).hexdigest()
473   - return f"{ANCHOR_CACHE_PREFIX}:{target_lang}:{prompt_input[:4]}{h}"
  686 + prompt_contract = {
  687 + "schema_name": schema.name,
  688 + "cache_version": schema.cache_version,
  689 + "system_message": SYSTEM_MESSAGE,
  690 + "user_instruction_template": USER_INSTRUCTION_TEMPLATE,
  691 + "shared_instruction": schema.shared_instruction,
  692 + "assistant_headers": schema.get_headers(target_lang),
  693 + "result_fields": schema.result_fields,
  694 + "meaningful_fields": schema.meaningful_fields,
  695 + "field_aliases": schema.field_aliases,
  696 + }
  697 + prompt_contract_hash = hashlib.md5(
  698 + json.dumps(prompt_contract, ensure_ascii=False, sort_keys=True).encode("utf-8")
  699 + ).hexdigest()[:12]
  700 + return (
  701 + f"{ANCHOR_CACHE_PREFIX}:{analysis_kind}:{prompt_contract_hash}:"
  702 + f"{target_lang}:{prompt_input[:4]}{h}"
  703 + )
474 704  
475 705  
476   -def _get_cached_anchor_result(
  706 +def _make_anchor_cache_key(
477 707 product: Dict[str, Any],
478 708 target_lang: str,
  709 +) -> str:
  710 + return _make_analysis_cache_key(product, target_lang, analysis_kind="content")
  711 +
  712 +
  713 +def _get_cached_analysis_result(
  714 + product: Dict[str, Any],
  715 + target_lang: str,
  716 + analysis_kind: str,
  717 + category_taxonomy_profile: Optional[str] = None,
479 718 ) -> Optional[Dict[str, Any]]:
480 719 if not _anchor_redis:
481 720 return None
  721 + schema = _get_analysis_schema(
  722 + analysis_kind,
  723 + category_taxonomy_profile=category_taxonomy_profile,
  724 + )
482 725 try:
483   - key = _make_anchor_cache_key(product, target_lang)
  726 + key = _make_analysis_cache_key(
  727 + product,
  728 + target_lang,
  729 + analysis_kind,
  730 + category_taxonomy_profile=category_taxonomy_profile,
  731 + )
484 732 raw = _anchor_redis.get(key)
485 733 if not raw:
486 734 return None
487   - result = _normalize_analysis_result(json.loads(raw), product=product, target_lang=target_lang)
488   - if not _has_meaningful_analysis_content(result):
  735 + result = _normalize_analysis_result(
  736 + json.loads(raw),
  737 + product=product,
  738 + target_lang=target_lang,
  739 + schema=schema,
  740 + )
  741 + if not _has_meaningful_analysis_content(result, schema):
489 742 return None
490 743 return result
491 744 except Exception as e:
492   - logger.warning(f"Failed to get anchor cache: {e}")
  745 + logger.warning("Failed to get %s analysis cache: %s", analysis_kind, e)
493 746 return None
494 747  
495 748  
496   -def _set_cached_anchor_result(
  749 +def _get_cached_anchor_result(
  750 + product: Dict[str, Any],
  751 + target_lang: str,
  752 +) -> Optional[Dict[str, Any]]:
  753 + return _get_cached_analysis_result(product, target_lang, analysis_kind="content")
  754 +
  755 +
  756 +def _set_cached_analysis_result(
497 757 product: Dict[str, Any],
498 758 target_lang: str,
499 759 result: Dict[str, Any],
  760 + analysis_kind: str,
  761 + category_taxonomy_profile: Optional[str] = None,
500 762 ) -> None:
501 763 if not _anchor_redis:
502 764 return
  765 + schema = _get_analysis_schema(
  766 + analysis_kind,
  767 + category_taxonomy_profile=category_taxonomy_profile,
  768 + )
503 769 try:
504   - normalized = _normalize_analysis_result(result, product=product, target_lang=target_lang)
505   - if not _has_meaningful_analysis_content(normalized):
  770 + normalized = _normalize_analysis_result(
  771 + result,
  772 + product=product,
  773 + target_lang=target_lang,
  774 + schema=schema,
  775 + )
  776 + if not _has_meaningful_analysis_content(normalized, schema):
506 777 return
507   - key = _make_anchor_cache_key(product, target_lang)
  778 + key = _make_analysis_cache_key(
  779 + product,
  780 + target_lang,
  781 + analysis_kind,
  782 + category_taxonomy_profile=category_taxonomy_profile,
  783 + )
508 784 ttl = ANCHOR_CACHE_EXPIRE_DAYS * 24 * 3600
509 785 _anchor_redis.setex(key, ttl, json.dumps(normalized, ensure_ascii=False))
510 786 except Exception as e:
511   - logger.warning(f"Failed to set anchor cache: {e}")
  787 + logger.warning("Failed to set %s analysis cache: %s", analysis_kind, e)
  788 +
  789 +
  790 +def _set_cached_anchor_result(
  791 + product: Dict[str, Any],
  792 + target_lang: str,
  793 + result: Dict[str, Any],
  794 +) -> None:
  795 + _set_cached_analysis_result(product, target_lang, result, analysis_kind="content")
512 796  
513 797  
514 798 def _build_assistant_prefix(headers: List[str]) -> str:
... ... @@ -517,8 +801,8 @@ def _build_assistant_prefix(headers: List[str]) -&gt; str:
517 801 return f"{header_line}\n{separator_line}\n"
518 802  
519 803  
520   -def _build_shared_context(products: List[Dict[str, str]]) -> str:
521   - shared_context = SHARED_ANALYSIS_INSTRUCTION
  804 +def _build_shared_context(products: List[Dict[str, str]], schema: AnalysisSchema) -> str:
  805 + shared_context = schema.shared_instruction
522 806 for idx, product in enumerate(products, 1):
523 807 prompt_input = _build_prompt_input_text(product)
524 808 shared_context += f"{idx}. {prompt_input}\n"
... ... @@ -550,16 +834,23 @@ def reset_logged_shared_context_keys() -&gt; None:
550 834 def create_prompt(
551 835 products: List[Dict[str, str]],
552 836 target_lang: str = "zh",
553   -) -> Tuple[str, str, str]:
  837 + analysis_kind: str = "content",
  838 + category_taxonomy_profile: Optional[str] = None,
  839 +) -> Tuple[Optional[str], Optional[str], Optional[str]]:
554 840 """Creates the shared context, localized output requirements, and Partial Mode assistant prefix for the target language."""
555   - markdown_table_headers = LANGUAGE_MARKDOWN_TABLE_HEADERS.get(target_lang)
  841 + schema = _get_analysis_schema(
  842 + analysis_kind,
  843 + category_taxonomy_profile=category_taxonomy_profile,
  844 + )
  845 + markdown_table_headers = schema.get_headers(target_lang)
556 846 if not markdown_table_headers:
557 847 logger.warning(
558   - "Unsupported target_lang for markdown table headers: %s",
  848 + "Unsupported target_lang for markdown table headers: kind=%s lang=%s",
  849 + analysis_kind,
559 850 target_lang,
560 851 )
561 852 return None, None, None
562   - shared_context = _build_shared_context(products)
  853 + shared_context = _build_shared_context(products, schema)
563 854 language_label = SOURCE_LANG_CODE_MAP.get(target_lang, target_lang)
564 855 user_prompt = USER_INSTRUCTION_TEMPLATE.format(language=language_label).strip()
565 856 assistant_prefix = _build_assistant_prefix(markdown_table_headers)
... ... @@ -592,6 +883,7 @@ def call_llm(
592 883 user_prompt: str,
593 884 assistant_prefix: str,
594 885 target_lang: str = "zh",
  886 + analysis_kind: str = "content",
595 887 ) -> Tuple[str, str]:
596 888 """Calls the LLM API (with retries), using Partial Mode to force the markdown table prefix."""
597 889 headers = {
... ... @@ -631,8 +923,9 @@ def call_llm(
631 923 if _mark_shared_context_logged_once(shared_context_key):
632 924 logger.info(f"\n{'=' * 80}")
633 925 logger.info(
634   - "LLM Shared Context [model=%s, shared_key=%s, chars=%s] (logged once per process key)",
  926 + "LLM Shared Context [model=%s, kind=%s, shared_key=%s, chars=%s] (logged once per process key)",
635 927 MODEL_NAME,
  928 + analysis_kind,
636 929 shared_context_key,
637 930 len(shared_context),
638 931 )
... ... @@ -641,8 +934,9 @@ def call_llm(
641 934  
642 935 verbose_logger.info(f"\n{'=' * 80}")
643 936 verbose_logger.info(
644   - "LLM Request [model=%s, lang=%s, shared_key=%s, tail_key=%s]:",
  937 + "LLM Request [model=%s, kind=%s, lang=%s, shared_key=%s, tail_key=%s]:",
645 938 MODEL_NAME,
  939 + analysis_kind,
646 940 target_lang,
647 941 shared_context_key,
648 942 localized_tail_key,
... ... @@ -654,7 +948,8 @@ def call_llm(
654 948 verbose_logger.info(f"\nAssistant Prefix:\n{assistant_prefix}")
655 949  
656 950 logger.info(
657   - "\nLLM Request Variant [lang=%s, shared_key=%s, tail_key=%s, prompt_chars=%s, prefix_chars=%s]",
  951 + "\nLLM Request Variant [kind=%s, lang=%s, shared_key=%s, tail_key=%s, prompt_chars=%s, prefix_chars=%s]",
  952 + analysis_kind,
658 953 target_lang,
659 954 shared_context_key,
660 955 localized_tail_key,
... ... @@ -685,8 +980,9 @@ def call_llm(
685 980 usage = result.get("usage") or {}
686 981  
687 982 verbose_logger.info(
688   - "\nLLM Response [model=%s, lang=%s, shared_key=%s, tail_key=%s]:",
  983 + "\nLLM Response [model=%s, kind=%s, lang=%s, shared_key=%s, tail_key=%s]:",
689 984 MODEL_NAME,
  985 + analysis_kind,
690 986 target_lang,
691 987 shared_context_key,
692 988 localized_tail_key,
... ... @@ -697,7 +993,8 @@ def call_llm(
697 993 full_markdown = _merge_partial_response(assistant_prefix, generated_content)
698 994  
699 995 logger.info(
700   - "\nLLM Response Summary [lang=%s, shared_key=%s, tail_key=%s, generated_chars=%s, completion_tokens=%s, prompt_tokens=%s, total_tokens=%s]",
  996 + "\nLLM Response Summary [kind=%s, lang=%s, shared_key=%s, tail_key=%s, generated_chars=%s, completion_tokens=%s, prompt_tokens=%s, total_tokens=%s]",
  997 + analysis_kind,
701 998 target_lang,
702 999 shared_context_key,
703 1000 localized_tail_key,
... ... @@ -742,8 +1039,16 @@ def call_llm(
742 1039 session.close()
743 1040  
744 1041  
745   -def parse_markdown_table(markdown_content: str) -> List[Dict[str, str]]:
  1042 +def parse_markdown_table(
  1043 + markdown_content: str,
  1044 + analysis_kind: str = "content",
  1045 + category_taxonomy_profile: Optional[str] = None,
  1046 +) -> List[Dict[str, str]]:
746 1047 """Parses markdown table content."""
  1048 + schema = _get_analysis_schema(
  1049 + analysis_kind,
  1050 + category_taxonomy_profile=category_taxonomy_profile,
  1051 + )
747 1052 lines = markdown_content.strip().split("\n")
748 1053 data = []
749 1054 data_started = False
... ... @@ -768,22 +1073,16 @@ def parse_markdown_table(markdown_content: str) -&gt; List[Dict[str, str]]:
768 1073  
769 1074 # Parse a data row
770 1075 parts = [p.strip() for p in line.split("|")]
771   - parts = [p for p in parts if p] # drop empty strings
  1076 + if parts and parts[0] == "":
  1077 + parts = parts[1:]
  1078 + if parts and parts[-1] == "":
  1079 + parts = parts[:-1]
772 1080  
773 1081 if len(parts) >= 2:
774   - row = {
775   - "seq_no": parts[0],
776   - "title": parts[1], # product title (in the target language)
777   - "category_path": parts[2] if len(parts) > 2 else "", # category path
778   - "tags": parts[3] if len(parts) > 3 else "", # fine-grained tags
779   - "target_audience": parts[4] if len(parts) > 4 else "", # target audience
780   - "usage_scene": parts[5] if len(parts) > 5 else "", # usage scenes
781   - "season": parts[6] if len(parts) > 6 else "", # applicable seasons
782   - "key_attributes": parts[7] if len(parts) > 7 else "", # key attributes
783   - "material": parts[8] if len(parts) > 8 else "", # material notes
784   - "features": parts[9] if len(parts) > 9 else "", # features
785   - "anchor_text": parts[10] if len(parts) > 10 else "", # anchor text
786   - }
  1082 + row = {"seq_no": parts[0]}
  1083 + for field_index, field_name in enumerate(schema.result_fields, start=1):
  1084 + cell = parts[field_index] if len(parts) > field_index else ""
  1085 + row[field_name] = _normalize_markdown_table_cell(cell)
787 1086 data.append(row)
788 1087  
789 1088 return data
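The edge-trimming above is the core of the parsing change: only the empty strings produced by the row's leading and trailing pipes are dropped, so a blank interior cell no longer shifts later columns left. A standalone sketch (the helper name `split_markdown_row` is hypothetical):

```python
from typing import List


def split_markdown_row(line: str) -> List[str]:
    """Split a markdown table row on pipes, keeping interior blank cells."""
    parts = [p.strip() for p in line.split("|")]
    # Drop only the artifacts of the leading/trailing pipes; an empty cell in
    # the middle of the row must survive so later columns do not shift left.
    if parts and parts[0] == "":
        parts = parts[1:]
    if parts and parts[-1] == "":
        parts = parts[:-1]
    return parts


# The previous filter `[p for p in parts if p]` would also drop the blank
# third cell here, pushing "summer" into the wrong column.
print(split_markdown_row("| 1 | Red Dress |  | summer |"))
# → ['1', 'Red Dress', '', 'summer']
```

With the old filter the same input would parse as `['1', 'Red Dress', 'summer']`, silently mis-assigning every field after the blank cell.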
... ... @@ -794,31 +1093,49 @@ def _log_parsed_result_quality(
794 1093 parsed_results: List[Dict[str, str]],
795 1094 target_lang: str,
796 1095 batch_num: int,
  1096 + analysis_kind: str,
  1097 + category_taxonomy_profile: Optional[str] = None,
797 1098 ) -> None:
  1099 + schema = _get_analysis_schema(
  1100 + analysis_kind,
  1101 + category_taxonomy_profile=category_taxonomy_profile,
  1102 + )
798 1103 expected = len(batch_data)
799 1104 actual = len(parsed_results)
800 1105 if actual != expected:
801 1106 logger.warning(
802   - "Parsed row count mismatch for batch=%s lang=%s: expected=%s actual=%s",
  1107 + "Parsed row count mismatch for kind=%s batch=%s lang=%s: expected=%s actual=%s",
  1108 + analysis_kind,
803 1109 batch_num,
804 1110 target_lang,
805 1111 expected,
806 1112 actual,
807 1113 )
808 1114  
809   - missing_anchor = sum(1 for item in parsed_results if not str(item.get("anchor_text") or "").strip())
810   - missing_category = sum(1 for item in parsed_results if not str(item.get("category_path") or "").strip())
811   - missing_title = sum(1 for item in parsed_results if not str(item.get("title") or "").strip())
  1115 + if not schema.quality_fields:
  1116 + logger.info(
  1117 + "Parsed Quality Summary [kind=%s, batch=%s, lang=%s]: rows=%s/%s",
  1118 + analysis_kind,
  1119 + batch_num,
  1120 + target_lang,
  1121 + actual,
  1122 + expected,
  1123 + )
  1124 + return
812 1125  
  1126 + missing_summary = ", ".join(
  1127 + f"missing_{field}="
  1128 + f"{sum(1 for item in parsed_results if not str(item.get(field) or '').strip())}"
  1129 + for field in schema.quality_fields
  1130 + )
813 1131 logger.info(
814   - "Parsed Quality Summary [batch=%s, lang=%s]: rows=%s/%s, missing_title=%s, missing_category=%s, missing_anchor=%s",
  1132 + "Parsed Quality Summary [kind=%s, batch=%s, lang=%s]: rows=%s/%s, %s",
  1133 + analysis_kind,
815 1134 batch_num,
816 1135 target_lang,
817 1136 actual,
818 1137 expected,
819   - missing_title,
820   - missing_category,
821   - missing_anchor,
  1138 + missing_summary,
822 1139 )
823 1140  
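The generator expression feeding `missing_summary` above can be exercised in isolation; a minimal sketch with hypothetical field names:

```python
from typing import Dict, List


def missing_field_summary(rows: List[Dict[str, str]], quality_fields: List[str]) -> str:
    # One missing_<field>=<count> pair per quality field, comma-joined,
    # counting rows whose value is absent or whitespace-only
    return ", ".join(
        f"missing_{field}="
        f"{sum(1 for row in rows if not str(row.get(field) or '').strip())}"
        for field in quality_fields
    )


rows = [
    {"title": "Red Dress", "category_path": ""},
    {"title": "  ", "category_path": "Apparel > Dresses"},
]
print(missing_field_summary(rows, ["title", "category_path"]))
# → missing_title=1, missing_category_path=1
```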
824 1141  
... ... @@ -826,29 +1143,44 @@ def process_batch(
826 1143 batch_data: List[Dict[str, str]],
827 1144 batch_num: int,
828 1145 target_lang: str = "zh",
  1146 + analysis_kind: str = "content",
  1147 + category_taxonomy_profile: Optional[str] = None,
829 1148 ) -> List[Dict[str, Any]]:
830 1149 """处理一个批次的数据"""
  1150 + schema = _get_analysis_schema(
  1151 + analysis_kind,
  1152 + category_taxonomy_profile=category_taxonomy_profile,
  1153 + )
831 1154 logger.info(f"\n{'#' * 80}")
832   - logger.info(f"Processing Batch {batch_num} ({len(batch_data)} items)")
  1155 + logger.info(
  1156 + "Processing Batch %s (%s items, kind=%s)",
  1157 + batch_num,
  1158 + len(batch_data),
  1159 + analysis_kind,
  1160 + )
833 1161  
834 1162 # Build the prompt
835 1163 shared_context, user_prompt, assistant_prefix = create_prompt(
836 1164 batch_data,
837 1165 target_lang=target_lang,
  1166 + analysis_kind=analysis_kind,
  1167 + category_taxonomy_profile=category_taxonomy_profile,
838 1168 )
839 1169  
840 1170 # If prompt creation fails (e.g. unsupported target_lang), fail the whole batch without calling the LLM
841 1171 if shared_context is None or user_prompt is None or assistant_prefix is None:
842 1172 logger.error(
843   - "Failed to create prompt for batch %s, target_lang=%s; "
  1173 + "Failed to create prompt for batch %s, kind=%s, target_lang=%s; "
844 1174 "marking entire batch as failed without calling LLM",
845 1175 batch_num,
  1176 + analysis_kind,
846 1177 target_lang,
847 1178 )
848 1179 return [
849 1180 _make_empty_analysis_result(
850 1181 item,
851 1182 target_lang,
  1183 + schema,
852 1184 error=f"prompt_creation_failed: unsupported target_lang={target_lang}",
853 1185 )
854 1186 for item in batch_data
... ... @@ -861,11 +1193,23 @@ def process_batch(
861 1193 user_prompt,
862 1194 assistant_prefix,
863 1195 target_lang=target_lang,
  1196 + analysis_kind=analysis_kind,
864 1197 )
865 1198  
866 1199 # Parse the response
867   - parsed_results = parse_markdown_table(raw_response)
868   - _log_parsed_result_quality(batch_data, parsed_results, target_lang, batch_num)
  1200 + parsed_results = parse_markdown_table(
  1201 + raw_response,
  1202 + analysis_kind=analysis_kind,
  1203 + category_taxonomy_profile=category_taxonomy_profile,
  1204 + )
  1205 + _log_parsed_result_quality(
  1206 + batch_data,
  1207 + parsed_results,
  1208 + target_lang,
  1209 + batch_num,
  1210 + analysis_kind,
  1211 + category_taxonomy_profile,
  1212 + )
869 1213  
870 1214 logger.info(f"\nParsed Results ({len(parsed_results)} items):")
871 1215 logger.info(json.dumps(parsed_results, ensure_ascii=False, indent=2))
... ... @@ -879,10 +1223,12 @@ def process_batch(
879 1223 parsed_item,
880 1224 product=source_product,
881 1225 target_lang=target_lang,
  1226 + schema=schema,
882 1227 )
883 1228 results_with_ids.append(result)
884 1229 logger.info(
885   - "Mapped: seq=%s -> original_id=%s",
  1230 + "Mapped: kind=%s seq=%s -> original_id=%s",
  1231 + analysis_kind,
886 1232 parsed_item.get("seq_no"),
887 1233 source_product.get("id"),
888 1234 )
... ... @@ -890,6 +1236,7 @@ def process_batch(
890 1236 # Save the batch JSON log to a separate file
891 1237 batch_log = {
892 1238 "batch_num": batch_num,
  1239 + "analysis_kind": analysis_kind,
893 1240 "timestamp": datetime.now().isoformat(),
894 1241 "input_products": batch_data,
895 1242 "raw_response": raw_response,
... ... @@ -900,7 +1247,10 @@ def process_batch(
900 1247  
901 1248 # When writing batch JSON logs concurrently, keep filenames unique to avoid overwrites
902 1249 batch_call_id = uuid.uuid4().hex[:12]
903   - batch_log_file = LOG_DIR / f"batch_{batch_num:04d}_{timestamp}_{batch_call_id}.json"
  1250 + batch_log_file = (
  1251 + LOG_DIR
  1252 + / f"batch_{analysis_kind}_{batch_num:04d}_{timestamp}_{batch_call_id}.json"
  1253 + )
904 1254 with open(batch_log_file, "w", encoding="utf-8") as f:
905 1255 json.dump(batch_log, f, ensure_ascii=False, indent=2)
906 1256  
... ... @@ -912,7 +1262,7 @@ def process_batch(
912 1262 logger.error(f"Error processing batch {batch_num}: {str(e)}", exc_info=True)
913 1263 # Return empty results while preserving the ID mapping
914 1264 return [
915   - _make_empty_analysis_result(item, target_lang, error=str(e))
  1265 + _make_empty_analysis_result(item, target_lang, schema, error=str(e))
916 1266 for item in batch_data
917 1267 ]
918 1268  
... ... @@ -922,6 +1272,8 @@ def analyze_products(
922 1272 target_lang: str = "zh",
923 1273 batch_size: Optional[int] = None,
924 1274 tenant_id: Optional[str] = None,
  1275 + analysis_kind: str = "content",
  1276 + category_taxonomy_profile: Optional[str] = None,
925 1277 ) -> List[Dict[str, Any]]:
926 1278 """
927 1279 Library entry point: given inputs and a target language, return anchor text and per-dimension info.
... ... @@ -937,6 +1289,10 @@ def analyze_products(
937 1289 if not products:
938 1290 return []
939 1291  
  1292 + _get_analysis_schema(
  1293 + analysis_kind,
  1294 + category_taxonomy_profile=category_taxonomy_profile,
  1295 + )
940 1296 results_by_index: List[Optional[Dict[str, Any]]] = [None] * len(products)
941 1297 uncached_items: List[Tuple[int, Dict[str, str]]] = []
942 1298  
... ... @@ -946,11 +1302,16 @@ def analyze_products(
946 1302 uncached_items.append((idx, product))
947 1303 continue
948 1304  
949   - cached = _get_cached_anchor_result(product, target_lang)
  1305 + cached = _get_cached_analysis_result(
  1306 + product,
  1307 + target_lang,
  1308 + analysis_kind,
  1309 + category_taxonomy_profile=category_taxonomy_profile,
  1310 + )
950 1311 if cached:
951 1312 logger.info(
952 1313 f"[analyze_products] Cache hit for title='{title[:50]}...', "
953   - f"lang={target_lang}"
  1314 + f"kind={analysis_kind}, lang={target_lang}"
954 1315 )
955 1316 results_by_index[idx] = cached
956 1317 continue
... ... @@ -979,9 +1340,15 @@ def analyze_products(
979 1340 for batch_num, batch_slice, batch in batch_jobs:
980 1341 logger.info(
981 1342 f"[analyze_products] Processing batch {batch_num}/{total_batches}, "
982   - f"size={len(batch)}, target_lang={target_lang}"
  1343 + f"size={len(batch)}, kind={analysis_kind}, target_lang={target_lang}"
  1344 + )
  1345 + batch_results = process_batch(
  1346 + batch,
  1347 + batch_num=batch_num,
  1348 + target_lang=target_lang,
  1349 + analysis_kind=analysis_kind,
  1350 + category_taxonomy_profile=category_taxonomy_profile,
983 1351 )
984   - batch_results = process_batch(batch, batch_num=batch_num, target_lang=target_lang)
985 1352  
986 1353 for (original_idx, product), item in zip(batch_slice, batch_results):
987 1354 results_by_index[original_idx] = item
... ... @@ -992,7 +1359,13 @@ def analyze_products(
992 1359 # Do not cache error results, to avoid amplifying transient failures
993 1360 continue
994 1361 try:
995   - _set_cached_anchor_result(product, target_lang, item)
  1362 + _set_cached_analysis_result(
  1363 + product,
  1364 + target_lang,
  1365 + item,
  1366 + analysis_kind,
  1367 + category_taxonomy_profile=category_taxonomy_profile,
  1368 + )
996 1369 except Exception:
997 1370 # A warning was already logged internally
998 1371 pass
... ... @@ -1000,10 +1373,11 @@ def analyze_products(
1000 1373 max_workers = min(CONTENT_UNDERSTANDING_MAX_WORKERS, len(batch_jobs))
1001 1374 logger.info(
1002 1375 "[analyze_products] Using ThreadPoolExecutor for uncached batches: "
1003   - "max_workers=%s, total_batches=%s, bs=%s, target_lang=%s",
  1376 + "max_workers=%s, total_batches=%s, bs=%s, kind=%s, target_lang=%s",
1004 1377 max_workers,
1005 1378 total_batches,
1006 1379 bs,
  1380 + analysis_kind,
1007 1381 target_lang,
1008 1382 )
1009 1383  
... ... @@ -1013,7 +1387,12 @@ def analyze_products(
1013 1387 future_by_batch_num: Dict[int, Any] = {}
1014 1388 for batch_num, _batch_slice, batch in batch_jobs:
1015 1389 future_by_batch_num[batch_num] = executor.submit(
1016   - process_batch, batch, batch_num=batch_num, target_lang=target_lang
  1390 + process_batch,
  1391 + batch,
  1392 + batch_num=batch_num,
  1393 + target_lang=target_lang,
  1394 + analysis_kind=analysis_kind,
  1395 + category_taxonomy_profile=category_taxonomy_profile,
1017 1396 )
1018 1397  
1019 1398 # Backfill by batch_num to keep output stable (results_by_index maps by original input index)
... ... @@ -1028,7 +1407,13 @@ def analyze_products(
1029 1408 # Do not cache error results, to avoid amplifying transient failures
1029 1408 continue
1030 1409 try:
1031   - _set_cached_anchor_result(product, target_lang, item)
  1410 + _set_cached_analysis_result(
  1411 + product,
  1412 + target_lang,
  1413 + item,
  1414 + analysis_kind,
  1415 + category_taxonomy_profile=category_taxonomy_profile,
  1416 + )
1032 1417 except Exception:
1033 1418 # A warning was already logged internally
1034 1419 pass
... ...
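`_get_cached_analysis_result` / `_set_cached_analysis_result` are not shown in this diff, but the call sites above imply the cache key must now include `analysis_kind` and `category_taxonomy_profile` in addition to product and language. A hypothetical sketch of such a key (the real helpers may differ):

```python
import hashlib
import json
from typing import Dict, Optional


def analysis_cache_key(
    product: Dict[str, str],
    target_lang: str,
    analysis_kind: str,
    category_taxonomy_profile: Optional[str] = None,
) -> str:
    # Sketch only: the point is that kind and profile are part of the key,
    # otherwise "content" and taxonomy results for the same product/language
    # would overwrite each other in the cache.
    payload = json.dumps(
        {
            "title": product.get("title", ""),
            "lang": target_lang,
            "kind": analysis_kind,
            "profile": category_taxonomy_profile or "",
        },
        ensure_ascii=False,
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


k1 = analysis_cache_key({"title": "遥控车"}, "zh", "content")
k2 = analysis_cache_key({"title": "遥控车"}, "zh", "taxonomy", "toys")
assert k1 != k2  # different analysis kinds never share a cache entry
```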
indexer/product_enrich_prompts.py
1 1 #!/usr/bin/env python3
2 2  
3   -from typing import Any, Dict
  3 +from typing import Any, Dict, Tuple
4 4  
5 5 SYSTEM_MESSAGE = (
6 6 "You are an e-commerce product annotator. "
... ... @@ -33,6 +33,337 @@ Input product list:
33 33 USER_INSTRUCTION_TEMPLATE = """Please strictly return a Markdown table following the given columns in the specified language. For any column containing multiple values, separate them with commas. Do not add any other explanation.
34 34 Language: {language}"""
35 35  
  36 +def _taxonomy_field(
  37 + key: str,
  38 + label: str,
  39 + description: str,
  40 + zh_label: str | None = None,
  41 +) -> Dict[str, str]:
  42 + return {
  43 + "key": key,
  44 + "label": label,
  45 + "description": description,
  46 + "zh_label": zh_label or label,
  47 + }
  48 +
  49 +
  50 +def _build_taxonomy_shared_instruction(profile_label: str, fields: Tuple[Dict[str, str], ...]) -> str:
  51 + lines = [
  52 + f"Analyze each input product text and fill the columns below using a {profile_label} attribute taxonomy.",
  53 + "",
  54 + "Output columns:",
  55 + ]
  56 + for idx, field in enumerate(fields, start=1):
  57 + lines.append(f"{idx}. {field['label']}: {field['description']}")
  58 + lines.extend(
  59 + [
  60 + "",
  61 + "Rules:",
  62 + "- Keep the same row order and row count as input.",
  63 + "- Leave blank if not applicable, unmentioned, or unsupported.",
  64 + "- Use concise, standardized ecommerce wording.",
  65 + "- If multiple values, separate with commas.",
  66 + "",
  67 + "Input product list:",
  68 + ]
  69 + )
  70 + return "\n".join(lines)
  71 +
  72 +
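The shared instruction built above is a numbered column spec followed by fixed rules. A condensed, self-contained mirror (not the function itself) showing the shape of the generated preamble:

```python
def build_instruction(profile_label, fields):
    # Condensed mirror of _build_taxonomy_shared_instruction above: numbered
    # column specs, then fixed formatting rules, joined into one prompt preamble
    lines = [
        f"Analyze each input product text and fill the columns below using a {profile_label} attribute taxonomy.",
        "",
        "Output columns:",
    ]
    for idx, field in enumerate(fields, start=1):
        lines.append(f"{idx}. {field['label']}: {field['description']}")
    lines += ["", "Rules:", "- Keep the same row order and row count as input."]
    return "\n".join(lines)


preamble = build_instruction(
    "apparel",
    (
        {"label": "Product Type", "description": "concise category label"},
        {"label": "Color", "description": "specific color name"},
    ),
)
print(preamble.splitlines()[3])
# → 1. Product Type: concise category label
```

Because the field index doubles as the column number, the instruction stays in lockstep with the table headers derived from the same field tuple.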
  73 +def _make_taxonomy_profile(
  74 + profile_label: str,
  75 + fields: Tuple[Dict[str, str], ...],
  76 +) -> Dict[str, Any]:
  77 + headers = {
  78 + "en": ["No.", *[field["label"] for field in fields]],
  79 + "zh": ["序号", *[field["zh_label"] for field in fields]],
  80 + }
  81 + return {
  82 + "profile_label": profile_label,
  83 + "fields": fields,
  84 + "shared_instruction": _build_taxonomy_shared_instruction(profile_label, fields),
  85 + "markdown_table_headers": headers,
  86 + }
  87 +
  88 +
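The bilingual header construction in `_make_taxonomy_profile` above prepends a sequence-number column and then maps each field tuple entry to one column per language. A minimal sketch of just that step:

```python
from typing import Dict, List, Tuple


def make_table_headers(fields: Tuple[Dict[str, str], ...]) -> Dict[str, List[str]]:
    # Same shape as _make_taxonomy_profile's markdown_table_headers: a seq-no
    # column, then one column per taxonomy field, in each supported language.
    return {
        "en": ["No.", *[f["label"] for f in fields]],
        "zh": ["序号", *[f["zh_label"] for f in fields]],
    }


fields = (
    {"key": "product_type", "label": "Product Type", "zh_label": "品类"},
    {"key": "color", "label": "Color", "zh_label": "主颜色"},
)
headers = make_table_headers(fields)
assert headers["en"] == ["No.", "Product Type", "Color"]
assert headers["zh"] == ["序号", "品类", "主颜色"]
```

Deriving headers and the parser's `result_fields` from the same tuple is what lets `parse_markdown_table` map cells back to field names positionally.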
  89 +APPAREL_TAXONOMY_FIELDS = (
  90 + _taxonomy_field("product_type", "Product Type", "concise ecommerce apparel category label, not a full marketing title", "品类"),
  91 + _taxonomy_field("target_gender", "Target Gender", "intended gender only if clearly implied", "目标性别"),
  92 + _taxonomy_field("age_group", "Age Group", "only if clearly implied, e.g. adults, kids, teens, toddlers, babies", "年龄段"),
  93 + _taxonomy_field("season", "Season", "season(s) or all-season suitability only if supported", "适用季节"),
  94 + _taxonomy_field("fit", "Fit", "body closeness, e.g. slim, regular, relaxed, oversized, fitted", "版型"),
  95 + _taxonomy_field("silhouette", "Silhouette", "overall garment shape, e.g. straight, A-line, boxy, tapered, bodycon, wide-leg", "廓形"),
  96 + _taxonomy_field("neckline", "Neckline", "neckline type when applicable, e.g. crew neck, V-neck, hooded, collared, square neck", "领型"),
  97 + _taxonomy_field("sleeve_length_type", "Sleeve Length Type", "sleeve length only, e.g. sleeveless, short sleeve, long sleeve, three-quarter sleeve", "袖长类型"),
  98 + _taxonomy_field("sleeve_style", "Sleeve Style", "sleeve design only, e.g. puff sleeve, raglan sleeve, batwing sleeve, bell sleeve", "袖型"),
  99 + _taxonomy_field("strap_type", "Strap Type", "strap design when applicable, e.g. spaghetti strap, wide strap, halter strap, adjustable strap", "肩带设计"),
  100 + _taxonomy_field("rise_waistline", "Rise / Waistline", "waist placement when applicable, e.g. high rise, mid rise, low rise, empire waist", "腰型"),
  101 + _taxonomy_field("leg_shape", "Leg Shape", "for bottoms only, e.g. straight leg, wide leg, flare leg, tapered leg, skinny leg", "裤型"),
  102 + _taxonomy_field("skirt_shape", "Skirt Shape", "for skirts only, e.g. A-line, pleated, pencil, mermaid", "裙型"),
  103 + _taxonomy_field("length_type", "Length Type", "design length only, not size, e.g. cropped, regular, longline, mini, midi, maxi, ankle length, full length", "长度类型"),
  104 + _taxonomy_field("closure_type", "Closure Type", "fastening method when applicable, e.g. zipper, button, drawstring, elastic waist, hook-and-loop", "闭合方式"),
  105 + _taxonomy_field("design_details", "Design Details", "construction or visual details, e.g. ruched, ruffled, pleated, cut-out, layered, distressed, split hem", "设计细节"),
  106 + _taxonomy_field("fabric", "Fabric", "fabric type only, e.g. denim, knit, chiffon, jersey, fleece, cotton twill", "面料"),
  107 + _taxonomy_field("material_composition", "Material Composition", "fiber content or blend only if stated, e.g. cotton, polyester, spandex, linen blend, 95% cotton 5% elastane", "成分"),
  108 + _taxonomy_field("fabric_properties", "Fabric Properties", "inherent fabric traits, e.g. stretch, breathable, lightweight, soft-touch, water-resistant", "面料特性"),
  109 + _taxonomy_field("clothing_features", "Clothing Features", "product features, e.g. lined, reversible, hooded, packable, padded, pocketed", "服装特征"),
  110 + _taxonomy_field("functional_benefits", "Functional Benefits", "wearer benefits, e.g. moisture-wicking, thermal insulation, UV protection, easy care, supportive compression", "功能"),
  111 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  112 + _taxonomy_field("color_family", "Color Family", "normalized broad retail color group, e.g. black, white, blue, green, red, pink, beige, brown, gray", "色系"),
  113 + _taxonomy_field("print_pattern", "Print / Pattern", "surface pattern when applicable, e.g. solid, striped, plaid, floral, graphic, animal print", "印花 / 图案"),
  114 + _taxonomy_field("occasion_end_use", "Occasion / End Use", "likely use occasion only if supported, e.g. office, casual wear, streetwear, lounge, workout, outdoor", "适用场景"),
  115 + _taxonomy_field("style_aesthetic", "Style Aesthetic", "overall style only if supported, e.g. minimalist, streetwear, athleisure, smart casual, romantic, playful", "风格"),
  116 +)
  117 +
  118 +THREE_C_TAXONOMY_FIELDS = (
  119 + _taxonomy_field("product_type", "Product Type", "concise 3C accessory or peripheral category label", "品类"),
  120 + _taxonomy_field("compatible_device", "Compatible Device / Model", "supported device family, series, model, or form factor when clearly stated", "适配设备 / 型号"),
  121 + _taxonomy_field("connectivity", "Connectivity", "connection method such as wired, wireless, Bluetooth, Wi-Fi, NFC, or 2.4G", "连接方式"),
  122 + _taxonomy_field("interface_port_type", "Interface / Port Type", "relevant connector or port, e.g. USB-C, Lightning, HDMI, AUX, RJ45", "接口 / 端口类型"),
  123 + _taxonomy_field("power_charging", "Power Source / Charging", "charging or power mode, e.g. battery powered, fast charging, rechargeable, plug-in", "供电 / 充电方式"),
  124 + _taxonomy_field("key_features", "Key Features", "primary hardware features such as noise cancelling, foldable, magnetic, backlit, waterproof", "关键特征"),
  125 + _taxonomy_field("material_finish", "Material / Finish", "main material or exterior finish when supported", "材质 / 表面处理"),
  126 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  127 + _taxonomy_field("pack_size", "Pack Size", "unit count or bundle size when stated", "包装规格"),
  128 + _taxonomy_field("use_case", "Use Case", "intended usage such as travel, office, gaming, car, charging, streaming", "使用场景"),
  129 +)
  130 +
  131 +BAGS_TAXONOMY_FIELDS = (
  132 + _taxonomy_field("product_type", "Product Type", "concise bag category such as backpack, tote bag, crossbody bag, luggage, or wallet", "品类"),
  133 + _taxonomy_field("target_gender", "Target Gender", "intended gender only if clearly implied", "目标性别"),
  134 + _taxonomy_field("carry_style", "Carry Style", "how the bag is worn or carried, e.g. handheld, shoulder, crossbody, backpack", "携带方式"),
  135 + _taxonomy_field("size_capacity", "Size / Capacity", "size tier or capacity when supported, e.g. mini, large capacity, 20L", "尺寸 / 容量"),
  136 + _taxonomy_field("material", "Material", "main bag material such as leather, nylon, canvas, PU, straw", "材质"),
  137 + _taxonomy_field("closure_type", "Closure Type", "bag closure such as zipper, flap, buckle, drawstring, magnetic snap", "闭合方式"),
  138 + _taxonomy_field("structure_compartments", "Structure / Compartments", "organizational structure such as multi-pocket, laptop sleeve, card slots, expandable", "结构 / 分层"),
  139 + _taxonomy_field("strap_handle_type", "Strap / Handle Type", "strap or handle design such as chain strap, top handle, adjustable strap", "肩带 / 提手类型"),
  140 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  141 + _taxonomy_field("occasion_end_use", "Occasion / End Use", "likely use such as commute, travel, evening, school, casual", "适用场景"),
  142 +)
  143 +
  144 +PET_SUPPLIES_TAXONOMY_FIELDS = (
  145 + _taxonomy_field("product_type", "Product Type", "concise pet supplies category label", "品类"),
  146 + _taxonomy_field("pet_type", "Pet Type", "target pet such as dog, cat, bird, fish, hamster", "宠物类型"),
  147 + _taxonomy_field("breed_size", "Breed Size", "pet size or breed size when stated, e.g. small breed, large dogs", "体型 / 品种大小"),
  148 + _taxonomy_field("life_stage", "Life Stage", "pet age stage when supported, e.g. puppy, kitten, adult, senior", "成长阶段"),
  149 + _taxonomy_field("material_ingredients", "Material / Ingredients", "main material or ingredient composition when supported", "材质 / 成分"),
  150 + _taxonomy_field("flavor_scent", "Flavor / Scent", "flavor or scent when applicable", "口味 / 气味"),
  151 + _taxonomy_field("key_features", "Key Features", "primary attributes such as interactive, leak-proof, orthopedic, washable, elevated", "关键特征"),
  152 + _taxonomy_field("functional_benefits", "Functional Benefits", "benefits such as dental care, calming, digestion support, joint support", "功能"),
  153 + _taxonomy_field("size_capacity", "Size / Capacity", "size, count, or net content when stated", "尺寸 / 容量"),
  154 + _taxonomy_field("use_scenario", "Use Scenario", "usage such as feeding, training, grooming, travel, indoor play", "使用场景"),
  155 +)
  156 +
  157 +ELECTRONICS_TAXONOMY_FIELDS = (
  158 + _taxonomy_field("product_type", "Product Type", "concise electronics device or component category label", "品类"),
  159 + _taxonomy_field("device_category", "Device Category / Compatibility", "supported platform, component class, or compatible device family when stated", "设备类别 / 兼容性"),
  160 + _taxonomy_field("power_voltage", "Power / Voltage", "power, voltage, wattage, or battery spec when supported", "功率 / 电压"),
  161 + _taxonomy_field("connectivity", "Connectivity", "connection method such as wired, Bluetooth, Wi-Fi, RF, or smart app control", "连接方式"),
  162 + _taxonomy_field("interface_port_type", "Interface / Port Type", "relevant port or interface such as USB-C, AC plug type, HDMI, SATA", "接口 / 端口类型"),
  163 + _taxonomy_field("capacity_storage", "Capacity / Storage", "capacity or storage spec such as 256GB, 2TB, 5000mAh", "容量 / 存储"),
  164 + _taxonomy_field("key_features", "Key Features", "main product features such as touch control, HD display, noise reduction, smart control", "关键特征"),
  165 + _taxonomy_field("material_finish", "Material / Finish", "main housing material or finish when supported", "材质 / 表面处理"),
  166 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  167 + _taxonomy_field("use_case", "Use Case", "intended use such as home entertainment, office, charging, security, repair", "使用场景"),
  168 +)
  169 +
  170 +OUTDOOR_TAXONOMY_FIELDS = (
  171 + _taxonomy_field("product_type", "Product Type", "concise outdoor gear category label", "品类"),
  172 + _taxonomy_field("activity_type", "Activity Type", "primary outdoor activity such as camping, hiking, fishing, climbing, travel", "活动类型"),
  173 + _taxonomy_field("season_weather", "Season / Weather", "season or weather suitability when supported", "适用季节 / 天气"),
  174 + _taxonomy_field("material", "Material", "main material such as aluminum, ripstop nylon, stainless steel, EVA", "材质"),
  175 + _taxonomy_field("capacity_size", "Capacity / Size", "size, length, or capacity when stated", "容量 / 尺寸"),
  176 + _taxonomy_field("protection_resistance", "Protection / Resistance", "resistance or protection such as waterproof, UV resistant, windproof", "防护 / 耐受性"),
  177 + _taxonomy_field("key_features", "Key Features", "primary gear attributes such as foldable, lightweight, insulated, non-slip", "关键特征"),
  178 + _taxonomy_field("portability_packability", "Portability / Packability", "carry or storage trait such as collapsible, compact, ultralight, packable", "便携 / 收纳性"),
  179 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  180 + _taxonomy_field("use_scenario", "Use Scenario", "likely use setting such as campsite, trail, survival kit, beach, picnic", "使用场景"),
  181 +)
  182 +
  183 +HOME_APPLIANCES_TAXONOMY_FIELDS = (
  184 + _taxonomy_field("product_type", "Product Type", "concise home appliance category label", "品类"),
  185 + _taxonomy_field("appliance_category", "Appliance Category", "functional class such as kitchen appliance, cleaning appliance, personal care appliance", "家电类别"),
  186 + _taxonomy_field("power_voltage", "Power / Voltage", "wattage, voltage, plug type, or power supply when supported", "功率 / 电压"),
  187 + _taxonomy_field("capacity_coverage", "Capacity / Coverage", "capacity or coverage metric such as 1.5L, 20L, 40sqm", "容量 / 覆盖范围"),
  188 + _taxonomy_field("control_method", "Control Method", "operation method such as touch, knob, remote, app control", "控制方式"),
  189 + _taxonomy_field("installation_type", "Installation Type", "setup style such as countertop, handheld, portable, wall-mounted, built-in", "安装方式"),
  190 + _taxonomy_field("key_features", "Key Features", "main product features such as timer, steam, HEPA filter, self-cleaning", "关键特征"),
  191 + _taxonomy_field("material_finish", "Material / Finish", "main material or exterior finish when supported", "材质 / 表面处理"),
  192 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  193 + _taxonomy_field("use_scenario", "Use Scenario", "intended use such as cooking, cleaning, grooming, cooling, air treatment", "使用场景"),
  194 +)
  195 +
  196 +HOME_LIVING_TAXONOMY_FIELDS = (
  197 + _taxonomy_field("product_type", "Product Type", "concise home and living category label", "品类"),
  198 + _taxonomy_field("room_placement", "Room / Placement", "intended room or placement such as bedroom, kitchen, bathroom, desktop", "适用空间 / 摆放位置"),
  199 + _taxonomy_field("material", "Material", "main material such as wood, ceramic, cotton, glass, metal", "材质"),
  200 + _taxonomy_field("style", "Style", "home style such as modern, farmhouse, minimalist, boho, Nordic", "风格"),
  201 + _taxonomy_field("size_dimensions", "Size / Dimensions", "size or dimensions when stated", "尺寸 / 规格"),
  202 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  203 + _taxonomy_field("pattern_finish", "Pattern / Finish", "surface pattern or finish such as solid, marble, matte, ribbed", "图案 / 表面处理"),
  204 + _taxonomy_field("key_features", "Key Features", "main product features such as stackable, washable, blackout, space-saving", "关键特征"),
  205 + _taxonomy_field("assembly_installation", "Assembly / Installation", "assembly or installation trait when supported", "组装 / 安装"),
  206 + _taxonomy_field("use_scenario", "Use Scenario", "intended use such as storage, dining, decor, sleep, organization", "使用场景"),
  207 +)
  208 +
  209 +WIGS_TAXONOMY_FIELDS = (
  210 + _taxonomy_field("product_type", "Product Type", "concise wig or hairpiece category label", "品类"),
  211 + _taxonomy_field("hair_material", "Hair Material", "hair material such as human hair, synthetic fiber, heat-resistant fiber", "发丝材质"),
  212 + _taxonomy_field("hair_texture", "Hair Texture", "texture or curl pattern such as straight, body wave, curly, kinky", "发质纹理"),
  213 + _taxonomy_field("hair_length", "Hair Length", "hair length when stated", "发长"),
  214 + _taxonomy_field("hair_color", "Hair Color", "specific hair color or blend when available", "发色"),
  215 + _taxonomy_field("cap_construction", "Cap Construction", "cap type such as full lace, lace front, glueless, U part", "帽网结构"),
  216 + _taxonomy_field("lace_area_part_type", "Lace Area / Part Type", "lace size or part style such as 13x4 lace, middle part, T part", "蕾丝面积 / 分缝类型"),
  217 + _taxonomy_field("density_volume", "Density / Volume", "hair density or fullness when supported", "密度 / 发量"),
  218 + _taxonomy_field("style_bang_type", "Style / Bang Type", "style cue such as bob, pixie, layered, with bangs", "款式 / 刘海类型"),
  219 + _taxonomy_field("occasion_end_use", "Occasion / End Use", "intended use such as daily wear, cosplay, protective style, party", "适用场景"),
  220 +)
  221 +
  222 +BEAUTY_TAXONOMY_FIELDS = (
  223 + _taxonomy_field("product_type", "Product Type", "concise beauty or cosmetics category label", "品类"),
  224 + _taxonomy_field("target_area", "Target Area", "target area such as face, lips, eyes, nails, hair, body", "适用部位"),
  225 + _taxonomy_field("skin_hair_type", "Skin Type / Hair Type", "suitable skin or hair type when supported", "肤质 / 发质"),
  226 + _taxonomy_field("finish_effect", "Finish / Effect", "cosmetic finish or effect such as matte, dewy, volumizing, brightening", "妆效 / 效果"),
  227 + _taxonomy_field("key_ingredients", "Key Ingredients", "notable ingredients when stated", "关键成分"),
  228 + _taxonomy_field("shade_color", "Shade / Color", "specific shade or color when available", "色号 / 颜色"),
  229 + _taxonomy_field("scent", "Scent", "fragrance or scent only when supported", "香味"),
  230 + _taxonomy_field("formulation", "Formulation", "product form such as cream, serum, powder, gel, stick", "剂型 / 形态"),
  231 + _taxonomy_field("functional_benefits", "Functional Benefits", "benefits such as hydration, anti-aging, long-wear, repair, sun protection", "功能"),
  232 + _taxonomy_field("use_scenario", "Use Scenario", "intended use such as daily routine, salon, travel, evening makeup", "使用场景"),
  233 +)
  234 +
  235 +ACCESSORIES_TAXONOMY_FIELDS = (
  236 + _taxonomy_field("product_type", "Product Type", "concise accessory category label such as necklace, watch, belt, hat, or sunglasses", "品类"),
  237 + _taxonomy_field("target_gender", "Target Gender", "intended gender only if clearly implied", "目标性别"),
  238 + _taxonomy_field("material", "Material", "main material such as alloy, leather, stainless steel, acetate, fabric", "材质"),
  239 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  240 + _taxonomy_field("pattern_finish", "Pattern / Finish", "surface treatment or style finish such as polished, textured, braided, rhinestone", "图案 / 表面处理"),
  241 + _taxonomy_field("closure_fastening", "Closure / Fastening", "fastening method when applicable", "闭合 / 固定方式"),
  242 + _taxonomy_field("size_fit", "Size / Fit", "size or fit information such as adjustable, one size, 42mm", "尺寸 / 适配"),
  243 + _taxonomy_field("style", "Style", "style cue such as minimalist, vintage, statement, sporty", "风格"),
  244 + _taxonomy_field("occasion_end_use", "Occasion / End Use", "likely use such as daily wear, formal, party, travel, sun protection", "适用场景"),
  245 + _taxonomy_field("set_pack_size", "Set / Pack Size", "set count or pack size when stated", "套装 / 规格"),
  246 +)
  247 +
  248 +TOYS_TAXONOMY_FIELDS = (
  249 + _taxonomy_field("product_type", "Product Type", "concise toy category label", "品类"),
  250 + _taxonomy_field("age_group", "Age Group", "intended age group when clearly implied", "年龄段"),
  251 + _taxonomy_field("character_theme", "Character / Theme", "licensed character, theme, or play theme when supported", "角色 / 主题"),
  252 + _taxonomy_field("material", "Material", "main toy material such as plush, plastic, wood, silicone", "材质"),
  253 + _taxonomy_field("power_source", "Power Source", "battery, rechargeable, wind-up, or non-powered when supported", "供电方式"),
  254 + _taxonomy_field("interactive_features", "Interactive Features", "interactive functions such as sound, lights, remote control, motion", "互动功能"),
  255 + _taxonomy_field("educational_play_value", "Educational / Play Value", "play value such as STEM, pretend play, sensory, puzzle solving", "教育 / 可玩性"),
  256 + _taxonomy_field("piece_count_size", "Piece Count / Size", "piece count or size when stated", "件数 / 尺寸"),
  257 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  258 + _taxonomy_field("use_scenario", "Use Scenario", "intended use such as indoor play, bath time, party favor, outdoor play", "使用场景"),
  259 +)
  260 +
  261 +SHOES_TAXONOMY_FIELDS = (
  262 + _taxonomy_field("product_type", "Product Type", "concise footwear category label", "品类"),
  263 + _taxonomy_field("target_gender", "Target Gender", "intended gender only if clearly implied", "目标性别"),
  264 + _taxonomy_field("age_group", "Age Group", "only if clearly implied", "年龄段"),
  265 + _taxonomy_field("closure_type", "Closure Type", "fastening method such as lace-up, slip-on, buckle, hook-and-loop", "闭合方式"),
  266 + _taxonomy_field("toe_shape", "Toe Shape", "toe shape when applicable, e.g. round toe, pointed toe, open toe", "鞋头形状"),
  267 + _taxonomy_field("heel_sole_type", "Heel Height / Sole Type", "heel or sole profile such as flat, block heel, wedge, platform, thick sole", "跟高 / 鞋底类型"),
  268 + _taxonomy_field("upper_material", "Upper Material", "main upper material such as leather, knit, canvas, mesh", "鞋面材质"),
  269 + _taxonomy_field("lining_insole_material", "Lining / Insole Material", "lining or insole material when supported", "里料 / 鞋垫材质"),
  270 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  271 + _taxonomy_field("occasion_end_use", "Occasion / End Use", "likely use such as running, casual, office, hiking, formal", "适用场景"),
  272 +)
  273 +
  274 +SPORTS_TAXONOMY_FIELDS = (
  275 + _taxonomy_field("product_type", "Product Type", "concise sports product category label", "品类"),
  276 + _taxonomy_field("sport_activity", "Sport / Activity", "primary sport or activity such as fitness, yoga, basketball, cycling, swimming", "运动 / 活动"),
  277 + _taxonomy_field("skill_level", "Skill Level", "target user level when supported, e.g. beginner, training, professional", "适用水平"),
  278 + _taxonomy_field("material", "Material", "main material such as EVA, carbon fiber, neoprene, latex", "材质"),
  279 + _taxonomy_field("size_capacity", "Size / Capacity", "size, weight, resistance level, or capacity when stated", "尺寸 / 容量"),
  280 + _taxonomy_field("protection_support", "Protection / Support", "support or protection function such as ankle support, shock absorption, impact protection", "防护 / 支撑"),
  281 + _taxonomy_field("key_features", "Key Features", "main features such as anti-slip, adjustable, foldable, quick-dry", "关键特征"),
  282 + _taxonomy_field("power_source", "Power Source", "battery, electric, or non-powered when applicable", "供电方式"),
  283 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  284 + _taxonomy_field("use_scenario", "Use Scenario", "intended use such as gym, home workout, field training, competition", "使用场景"),
  285 +)
  286 +
  287 +OTHERS_TAXONOMY_FIELDS = (
  288 + _taxonomy_field("product_type", "Product Type", "concise product category label, not a full marketing title", "品类"),
  289 + _taxonomy_field("product_category", "Product Category", "broader retail grouping when the specific product type is narrow", "商品类别"),
  290 + _taxonomy_field("target_user", "Target User", "intended user, audience, or recipient when clearly implied", "适用人群"),
  291 + _taxonomy_field("material_ingredients", "Material / Ingredients", "main material or ingredients when supported", "材质 / 成分"),
  292 + _taxonomy_field("key_features", "Key Features", "primary product attributes or standout features", "关键特征"),
  293 + _taxonomy_field("functional_benefits", "Functional Benefits", "practical benefits or performance advantages when supported", "功能"),
  294 + _taxonomy_field("size_capacity", "Size / Capacity", "size, count, weight, or capacity when stated", "尺寸 / 容量"),
  295 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  296 + _taxonomy_field("style_theme", "Style / Theme", "overall style, design theme, or visual direction when supported", "风格 / 主题"),
  297 + _taxonomy_field("use_scenario", "Use Scenario", "likely use occasion or application setting when supported", "使用场景"),
  298 +)
  299 +
  300 +CATEGORY_TAXONOMY_PROFILES: Dict[str, Dict[str, Any]] = {
  301 + "apparel": _make_taxonomy_profile(
  302 + "apparel",
  303 + APPAREL_TAXONOMY_FIELDS,
  304 + ),
  305 + "3c": _make_taxonomy_profile(
  306 + "3C",
  307 + THREE_C_TAXONOMY_FIELDS,
  308 + ),
  309 + "bags": _make_taxonomy_profile(
  310 + "bags",
  311 + BAGS_TAXONOMY_FIELDS,
  312 + ),
  313 + "pet_supplies": _make_taxonomy_profile(
  314 + "pet supplies",
  315 + PET_SUPPLIES_TAXONOMY_FIELDS,
  316 + ),
  317 + "electronics": _make_taxonomy_profile(
  318 + "electronics",
  319 + ELECTRONICS_TAXONOMY_FIELDS,
  320 + ),
  321 + "outdoor": _make_taxonomy_profile(
  322 + "outdoor products",
  323 + OUTDOOR_TAXONOMY_FIELDS,
  324 + ),
  325 + "home_appliances": _make_taxonomy_profile(
  326 + "home appliances",
  327 + HOME_APPLIANCES_TAXONOMY_FIELDS,
  328 + ),
  329 + "home_living": _make_taxonomy_profile(
  330 + "home and living",
  331 + HOME_LIVING_TAXONOMY_FIELDS,
  332 + ),
  333 + "wigs": _make_taxonomy_profile(
  334 + "wigs",
  335 + WIGS_TAXONOMY_FIELDS,
  336 + ),
  337 + "beauty": _make_taxonomy_profile(
  338 + "beauty and cosmetics",
  339 + BEAUTY_TAXONOMY_FIELDS,
  340 + ),
  341 + "accessories": _make_taxonomy_profile(
  342 + "accessories",
  343 + ACCESSORIES_TAXONOMY_FIELDS,
  344 + ),
  345 + "toys": _make_taxonomy_profile(
  346 + "toys",
  347 + TOYS_TAXONOMY_FIELDS,
  348 + ),
  349 + "shoes": _make_taxonomy_profile(
  350 + "shoes",
  351 + SHOES_TAXONOMY_FIELDS,
  352 + ),
  353 + "sports": _make_taxonomy_profile(
  354 + "sports products",
  355 + SPORTS_TAXONOMY_FIELDS,
  356 + ),
  357 + "others": _make_taxonomy_profile(
  358 + "general merchandise",
  359 + OTHERS_TAXONOMY_FIELDS,
  360 + ),
  361 +}
  362 +
  363 +TAXONOMY_SHARED_ANALYSIS_INSTRUCTION = CATEGORY_TAXONOMY_PROFILES["apparel"]["shared_instruction"]
  364 +TAXONOMY_MARKDOWN_TABLE_HEADERS_EN = CATEGORY_TAXONOMY_PROFILES["apparel"]["markdown_table_headers"]["en"]
  365 +TAXONOMY_LANGUAGE_MARKDOWN_TABLE_HEADERS: Dict[str, Dict[str, Any]] = CATEGORY_TAXONOMY_PROFILES["apparel"]["markdown_table_headers"]
  366 +
36 367 LANGUAGE_MARKDOWN_TABLE_HEADERS: Dict[str, Dict[str, Any]] = {
37 368 "en": [
38 369 "No.",
... ...
indexer/product_enrich模块说明.md 0 → 100644
... ... @@ -0,0 +1,173 @@
  1 +# Content Enrichment Module Guide
  2 +
  3 +This document describes the responsibilities, entry points, and output structure of the product content enrichment module, along with the current design constraints of the taxonomy profiles.
  4 +
  5 +## 1. Module Goals
  6 +
  7 +The content enrichment module calls an LLM on product text to generate the following index fields:
  8 +
  9 +- `qanchors`
  10 +- `enriched_tags`
  11 +- `enriched_attributes`
  12 +- `enriched_taxonomy_attributes`
  13 +
  14 +Design principles:
  15 +
  16 +- Single responsibility: content understanding and structured output only; no CSV reading or writing
  17 +- Output aligned with the ES mapping: the returned structure can be written directly into `search_products`
  18 +- Configuration-driven extension: taxonomy profiles are extended through data configuration, not scattered conditional branches
  19 +- Lean code: targets normal usage only; no patch logic piled up to tolerate unreasonable call patterns
  20 +
  21 +## 2. Key Files
  22 +
  23 +- [product_enrich.py](/data/saas-search/indexer/product_enrich.py)
  24 + Runtime logic: batching, caching, prompt assembly, LLM calls, markdown parsing, and output shaping
  25 +- [product_enrich_prompts.py](/data/saas-search/indexer/product_enrich_prompts.py)
  26 + Prompt templates and taxonomy profile configuration
  27 +- [document_transformer.py](/data/saas-search/indexer/document_transformer.py)
  28 + Calls the enrichment module from the internal index-build pipeline and writes the results back into the ES doc
  29 +- [taxonomy.md](/data/saas-search/indexer/taxonomy.md)
  30 + Taxonomy design notes and field inventory
  31 +
  32 +## 3. Public Entry Points
  33 +
  34 +### 3.1 Python entry point
  35 +
  36 +Core entry point:
  37 +
  38 +```python
  39 +build_index_content_fields(
  40 + items,
  41 + tenant_id=None,
  42 + enrichment_scopes=None,
  43 + category_taxonomy_profile=None,
  44 +)
  45 +```
  46 +
  47 +Minimum required input:
  48 +
  49 +- `id` or `spu_id`
  50 +- `title`
  51 +
  52 +Optional input:
  53 +
  54 +- `brief`
  55 +- `description`
  56 +- `image_url`
  57 +
  58 +Key parameters:
  59 +
  60 +- `enrichment_scopes`
  61 + Either or both of `generic` and `category_taxonomy`
  62 +- `category_taxonomy_profile`
  63 + Taxonomy profile; defaults to `apparel`
  64 +
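The input contract above can be checked before any LLM call is made. A minimal sketch, assuming a hypothetical `validate_enrich_item` helper (not part of the module's public API):

```python
def validate_enrich_item(item):
    """Return a list of problems for one input item.

    Mirrors the documented minimum: an `id` or `spu_id`, plus a non-empty `title`.
    """
    problems = []
    if not (item.get("id") or item.get("spu_id")):
        problems.append("missing id/spu_id")
    if not str(item.get("title", "")).strip():
        problems.append("missing title")
    return problems


print(validate_enrich_item({"id": "223167", "title": "短袖T恤 纯棉"}))  # []
print(validate_enrich_item({"spu_id": "9", "brief": "no title"}))      # ['missing title']
```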
  65 +### 3.2 HTTP entry point
  66 +
  67 +API route:
  68 +
  69 +- `POST /indexer/enrich-content`
  70 +
  71 +Related documentation:
  72 +
  73 +- [搜索API对接指南-05-索引接口(Indexer)](/data/saas-search/docs/搜索API对接指南-05-索引接口(Indexer).md)
  74 +- [搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation)](/data/saas-search/docs/搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md)
  75 +
  76 +## 4. Output Structure
  77 +
  78 +The returned result is aligned with the ES mapping:
  79 +
  80 +```json
  81 +{
  82 + "id": "223167",
  83 + "qanchors": {
  84 + "zh": ["短袖T恤", "纯棉"],
  85 + "en": ["t-shirt", "cotton"]
  86 + },
  87 + "enriched_tags": {
  88 + "zh": ["短袖", "纯棉"],
  89 + "en": ["short sleeve", "cotton"]
  90 + },
  91 + "enriched_attributes": [
  92 + {
  93 + "name": "enriched_tags",
  94 + "value": {
  95 + "zh": ["短袖", "纯棉"],
  96 + "en": ["short sleeve", "cotton"]
  97 + }
  98 + }
  99 + ],
  100 + "enriched_taxonomy_attributes": [
  101 + {
  102 + "name": "Product Type",
  103 + "value": {
  104 + "zh": ["T恤"],
  105 + "en": ["t-shirt"]
  106 + }
  107 + }
  108 + ]
  109 +}
  110 +```
  111 +
  112 +Notes:
  113 +
  114 +- The `generic` part always outputs the core index languages `zh` and `en`
  115 +- The `taxonomy` part likewise outputs `zh` and `en` uniformly
  116 +
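The `enriched_attributes` entry in the example is just the `enriched_tags` language map wrapped into the nested name/value shape. A sketch of that wrapping, assuming a hypothetical `tags_to_attribute_entries` helper:

```python
def tags_to_attribute_entries(enriched_tags):
    """Wrap a language -> values map into the nested attribute shape above.

    Languages with empty value lists are dropped so nothing noisy is indexed.
    """
    value = {lang: vals for lang, vals in enriched_tags.items() if vals}
    return [{"name": "enriched_tags", "value": value}] if value else []


entries = tags_to_attribute_entries({"zh": ["短袖", "纯棉"], "en": ["short sleeve", "cotton"]})
print(entries[0]["name"])  # enriched_tags
```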
  117 +## 5. Taxonomy profile
  118 +
  119 +Currently supported:
  120 +
  121 +- `apparel`
  122 +- `3c`
  123 +- `bags`
  124 +- `pet_supplies`
  125 +- `electronics`
  126 +- `outdoor`
  127 +- `home_appliances`
  128 +- `home_living`
  129 +- `wigs`
  130 +- `beauty`
  131 +- `accessories`
  132 +- `toys`
  133 +- `shoes`
  134 +- `sports`
  135 +- `others`
  136 +
  137 +Shared constraints:
  138 +
  139 +- Every profile returns `zh` + `en`
  140 +- A profile only determines the taxonomy field set; it no longer determines the output languages
  141 +- Every profile configures both Chinese and English field names, so the prompt/header structure stays consistent
  142 +
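Because the profile set is a closed list, selection can stay explicit. A sketch with a hypothetical `resolve_profile` helper; the fallback to `others` here is illustrative only (the internal pipeline currently passes a fixed `apparel`):

```python
# The fifteen profile slugs listed above.
SUPPORTED_PROFILES = {
    "apparel", "3c", "bags", "pet_supplies", "electronics", "outdoor",
    "home_appliances", "home_living", "wigs", "beauty", "accessories",
    "toys", "shoes", "sports", "others",
}


def resolve_profile(slug=None):
    """Normalize a requested profile slug; unknown slugs fall back to "others"."""
    normalized = (slug or "").strip().lower()
    return normalized if normalized in SUPPORTED_PROFILES else "others"


print(resolve_profile("Apparel"))   # apparel
print(resolve_profile("unknown"))   # others
```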
  143 +## 6. Current Constraint in the Internal Index Pipeline
  144 +
  145 +In the internal ES document build pipeline, `document_transformer` currently passes a fixed taxonomy profile when calling content enrichment:
  146 +
  147 +```python
  148 +category_taxonomy_profile="apparel"
  149 +```
  150 +
  151 +This is a deliberate temporary strategy: explicit, controllable, and cleaner in code.
  152 +
  153 +A TODO is already recorded in the code:
  154 +
  155 +- Later, read the tenant's actual industry from the database
  156 +- Then replace the fixed `apparel` with that industry
  157 +
  158 +For now there is no implicit "guess the profile from product category text" logic, which would add redundant code and unnecessary uncertainty.
  159 +
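Once tenant industries can be read from the database, the TODO above reduces to a small lookup that keeps the current explicit default. A minimal sketch; `TENANT_INDUSTRY` and `taxonomy_profile_for_tenant` are hypothetical names standing in for the future DB read:

```python
# Hypothetical stand-in for the future tenant -> industry DB lookup.
TENANT_INDUSTRY = {
    "tenant_a": "shoes",
    "tenant_b": "beauty",
}


def taxonomy_profile_for_tenant(tenant_id=None):
    """Pick a taxonomy profile per tenant.

    Unknown or missing tenants keep today's explicit "apparel" default,
    so behavior stays identical to the current fixed configuration.
    """
    return TENANT_INDUSTRY.get(tenant_id or "", "apparel")


print(taxonomy_profile_for_tenant("tenant_a"))  # shoes
print(taxonomy_profile_for_tenant(None))        # apparel
```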
  160 +## 7. Caching and Batching
  161 +
  162 +The cache key is derived from all of the following:
  163 +
  164 +- `analysis_kind`
  165 +- `target_lang`
  166 +- a prompt/schema version fingerprint
  167 +- the actual prompt input text
  168 +
  169 +Batching rules:
  170 +
  171 +- A single LLM call covers at most 20 items
  172 +- Callers may pass larger batches; the module splits them internally
  173 +- Uncached batches may run concurrently
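The caching and batching rules above can be sketched as follows; `cache_key` and `split_batches` are illustrative names, not the module's actual functions:

```python
import hashlib

MAX_ITEMS_PER_CALL = 20  # documented per-call ceiling


def cache_key(analysis_kind, target_lang, version_fingerprint, prompt_text):
    """Combine the four documented inputs into one stable cache key."""
    payload = "\x1f".join([analysis_kind, target_lang, version_fingerprint, prompt_text])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def split_batches(items, size=MAX_ITEMS_PER_CALL):
    """Split an oversized caller batch into chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


print(len(split_batches(list(range(45)))))  # 3 chunks: 20 + 20 + 5
```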
... ...
indexer/taxonomy.md 0 → 100644
... ... @@ -0,0 +1,196 @@
  1 +
  2 +# Cross-Border E-commerce Core Categories 大类
  3 +
  4 +## 1. 3C
  5 +Phone accessories, computer peripherals, smart wearables, audio & video, smart home, gaming gear. 手机配件、电脑周边、智能穿戴、影音娱乐、智能家居、游戏设备。
  6 +
  7 +## 2. Bags 包
  8 +Handbags, backpacks, wallets, luggage, crossbody bags, tote bags. 手提包、双肩包、钱包、行李箱、斜挎包、托特包。
  9 +
  10 +## 3. Pet Supplies 宠物用品
  11 +Pet food, pet toys, pet care products, pet grooming, pet clothing, smart pet devices. 宠物食品、宠物玩具、宠物护理用品、宠物美容、宠物服装、智能宠物设备。
  12 +
  13 +## 4. Electronics 电子产品
  14 +Consumer electronics, home appliances, digital devices, cables & chargers, batteries, electronic components. 消费电子产品、家用电器、数码设备、线材充电器、电池、电子元器件。
  15 +
  16 +## 5. Clothing 服装
  17 +Women's wear, men's wear, kid's wear, underwear, outerwear, activewear. 女装、男装、童装、内衣、外套、运动服装。
  18 +
  19 +## 6. Outdoor 户外用品
  20 +Camping gear, hiking equipment, fishing supplies, outdoor clothing, travel accessories, survival tools. 露营装备、徒步用品、渔具、户外服装、旅行配件、求生工具。
  21 +
  22 +## 7. Home Appliances 家电/电器
  23 +Kitchen appliances, cleaning appliances, personal care appliances, heating & cooling, smart home devices. 厨房电器、清洁电器、个护电器、冷暖设备、智能家居设备。
  24 +
  25 +## 8. Home & Living 家居
  26 +Furniture, home textiles, lighting, kitchenware, storage, home decor. 家具、家纺、灯具、厨具、收纳、家居装饰。
  27 +
  28 +## 9. Wigs 假发
  29 +
  30 +## 10. Beauty & Cosmetics 美容美妆
  31 +Skincare, makeup, nail care, beauty tools, hair care, fragrances. 护肤品、彩妆、美甲、美容工具、护发、香水。
  32 +
  33 +## 11. Accessories 配饰
  34 +Jewelry, watches, belts, scarves, hats, sunglasses, hair accessories. 珠宝、手表、腰带、围巾、帽子、太阳镜、发饰。
  35 +
  36 +## 12. Toys 玩具
  37 +Educational toys, plush toys, action figures, puzzles, outdoor toys, DIY toys. 益智玩具、毛绒玩具、可动人偶、拼图、户外玩具、DIY玩具。
  38 +
  39 +## 13. Shoes 鞋子
  40 +Sneakers, boots, sandals, heels, flats, sports shoes. 运动鞋、靴子、凉鞋、高跟鞋、平底鞋、球鞋。
  41 +
  42 +## 14. Sports 运动产品
  43 +Fitness equipment, sports gear, team sports, racquet sports, water sports, cycling. 健身器材、运动装备、团队运动、球拍运动、水上运动、骑行。
  44 +
  45 +## 15. Others 其他
  46 +
  47 +# Taxonomy for Each Category
  48 +## 1. Clothing & Apparel 服装
  49 +
  50 +### A. Product Classification
  51 +
  52 +| Tier | Chinese Column Name | English Column Name |
  53 +| ------------------------- | ---- | ------------------- |
  54 +| A. Product Classification | 品类 | Product Type |
  55 +| A. Product Classification | 目标性别 | Target Gender |
  56 +| A. Product Classification | 年龄段 | Age Group |
  57 +| A. Product Classification | 适用季节 | Season |
  58 +
  59 +### B. Garment Design
  60 +
  61 +| Tier | Chinese Column Name | English Column Name |
  62 +| ----------------- | ---- | ------------------- |
  63 +| B. Garment Design | 版型 | Fit |
  64 +| B. Garment Design | 廓形 | Silhouette |
  65 +| B. Garment Design | 领型 | Neckline |
  66 +| B. Garment Design | 袖型 | Sleeve Style |
  67 +| B. Garment Design | 肩带设计 | Strap Type |
  68 +| B. Garment Design | 腰型 | Rise / Waistline |
  69 +| B. Garment Design | 裤型 | Leg Shape |
  70 +| B. Garment Design | 裙型 | Skirt Shape |
  71 +| B. Garment Design | 长度 | Length Type |
  72 +| B. Garment Design | 闭合方式 | Closure Type |
  73 +| B. Garment Design | 设计细节 | Design Details |
  74 +
  75 +### C. Material & Performance
  76 +
  77 +| Tier | Chinese Column Name | English Column Name |
  78 +| ------------------------- | ----------- | -------------------- |
  79 +| C. Material & Performance | 面料 | Fabric |
  80 +| C. Material & Performance | 成分 | Material Composition |
  81 +| C. Material & Performance | 面料特性 | Fabric Properties |
  82 +| C. Material & Performance | 服装特征 / 功能细节 | Clothing Features |
  83 +| C. Material & Performance | 功能 | Functional Benefits |
  84 +
  85 +### D. Merchandising Attributes
  86 +
  87 +| Tier | Chinese Column Name | English Column Name |
  88 +| --------------------------- | ------- | ------------------- |
  89 +| D. Merchandising Attributes | 主颜色 | Color |
  90 +| D. Merchandising Attributes | 色系 | Color Family |
  91 +| D. Merchandising Attributes | 印花 / 图案 | Print / Pattern |
  92 +| D. Merchandising Attributes | 适用场景 | Occasion / End Use |
  93 +| D. Merchandising Attributes | 风格 | Style Aesthetic |
  94 +
  95 +
  96 +
  97 +The following columns are generated from this taxonomy to populate
  98 +`enriched_taxonomy_attributes`:
  99 +
  100 +```text
  101 +Product Type
  102 +Target Gender
  103 +Age Group
  104 +Season
  105 +Fit
  106 +Silhouette
  107 +Neckline
  108 +Sleeve Length Type
  109 +Sleeve Style
  110 +Strap Type
  111 +Rise / Waistline
  112 +Leg Shape
  113 +Skirt Shape
  114 +Length Type
  115 +Closure Type
  116 +Design Details
  117 +Fabric
  118 +Material Composition
  119 +Fabric Properties
  120 +Clothing Features
  121 +Functional Benefits
  122 +Color
  123 +Color Family
  124 +Print / Pattern
  125 +Occasion / End Use
  126 +Style Aesthetic
  127 +```
  128 +
  129 +Prompt:
  130 +
  131 +```python
  132 +SHARED_ANALYSIS_INSTRUCTION = """
  133 +Analyze each input product text and fill the columns below using an apparel attribute taxonomy.
  134 +
  135 +Output columns:
  136 +1. Product Type: concise ecommerce apparel category label, not a full marketing title
  137 +2. Target Gender: intended gender only if clearly implied
  138 +3. Age Group: only if clearly implied, e.g. adults, kids, teens, toddlers, babies
  139 +4. Season: season(s) or all-season suitability only if supported
  140 +5. Fit: body closeness, e.g. slim, regular, relaxed, oversized, fitted
  141 +6. Silhouette: overall garment shape, e.g. straight, A-line, boxy, tapered, bodycon, wide-leg
  142 +7. Neckline: neckline type when applicable, e.g. crew neck, V-neck, hooded, collared, square neck
  143 +8. Sleeve Length Type: sleeve length only, e.g. sleeveless, short sleeve, long sleeve, three-quarter sleeve
  144 +9. Sleeve Style: sleeve design only, e.g. puff sleeve, raglan sleeve, batwing sleeve, bell sleeve
  145 +10. Strap Type: strap design when applicable, e.g. spaghetti strap, wide strap, halter strap, adjustable strap
  146 +11. Rise / Waistline: waist placement when applicable, e.g. high rise, mid rise, low rise, empire waist
  147 +12. Leg Shape: for bottoms only, e.g. straight leg, wide leg, flare leg, tapered leg, skinny leg
  148 +13. Skirt Shape: for skirts only, e.g. A-line, pleated, pencil, mermaid
  149 +14. Length Type: design length only, not size, e.g. cropped, regular, longline, mini, midi, maxi, ankle length, full length
  150 +15. Closure Type: fastening method when applicable, e.g. zipper, button, drawstring, elastic waist, hook-and-loop
  151 +16. Design Details: construction or visual details, e.g. ruched, ruffled, pleated, cut-out, layered, distressed, split hem
  152 +17. Fabric: fabric type only, e.g. denim, knit, chiffon, jersey, fleece, cotton twill
  153 +18. Material Composition: fiber content or blend only if stated, e.g. cotton, polyester, spandex, linen blend, 95% cotton 5% elastane
  154 +19. Fabric Properties: inherent fabric traits, e.g. stretch, breathable, lightweight, soft-touch, water-resistant
  155 +20. Clothing Features: product features, e.g. lined, reversible, hooded, packable, padded, pocketed
  156 +21. Functional Benefits: wearer benefits, e.g. moisture-wicking, thermal insulation, UV protection, easy care, supportive compression
  157 +22. Color: specific color name when available
  158 +23. Color Family: normalized broad retail color group, e.g. black, white, blue, green, red, pink, beige, brown, gray
  159 +24. Print / Pattern: surface pattern when applicable, e.g. solid, striped, plaid, floral, graphic, animal print
  160 +25. Occasion / End Use: likely use occasion only if supported, e.g. office, casual wear, streetwear, lounge, workout, outdoor
  161 +26. Style Aesthetic: overall style only if supported, e.g. minimalist, streetwear, athleisure, smart casual, romantic, playful
  162 +
  163 +Rules:
  164 +- Keep the same row order and row count as input.
  165 +- Infer only from the provided product text.
  166 +- Leave blank if not applicable or not reasonably supported.
  167 +- Use concise, standardized English ecommerce wording.
  168 +- Do not combine different attribute dimensions in one field.
  169 +- If multiple values are needed, use the delimiter required by the localization setting.
  170 +
  171 +Input product list:
  172 +"""
  173 +```
  174 +
  175 +## 2. Other taxonomy profiles
  176 +
  177 +Notes:
  178 +- Every profile returns `zh` + `en`.
  179 +- The profile slugs in code match the table below.
  180 +
  181 +| Profile | Core columns (`en`) |
  182 +| --- | --- |
  183 +| `3c` | Product Type, Compatible Device / Model, Connectivity, Interface / Port Type, Power Source / Charging, Key Features, Material / Finish, Color, Pack Size, Use Case |
  184 +| `bags` | Product Type, Target Gender, Carry Style, Size / Capacity, Material, Closure Type, Structure / Compartments, Strap / Handle Type, Color, Occasion / End Use |
  185 +| `pet_supplies` | Product Type, Pet Type, Breed Size, Life Stage, Material / Ingredients, Flavor / Scent, Key Features, Functional Benefits, Size / Capacity, Use Scenario |
  186 +| `electronics` | Product Type, Device Category / Compatibility, Power / Voltage, Connectivity, Interface / Port Type, Capacity / Storage, Key Features, Material / Finish, Color, Use Case |
  187 +| `outdoor` | Product Type, Activity Type, Season / Weather, Material, Capacity / Size, Protection / Resistance, Key Features, Portability / Packability, Color, Use Scenario |
  188 +| `home_appliances` | Product Type, Appliance Category, Power / Voltage, Capacity / Coverage, Control Method, Installation Type, Key Features, Material / Finish, Color, Use Scenario |
  189 +| `home_living` | Product Type, Room / Placement, Material, Style, Size / Dimensions, Color, Pattern / Finish, Key Features, Assembly / Installation, Use Scenario |
  190 +| `wigs` | Product Type, Hair Material, Hair Texture, Hair Length, Hair Color, Cap Construction, Lace Area / Part Type, Density / Volume, Style / Bang Type, Occasion / End Use |
  191 +| `beauty` | Product Type, Target Area, Skin Type / Hair Type, Finish / Effect, Key Ingredients, Shade / Color, Scent, Formulation, Functional Benefits, Use Scenario |
  192 +| `accessories` | Product Type, Target Gender, Material, Color, Pattern / Finish, Closure / Fastening, Size / Fit, Style, Occasion / End Use, Set / Pack Size |
  193 +| `toys` | Product Type, Age Group, Character / Theme, Material, Power Source, Interactive Features, Educational / Play Value, Piece Count / Size, Color, Use Scenario |
  194 +| `shoes` | Product Type, Target Gender, Age Group, Closure Type, Toe Shape, Heel Height / Sole Type, Upper Material, Lining / Insole Material, Color, Occasion / End Use |
  195 +| `sports` | Product Type, Sport / Activity, Skill Level, Material, Size / Capacity, Protection / Support, Key Features, Power Source, Color, Use Scenario |
  196 +| `others` | Product Type, Product Category, Target User, Material / Ingredients, Key Features, Functional Benefits, Size / Capacity, Color, Style / Theme, Use Scenario |
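Each profile's column list ultimately becomes a markdown table header in the prompt. A minimal sketch of that rendering; the leading `No.` column mirrors the shared header convention, and `markdown_header` is a hypothetical name:

```python
def markdown_header(columns):
    """Render a two-line markdown table header for a profile's output columns."""
    head = "| No. | " + " | ".join(columns) + " |"
    rule = "| --- " * (len(columns) + 1) + "|"
    return head + "\n" + rule


print(markdown_header(["Product Type", "Color"]))
```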
... ...
mappings/README.md
... ... @@ -68,6 +68,7 @@
68 68 - `option2_values`
69 69 - `option3_values`
70 70 - `enriched_attributes.value`
  71 +- `enriched_taxonomy_attributes.value`
71 72 - `specifications.value_text`
72 73  
73 74 Taking `category_path` and `option*_values` as examples, core-language ingestion results should include at least:
... ...
mappings/generate_search_products_mapping.py
... ... @@ -214,6 +214,11 @@ FIELD_SPECS = [
214 214 scalar_field("name", "keyword"),
215 215 text_field("value", "core_language_text_with_keyword"),
216 216 ),
  217 + nested_field(
  218 + "enriched_taxonomy_attributes",
  219 + scalar_field("name", "keyword"),
  220 + text_field("value", "core_language_text_with_keyword"),
  221 + ),
217 222 scalar_field("option1_name", "keyword"),
218 223 scalar_field("option2_name", "keyword"),
219 224 scalar_field("option3_name", "keyword"),
... ...
mappings/search_products.json
... ... @@ -2116,6 +2116,40 @@
2116 2116 }
2117 2117 }
2118 2118 },
  2119 + "enriched_taxonomy_attributes": {
  2120 + "type": "nested",
  2121 + "properties": {
  2122 + "name": {
  2123 + "type": "keyword"
  2124 + },
  2125 + "value": {
  2126 + "type": "object",
  2127 + "properties": {
  2128 + "zh": {
  2129 + "type": "text",
  2130 + "analyzer": "index_ik",
  2131 + "search_analyzer": "query_ik",
  2132 + "fields": {
  2133 + "keyword": {
  2134 + "type": "keyword",
  2135 + "normalizer": "lowercase"
  2136 + }
  2137 + }
  2138 + },
  2139 + "en": {
  2140 + "type": "text",
  2141 + "analyzer": "english",
  2142 + "fields": {
  2143 + "keyword": {
  2144 + "type": "keyword",
  2145 + "normalizer": "lowercase"
  2146 + }
  2147 + }
  2148 + }
  2149 + }
  2150 + }
  2151 + }
  2152 + },
2119 2153 "option1_name": {
2120 2154 "type": "keyword"
2121 2155 },
... ...
perf_reports/20260311/reranker_1000docs/report.md
... ... @@ -34,5 +34,5 @@ Workload profile:
34 34 ## Reproduce
35 35  
36 36 ```bash
37   -./scripts/benchmark_reranker_1000docs.sh
  37 +./benchmarks/reranker/benchmark_reranker_1000docs.sh
38 38 ```
... ...
perf_reports/20260317/translation_local_models/README.md
1 1 # Local Translation Model Benchmark Report
2 2  
3   -Test script: [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py)
  3 +Test script: [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py)
4 4  
5 5 Test time: `2026-03-17`
6 6  
... ... @@ -67,7 +67,7 @@ To model online search query translation, we reran NLLB with `batch_size=1`. In
67 67 Command used:
68 68  
69 69 ```bash
70   -./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
  70 +./.venv-translator/bin/python benchmarks/translation/benchmark_translation_local_models.py \
71 71 --single \
72 72 --model nllb-200-distilled-600m \
73 73 --source-lang zh \
... ...
perf_reports/20260318/nllb_t4_product_names_ct2/README.md
1 1 # NLLB T4 Product-Name Tuning Summary
2 2  
3 3 Test script:
4   -- [`scripts/benchmark_nllb_t4_tuning.py`](/data/saas-search/scripts/benchmark_nllb_t4_tuning.py)
  4 +- [`benchmarks/translation/benchmark_nllb_t4_tuning.py`](/data/saas-search/benchmarks/translation/benchmark_nllb_t4_tuning.py)
5 5  
6 6 Reports for this round:
7 7 - Markdown: [`nllb_t4_tuning_003608.md`](/data/saas-search/perf_reports/20260318/nllb_t4_product_names_ct2/nllb_t4_tuning_003608.md)
... ...
perf_reports/20260318/translation_local_models/README.md
1 1 # Local Translation Model Benchmark Report
2 2  
3 3 Test script:
4   -- [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py)
  4 +- [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py)
5 5  
6 6 Full results:
7 7 - Markdown: [`translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md)
... ... @@ -39,7 +39,7 @@
39 39  
40 40 ```bash
41 41 cd /data/saas-search
42   -./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
  42 +./.venv-translator/bin/python benchmarks/translation/benchmark_translation_local_models.py \
43 43 --suite extended \
44 44 --disable-cache \
45 45 --serial-items-per-case 256 \
... ...
perf_reports/20260318/translation_local_models_ct2/README.md
1 1 # Local Translation Model Benchmark Report (CTranslate2)
2 2  
3 3 Test script:
4   -- [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py)
  4 +- [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py)
5 5  
6 6 CT2 results for this round:
7 7 - Markdown: [`translation_local_models_ct2_extended_233253.md`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2/translation_local_models_ct2_extended_233253.md)
... ... @@ -46,7 +46,7 @@ from datetime import datetime
46 46 from pathlib import Path
47 47 from types import SimpleNamespace
48 48  
49   -from scripts.benchmark_translation_local_models import (
  49 +from benchmarks.translation.benchmark_translation_local_models import (
50 50 SCENARIOS,
51 51 benchmark_extended_scenario,
52 52 build_environment_info,
... ...
perf_reports/20260318/translation_local_models_ct2_focus/README.md
1 1 # Local Translation Model Focused T4 Tuning
2 2  
3 3 Test script:
4   -- [`scripts/benchmark_translation_local_models_focus.py`](/data/saas-search/scripts/benchmark_translation_local_models_focus.py)
  4 +- [`benchmarks/translation/benchmark_translation_local_models_focus.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models_focus.py)
5 5  
6 6 Focused results for this round:
7 7 - Markdown: [`translation_local_models_focus_235018.md`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2_focus/translation_local_models_focus_235018.md)
... ...
perf_reports/README.md
... ... @@ -4,7 +4,7 @@
4 4  
5 5 | Script | Purpose |
6 6 |------|------|
7   -| `scripts/perf_api_benchmark.py` | Load testing for search backend, embedding, translation, reranker and other HTTP endpoints; supports `--embed-text-priority` / `--embed-image-priority` and `scripts/perf_cases.json.example` |
  7 +| `benchmarks/perf_api_benchmark.py` | Load testing for search backend, embedding, translation, reranker and other HTTP endpoints; supports `--embed-text-priority` / `--embed-image-priority` and `benchmarks/perf_cases.json.example` |
8 8  
9 9 Historical matrix example (concurrency sweep):
10 10  
... ... @@ -25,10 +25,10 @@
25 25  
26 26 ```bash
27 27 source activate.sh
28   -python scripts/perf_api_benchmark.py --scenario embed_text --duration 8 --concurrency 10 --timeout 30 --output perf_reports/2026-03-20_embed_text_p0.json
29   -python scripts/perf_api_benchmark.py --scenario embed_text --duration 8 --concurrency 10 --embed-text-priority 1 --output perf_reports/2026-03-20_embed_text_p1.json
30   -python scripts/perf_api_benchmark.py --scenario embed_image --duration 8 --concurrency 5 --timeout 60 --output perf_reports/2026-03-20_embed_image_p0.json
31   -python scripts/perf_api_benchmark.py --scenario embed_image --duration 8 --concurrency 5 --embed-image-priority 1 --output perf_reports/2026-03-20_embed_image_p1.json
  28 +python benchmarks/perf_api_benchmark.py --scenario embed_text --duration 8 --concurrency 10 --timeout 30 --output perf_reports/2026-03-20_embed_text_p0.json
  29 +python benchmarks/perf_api_benchmark.py --scenario embed_text --duration 8 --concurrency 10 --embed-text-priority 1 --output perf_reports/2026-03-20_embed_text_p1.json
  30 +python benchmarks/perf_api_benchmark.py --scenario embed_image --duration 8 --concurrency 5 --timeout 60 --output perf_reports/2026-03-20_embed_image_p0.json
  31 +python benchmarks/perf_api_benchmark.py --scenario embed_image --duration 8 --concurrency 5 --embed-image-priority 1 --output perf_reports/2026-03-20_embed_image_p1.json
32 32 ```
33 33  
34 34 Note: this was an **8-second smoke** run, so its duration/concurrency is not directly comparable with the `2026-03-12` matrix; it only verifies that, under the `priority` parameter, the service still returns 200 and the payload passes validation.
... ...
perf_reports/reranker_vllm_instruction/2026-03-25/RESULTS.md
... ... @@ -25,7 +25,7 @@ Shared across both backends for this run:
25 25  
26 26 ## Methodology
27 27  
28   -- Script: `python scripts/benchmark_reranker_random_titles.py 100,200,400,600,800,1000 --repeat 5` with **`--seed 99`** (see note below), **`--quiet-runs`**, **`--timeout 360`**.
  28 +- Script: `python benchmarks/reranker/benchmark_reranker_random_titles.py 100,200,400,600,800,1000 --repeat 5` with **`--seed 99`** (see note below), **`--quiet-runs`**, **`--timeout 360`**.
29 29 - Titles: default file `/home/ubuntu/rerank_test/titles.1.8w` (one title per line).
30 30 - Query: default `健身女生T恤短袖`.
31 31 - Each scenario: **3 warm-up** requests at `n=400` (not timed), then **5 timed** runs per `n`.
... ... @@ -56,9 +56,9 @@ JSON aggregates (means, stdev, raw `values_ms`): same directory, `qwen3_vllm_{co
56 56 ## Tooling added / changed
57 57  
58 58 - `reranker/server.py`: `/health` includes `instruction_format` when the active backend sets `_instruction_format`.
59   -- `scripts/benchmark_reranker_random_titles.py`: `--tag`, `--json-summary-out`, `--quiet-runs`.
60   -- `scripts/patch_rerank_vllm_benchmark_config.py`: surgical YAML patch (preserves newlines).
61   -- `scripts/run_reranker_vllm_instruction_benchmark.sh`: full matrix driver (continues if a benchmark exits non-zero; uses `--timeout 360`).
  59 +- `benchmarks/reranker/benchmark_reranker_random_titles.py`: `--tag`, `--json-summary-out`, `--quiet-runs`.
  60 +- `benchmarks/reranker/patch_rerank_vllm_benchmark_config.py`: surgical YAML patch (preserves newlines).
  61 +- `benchmarks/reranker/run_reranker_vllm_instruction_benchmark.sh`: full matrix driver (continues if a benchmark exits non-zero; uses `--timeout 360`).
62 62  
63 63 ---
64 64  
... ... @@ -73,7 +73,7 @@ JSON aggregates (means, stdev, raw `values_ms`): same directory, `qwen3_vllm_{co
73 73 | Attention | Backend forced / steered attention on T4 (e.g. `TRITON_ATTN` path) | **No** `attention_config` in `LLM(...)`; vLLM **auto** — on this T4 run, logs show **`FLASHINFER`** |
74 74 | Config surface | `vllm_attention_backend` / `RERANK_VLLM_ATTENTION_BACKEND`, etc. | **Removed** (fewer YAML/env-var branches; converged logic) |
75 75 | Code default `instruction_format` | `qwen3_vllm_score` defaulted to `standard` | Aligned with `qwen3_vllm` to **`compact`** (`standard` can still be set in YAML) |
76   -| Smoke / startup | — | `scripts/smoke_qwen3_vllm_score_backend.py`; `scripts/start_reranker.sh` puts the **venv `bin` first on `PATH`** (FLASHINFER JIT depends on `ninja` inside the venv) |
  76 +| Smoke / startup | — | `benchmarks/reranker/smoke_qwen3_vllm_score_backend.py`; `scripts/start_reranker.sh` puts the **venv `bin` first on `PATH`** (FLASHINFER JIT depends on `ninja` inside the venv) |
77 77  
78 78 Micro-benchmark (same machine, isolated): **~927.5 ms → ~673.1 ms** at **n=400** docs on `LLM.score()` steady state (~**28%**), after removing the forced attention path and letting vLLM pick **FLASHINFER**.
79 79  
... ...
requirements_translator_service.txt
... ... @@ -13,7 +13,8 @@ httpx>=0.24.0
13 13 tqdm>=4.65.0
14 14  
15 15 torch>=2.0.0
16   -transformers>=4.30.0
  16 +# Keep translator conversions on the last verified NLLB-compatible release line.
  17 +transformers>=4.51.0,<4.52.0
17 18 ctranslate2>=4.7.0
18 19 sentencepiece>=0.2.0
19 20 sacremoses>=0.1.1
... ...
reranker/DEPLOYMENT_AND_TUNING.md
... ... @@ -109,7 +109,7 @@ curl -sS http://127.0.0.1:6007/health
109 109 ### 5.1 Using the one-shot benchmark script
110 110  
111 111 ```bash
112   -./scripts/benchmark_reranker_1000docs.sh
  112 +./benchmarks/reranker/benchmark_reranker_1000docs.sh
113 113 ```
114 114  
115 115 Output directory:
... ...
reranker/GGUF_0_6B_INSTALL_AND_TUNING.md
... ... @@ -144,7 +144,7 @@ qwen3_gguf_06b:
144 144  
145 145 ```bash
146 146 PYTHONPATH=/data/saas-search ./.venv-reranker-gguf/bin/python \
147   - scripts/benchmark_reranker_gguf_local.py --backend-name qwen3_gguf_06b --docs 400
  147 + benchmarks/reranker/benchmark_reranker_gguf_local.py --backend-name qwen3_gguf_06b --docs 400
148 148 ```
149 149  
150 150 Start it as a service:
... ...
reranker/GGUF_INSTALL_AND_TUNING.md
... ... @@ -117,7 +117,7 @@ HF_HUB_DISABLE_XET=1
117 117  
118 118 ```bash
119 119 PYTHONPATH=/data/saas-search ./.venv-reranker-gguf/bin/python \
120   - scripts/benchmark_reranker_gguf_local.py --docs 64 --repeat 1
  120 + benchmarks/reranker/benchmark_reranker_gguf_local.py --docs 64 --repeat 1
121 121 ```
122 122  
123 123 It instantiates the GGUF backend directly and prints:
... ... @@ -134,7 +134,7 @@ PYTHONPATH=/data/saas-search ./.venv-reranker-gguf/bin/python \
134 134  
135 135 - Query: `白色oversized T-shirt`
136 136 - Docs: `64` product titles
137   -- Local script: `scripts/benchmark_reranker_gguf_local.py`
  137 +- Local script: `benchmarks/reranker/benchmark_reranker_gguf_local.py`
138 138 - 1 run per configuration; focus on comparing relative trends
139 139  
140 140 Results:
... ... @@ -195,7 +195,7 @@ n_gpu_layers=999
195 195  
196 196 ```bash
197 197 RERANK_BASE=http://127.0.0.1:6007 \
198   - ./.venv/bin/python scripts/benchmark_reranker_random_titles.py 64 --repeat 1 --query '白色oversized T-shirt'
  198 + ./.venv/bin/python benchmarks/reranker/benchmark_reranker_random_titles.py 64 --repeat 1 --query '白色oversized T-shirt'
199 199 ```
200 200  
201 201 Result:
... ... @@ -206,7 +206,7 @@ RERANK_BASE=http://127.0.0.1:6007 \
206 206  
207 207 ```bash
208 208 RERANK_BASE=http://127.0.0.1:6007 \
209   - ./.venv/bin/python scripts/benchmark_reranker_random_titles.py 153 --repeat 1 --query '白色oversized T-shirt'
  209 + ./.venv/bin/python benchmarks/reranker/benchmark_reranker_random_titles.py 153 --repeat 1 --query '白色oversized T-shirt'
210 210 ```
211 211  
212 212 Result:
... ... @@ -276,5 +276,5 @@ offload_kqv: true
276 276 - `config/config.yaml`
277 277 - `scripts/setup_reranker_venv.sh`
278 278 - `scripts/start_reranker.sh`
279   -- `scripts/benchmark_reranker_gguf_local.py`
  279 +- `benchmarks/reranker/benchmark_reranker_gguf_local.py`
280 280 - `reranker/GGUF_INSTALL_AND_TUNING.md`
... ...
reranker/README.md
... ... @@ -46,9 +46,9 @@ The Reranker service provides a unified `/rerank` API with pluggable backends (BGE, Jin
46 46 - `backends/dashscope_rerank.py`: DashScope cloud rerank backend
47 47 - `scripts/setup_reranker_venv.sh`: creates a separate venv per backend
48 48 - `scripts/start_reranker.sh`: starts the reranker service
49   -- `scripts/smoke_qwen3_vllm_score_backend.py`: local smoke test for `qwen3_vllm_score`
50   -- `scripts/benchmark_reranker_random_titles.py`: random-title benchmark script
51   -- `scripts/run_reranker_vllm_instruction_benchmark.sh`: historical matrix script
  49 +- `benchmarks/reranker/smoke_qwen3_vllm_score_backend.py`: local smoke test for `qwen3_vllm_score`
  50 +- `benchmarks/reranker/benchmark_reranker_random_titles.py`: random-title benchmark script
  51 +- `benchmarks/reranker/run_reranker_vllm_instruction_benchmark.sh`: historical matrix script
52 52  
53 53 ## Environment baseline
54 54  
... ... @@ -118,7 +118,7 @@ nvidia-smi
118 118 ### 4. Smoke
119 119  
120 120 ```bash
121   -PYTHONPATH=. ./.venv-reranker-score/bin/python scripts/smoke_qwen3_vllm_score_backend.py --gpu-memory-utilization 0.2
  121 +PYTHONPATH=. ./.venv-reranker-score/bin/python benchmarks/reranker/smoke_qwen3_vllm_score_backend.py --gpu-memory-utilization 0.2
122 122 ```
123 123  
124 124 ## `jina_reranker_v3`
... ...
scripts/README.md 0 → 100644
... ... @@ -0,0 +1,59 @@
  1 +# Scripts
  2 +
  3 +`scripts/` now keeps only the run, ops, environment, and data-processing scripts that are still valid under the current architecture, split into stable subdirectories by responsibility instead of piling up flat in the root.
  4 +
  5 +## Current layout
  6 +
  7 +- Service orchestration
  8 + - `service_ctl.sh`
  9 + - `start_backend.sh`
  10 + - `start_indexer.sh`
  11 + - `start_frontend.sh`
  12 + - `start_eval_web.sh`
  13 + - `start_embedding_service.sh`
  14 + - `start_embedding_text_service.sh`
  15 + - `start_embedding_image_service.sh`
  16 + - `start_reranker.sh`
  17 + - `start_translator.sh`
  18 + - `start_tei_service.sh`
  19 + - `start_cnclip_service.sh`
  20 + - `stop.sh`
  21 + - `stop_tei_service.sh`
  22 + - `stop_cnclip_service.sh`
  23 + - `frontend/`
  24 + - `ops/`
  25 +
  26 +- Environment setup
  27 + - `create_venv.sh`
  28 + - `init_env.sh`
  29 + - `setup_embedding_venv.sh`
  30 + - `setup_reranker_venv.sh`
  31 + - `setup_translator_venv.sh`
  32 + - `setup_cnclip_venv.sh`
  33 +
  34 +- Data and indexing
  35 + - `create_tenant_index.sh`
  36 + - `build_suggestions.sh`
  37 + - `mock_data.sh`
  38 + - `data_import/`
  39 + - `inspect/`
  40 + - `maintenance/`
  41 +
  42 +- Evaluation and specialized tools
  43 + - `evaluation/`
  44 + - `redis/`
  45 + - `debug/`
  46 + - `translation/`
  47 +
  48 +## Migrated out
  49 +
  50 +- Benchmark and smoke scripts: moved to `benchmarks/`
  51 +- Manual API trial scripts: moved to `tests/manual/`
  52 +
  53 +## Cleaned up
  54 +
  55 +- Historical backup directory: `indexer__old_2025_11/`
  56 +- Obsolete wrapper script: `start.sh`
  57 +- Conda-era leftover: `install_server_deps.sh`
  58 +
  59 +When adding new scripts, prefer a clear subdirectory; do not drop benchmark, manual, or historical-backup scripts back into the root of `scripts/`.
... ...
scripts/data_import/README.md 0 → 100644
... ... @@ -0,0 +1,13 @@
  1 +# Data Import Scripts
  2 +
  3 +These scripts convert external product data or CSV/XLSX samples into the Shoplazza import format.
  4 +
  5 +- `amazon_xlsx_to_shoplazza_xlsx.py`
  6 +- `competitor_xlsx_to_shoplazza_xlsx.py`
  7 +- `csv_to_excel.py`
  8 +- `csv_to_excel_multi_variant.py`
  9 +- `shoplazza_excel_template.py`
  10 +- `shoplazza_import_template.py`
  11 +- `tenant3_csv_to_shoplazza_xlsx.sh`
  12 +
  13 +These are offline data-conversion tools, not operational entry points for online services.
... ...
scripts/amazon_xlsx_to_shoplazza_xlsx.py renamed to scripts/data_import/amazon_xlsx_to_shoplazza_xlsx.py
... ... @@ -35,9 +35,10 @@ from pathlib import Path
35 35  
36 36 from openpyxl import load_workbook
37 37  
38   -# Allow running as `python scripts/xxx.py` without installing as a package
39   -sys.path.insert(0, str(Path(__file__).resolve().parent))
40   -from shoplazza_excel_template import create_excel_from_template_fast
  38 +REPO_ROOT = Path(__file__).resolve().parents[2]
  39 +sys.path.insert(0, str(REPO_ROOT))
  40 +
  41 +from scripts.data_import.shoplazza_excel_template import create_excel_from_template_fast
41 42  
42 43  
43 44 PREFERRED_OPTION_KEYS = [
... ... @@ -612,4 +613,3 @@ def main():
612 613 if __name__ == "__main__":
613 614 main()
614 615  
615   -
... ...
scripts/competitor_xlsx_to_shoplazza_xlsx.py renamed to scripts/data_import/competitor_xlsx_to_shoplazza_xlsx.py
... ... @@ -6,7 +6,7 @@ The input `data/mai_jia_jing_ling/products_data/*.xlsx` files are Amazon-format
6 6 (Parent/Child ASIN), not “competitor data”.
7 7  
8 8 Please use:
9   - - `scripts/amazon_xlsx_to_shoplazza_xlsx.py`
  9 + - `scripts/data_import/amazon_xlsx_to_shoplazza_xlsx.py`
10 10  
11 11 This wrapper simply forwards all CLI args to the correctly named script, so you
12 12 automatically get the latest performance improvements (fast read/write).
... ... @@ -15,13 +15,12 @@ automatically get the latest performance improvements (fast read/write).
15 15 import sys
16 16 from pathlib import Path
17 17  
18   -# Allow running as `python scripts/xxx.py` without installing as a package
19   -sys.path.insert(0, str(Path(__file__).resolve().parent))
  18 +REPO_ROOT = Path(__file__).resolve().parents[2]
  19 +sys.path.insert(0, str(REPO_ROOT))
20 20  
21   -from amazon_xlsx_to_shoplazza_xlsx import main as amazon_main
  21 +from scripts.data_import.amazon_xlsx_to_shoplazza_xlsx import main as amazon_main
22 22  
23 23  
24 24 if __name__ == "__main__":
25 25 amazon_main()
26 26  
27   -
... ...
scripts/csv_to_excel.py renamed to scripts/data_import/csv_to_excel.py
... ... @@ -22,12 +22,12 @@ from openpyxl import load_workbook
22 22 from openpyxl.styles import Font, Alignment
23 23 from openpyxl.utils import get_column_letter
24 24  
25   -# Shared helpers (keeps template writing consistent across scripts)
26   -from scripts.shoplazza_import_template import create_excel_from_template as _create_excel_from_template_shared
27   -from scripts.shoplazza_import_template import generate_handle as _generate_handle_shared
  25 +REPO_ROOT = Path(__file__).resolve().parents[2]
  26 +sys.path.insert(0, str(REPO_ROOT))
28 27  
29   -# Add parent directory to path
30   -sys.path.insert(0, str(Path(__file__).parent.parent))
  28 +# Shared helpers (keeps template writing consistent across scripts)
  29 +from scripts.data_import.shoplazza_import_template import create_excel_from_template as _create_excel_from_template_shared
  30 +from scripts.data_import.shoplazza_import_template import generate_handle as _generate_handle_shared
31 31  
32 32  
33 33 def clean_value(value):
... ... @@ -299,4 +299,3 @@ def main():
299 299  
300 300 if __name__ == '__main__':
301 301 main()
302   -
... ...
scripts/csv_to_excel_multi_variant.py renamed to scripts/data_import/csv_to_excel_multi_variant.py
... ... @@ -22,12 +22,12 @@ import itertools
22 22 from openpyxl import load_workbook
23 23 from openpyxl.styles import Alignment
24 24  
25   -# Shared helpers (keeps template writing consistent across scripts)
26   -from scripts.shoplazza_import_template import create_excel_from_template as _create_excel_from_template_shared
27   -from scripts.shoplazza_import_template import generate_handle as _generate_handle_shared
  25 +REPO_ROOT = Path(__file__).resolve().parents[2]
  26 +sys.path.insert(0, str(REPO_ROOT))
28 27  
29   -# Add parent directory to path
30   -sys.path.insert(0, str(Path(__file__).parent.parent))
  28 +# Shared helpers (keeps template writing consistent across scripts)
  29 +from scripts.data_import.shoplazza_import_template import create_excel_from_template as _create_excel_from_template_shared
  30 +from scripts.data_import.shoplazza_import_template import generate_handle as _generate_handle_shared
31 31  
32 32 # Color definitions
33 33 COLORS = [
... ... @@ -562,4 +562,3 @@ def main():
562 562  
563 563 if __name__ == '__main__':
564 564 main()
565   -
... ...
scripts/shoplazza_excel_template.py renamed to scripts/data_import/shoplazza_excel_template.py
scripts/shoplazza_import_template.py renamed to scripts/data_import/shoplazza_import_template.py
scripts/tenant3__csv_to_shoplazza_xlsx.sh renamed to scripts/data_import/tenant3_csv_to_shoplazza_xlsx.sh
... ... @@ -5,16 +5,16 @@ cd "$(dirname "$0")/.."
5 5 source ./activate.sh
6 6  
7 7 # # Basic usage (generate all data)
8   -# python scripts/csv_to_excel.py
  8 +# python scripts/data_import/csv_to_excel.py
9 9  
10 10 # # Specify the output file
11   -# python scripts/csv_to_excel.py --output tenant3_imports.xlsx
  11 +# python scripts/data_import/csv_to_excel.py --output tenant3_imports.xlsx
12 12  
13 13 # # Limit the number of rows processed (for testing)
14   -# python scripts/csv_to_excel.py --limit 100
  14 +# python scripts/data_import/csv_to_excel.py --limit 100
15 15  
16 16 # Specify the CSV file and the template file
17   -python scripts/csv_to_excel.py \
  17 +python scripts/data_import/csv_to_excel.py \
18 18 --csv-file data/customer1/goods_with_pic.5years_congku.csv.shuf.1w \
19 19 --template docs/商品导入模板.xlsx \
20   - --output tenant3_imports.xlsx
21 20 \ No newline at end of file
  21 + --output tenant3_imports.xlsx
... ...
scripts/trace_indexer_calls.sh renamed to scripts/debug/trace_indexer_calls.sh
1 1 #!/bin/bash
2 2 #
3 3 # Script for tracing who is calling the indexer service
4   -# Usage: ./scripts/trace_indexer_calls.sh
  4 +# Usage: ./scripts/debug/trace_indexer_calls.sh
5 5 #
6 6  
7 7 set -euo pipefail
... ...
scripts/download_translation_models.py 100755 → 100644
1 1 #!/usr/bin/env python3
2   -"""Download local translation models declared in services.translation.capabilities."""
  2 +"""Backward-compatible entrypoint for translation model downloads."""
3 3  
4 4 from __future__ import annotations
5 5  
6   -import argparse
7   -import os
  6 +import runpy
8 7 from pathlib import Path
9   -import shutil
10   -import subprocess
11   -import sys
12   -from typing import Iterable
13   -
14   -from huggingface_hub import snapshot_download
15   -
16   -PROJECT_ROOT = Path(__file__).resolve().parent.parent
17   -if str(PROJECT_ROOT) not in sys.path:
18   - sys.path.insert(0, str(PROJECT_ROOT))
19   -os.environ.setdefault("HF_HUB_DISABLE_XET", "1")
20   -
21   -from config.services_config import get_translation_config
22   -
23   -
24   -LOCAL_BACKENDS = {"local_nllb", "local_marian"}
25   -
26   -
27   -def iter_local_capabilities(selected: set[str] | None = None) -> Iterable[tuple[str, dict]]:
28   - cfg = get_translation_config()
29   - capabilities = cfg.get("capabilities", {}) if isinstance(cfg, dict) else {}
30   - for name, capability in capabilities.items():
31   - backend = str(capability.get("backend") or "").strip().lower()
32   - if backend not in LOCAL_BACKENDS:
33   - continue
34   - if selected and name not in selected:
35   - continue
36   - yield name, capability
37   -
38   -
39   -def _compute_ct2_output_dir(capability: dict) -> Path:
40   - custom = str(capability.get("ct2_model_dir") or "").strip()
41   - if custom:
42   - return Path(custom).expanduser()
43   - model_dir = Path(str(capability.get("model_dir") or "")).expanduser()
44   - compute_type = str(capability.get("ct2_compute_type") or capability.get("torch_dtype") or "default").strip().lower()
45   - normalized = compute_type.replace("_", "-")
46   - return model_dir / f"ctranslate2-{normalized}"
47   -
48   -
49   -def _resolve_converter_binary() -> str:
50   - candidate = shutil.which("ct2-transformers-converter")
51   - if candidate:
52   - return candidate
53   - venv_candidate = Path(sys.executable).absolute().parent / "ct2-transformers-converter"
54   - if venv_candidate.exists():
55   - return str(venv_candidate)
56   - raise RuntimeError(
57   - "ct2-transformers-converter was not found. "
58   - "Install ctranslate2 in the active Python environment first."
59   - )
60   -
61   -
62   -def convert_to_ctranslate2(name: str, capability: dict) -> None:
63   - model_id = str(capability.get("model_id") or "").strip()
64   - model_dir = Path(str(capability.get("model_dir") or "")).expanduser()
65   - model_source = str(model_dir if model_dir.exists() else model_id)
66   - output_dir = _compute_ct2_output_dir(capability)
67   - if (output_dir / "model.bin").exists():
68   - print(f"[skip-convert] {name} -> {output_dir}")
69   - return
70   - quantization = str(
71   - capability.get("ct2_conversion_quantization")
72   - or capability.get("ct2_compute_type")
73   - or capability.get("torch_dtype")
74   - or "default"
75   - ).strip()
76   - output_dir.parent.mkdir(parents=True, exist_ok=True)
77   - print(f"[convert] {name} -> {output_dir} ({quantization})")
78   - subprocess.run(
79   - [
80   - _resolve_converter_binary(),
81   - "--model",
82   - model_source,
83   - "--output_dir",
84   - str(output_dir),
85   - "--quantization",
86   - quantization,
87   - ],
88   - check=True,
89   - )
90   - print(f"[converted] {name}")
91   -
92   -
93   -def main() -> None:
94   - parser = argparse.ArgumentParser(description="Download local translation models")
95   - parser.add_argument("--all-local", action="store_true", help="Download all configured local translation models")
96   - parser.add_argument("--models", nargs="*", default=[], help="Specific capability names to download")
97   - parser.add_argument(
98   - "--convert-ctranslate2",
99   - action="store_true",
100   - help="Also convert the downloaded Hugging Face models into CTranslate2 format",
101   - )
102   - args = parser.parse_args()
103   -
104   - selected = {item.strip().lower() for item in args.models if item.strip()} or None
105   - if not args.all_local and not selected:
106   - parser.error("pass --all-local or --models <name> ...")
107   -
108   - for name, capability in iter_local_capabilities(selected):
109   - model_id = str(capability.get("model_id") or "").strip()
110   - model_dir = Path(str(capability.get("model_dir") or "")).expanduser()
111   - if not model_id or not model_dir:
112   - raise ValueError(f"Capability '{name}' must define model_id and model_dir")
113   - model_dir.parent.mkdir(parents=True, exist_ok=True)
114   - print(f"[download] {name} -> {model_dir} ({model_id})")
115   - snapshot_download(
116   - repo_id=model_id,
117   - local_dir=str(model_dir),
118   - )
119   - print(f"[done] {name}")
120   - if args.convert_ctranslate2:
121   - convert_to_ctranslate2(name, capability)
122 8  
123 9  
124 10 if __name__ == "__main__":
125   - main()
  11 + target = Path(__file__).resolve().parent / "translation" / "download_translation_models.py"
  12 + runpy.run_path(str(target), run_name="__main__")
... ...
scripts/evaluation/README.md
... ... @@ -127,8 +127,8 @@ This framework now follows graded ranking evaluation closer to e-commerce best p
127 127 - **Composite tuning score: `Primary_Metric_Score`**
128 128 For experiment ranking we compute the mean of the primary scorecard after normalizing `Avg_Grade@10` by the max grade (`3`).
129 129 - **Gain scheme**
130   - `Fully Relevant=7`, `Mostly Relevant=3`, `Weakly Relevant=1`, `Irrelevant=0`
131   - The gains come from rel grades `3/2/1/0` with `gain = 2^rel - 1`, a standard `NDCG` setup.
  130 + `Fully Relevant=3`, `Mostly Relevant=2`, `Weakly Relevant=1`, `Irrelevant=0`
  131 + We keep the rel grades `3/2/1/0`, but the current implementation uses the grade values directly as gains so the exact/high gap is less aggressive.
132 132 - **Why this is better**
133 133 `NDCG` differentiates “exact”, “strong substitute”, and “weak substitute”, so swapping a `Fully Relevant` item with a `Weakly Relevant` one is penalized more than swapping `Mostly Relevant` with `Weakly Relevant`.
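As orientation for readers outside the eval codebase, NDCG under the linear gain scheme above can be sketched in a few lines. This is illustrative only; the framework's own implementation in `eval_framework/metrics.py` is the source of truth and may differ in detail:

```python
import math

# Linear gains matching the scheme above: Fully=3, Mostly=2, Weakly=1, Irrelevant=0.
GAIN = {"Fully Relevant": 3, "Mostly Relevant": 2, "Weakly Relevant": 1, "Irrelevant": 0}

def dcg(labels, k):
    # Discounted cumulative gain over the top-k ranked labels.
    return sum(GAIN[lbl] / math.log2(i + 2) for i, lbl in enumerate(labels[:k]))

def ndcg(ranked, ideal, k=10):
    # Normalize by the DCG of the ideal (gain-sorted) ordering of the labeled pool.
    best = dcg(sorted(ideal, key=GAIN.get, reverse=True), k)
    return dcg(ranked, k) / best if best else 0.0
```

With these gains, swapping a `Fully Relevant` result for a `Weakly Relevant` one costs 2 gain points at that rank, versus 1 for a `Mostly`/`Weakly` swap, which is the asymmetry the scorecard relies on.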
134 134  
... ... @@ -174,6 +174,22 @@ Features: query list from `queries.txt`, single-query and batch evaluation, batc
174 174  
175 175 Each run stores aggregate and per-query metrics, label distribution, timestamp, metric context (including gain scheme and primary metric), and an `/admin/config` snapshot, as Markdown and JSON under `batch_reports/`.
176 176  
  177 +To make later case analysis reproducible without digging through backend logs, each per-query record in the batch JSON now also includes:
  178 +
  179 +- `request_id` — the exact `X-Request-ID` sent by the evaluator for that live search call
  180 +- `top_label_sequence_top10` / `top_label_sequence_top20` — compact label sequence strings such as `1:L3 | 2:L1 | 3:L2`
  181 +- `top_results` — a lightweight top-20 snapshot with `rank`, `spu_id`, `label`, title fields, and `relevance_score`
  182 +
  183 +The Markdown report now surfaces the same case context in a lighter human-readable form:
  184 +
  185 +- request id
  186 +- top-10 / top-20 label sequence
  187 +- top 5 result snapshot for quick scanning
  188 +
  189 +This means a bad case can usually be reconstructed directly from the batch artifact itself, without replaying logs or joining SQLite tables by hand.
  190 +
  191 +The web history endpoint intentionally returns a compact summary only (aggregate metrics plus query count), so adding richer per-query snapshots to the batch payload does not bloat the history list UI.
  192 +
177 193 ## Ranking debug and LTR prep
178 194  
179 195 `debug_info` now exposes two extra layers that are useful for tuning and future learning-to-rank work:
... ...
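For illustration, a compact label-sequence string like `1:L3 | 2:L1 | 3:L2` can be reproduced from a per-query `top_results` snapshot with a helper along these lines (a sketch mirroring the intent; not the framework's exact helper):

```python
# Map canonical labels to grade numbers; unknown labels render as "?".
GRADE = {"Fully Relevant": 3, "Mostly Relevant": 2, "Weakly Relevant": 1, "Irrelevant": 0}

def encode_label_sequence(results, limit=10):
    # results: list of dicts with "rank" and "label" keys, as in the batch JSON.
    parts = []
    for item in results[:limit]:
        grade = GRADE.get(item.get("label"))
        rank = item.get("rank", 0)
        parts.append(f"{rank}:L{grade}" if grade is not None else f"{rank}:?")
    return " | ".join(parts)
```

This keeps bad-case triage readable at a glance: a run of `L0` entries near the top of the sequence is immediately visible without opening the full snapshot.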
scripts/evaluation/eval_framework/__init__.py
... ... @@ -14,10 +14,10 @@ from .constants import ( # noqa: E402
14 14 DEFAULT_ARTIFACT_ROOT,
15 15 DEFAULT_QUERY_FILE,
16 16 PROJECT_ROOT,
17   - RELEVANCE_EXACT,
18   - RELEVANCE_HIGH,
19   - RELEVANCE_IRRELEVANT,
20   - RELEVANCE_LOW,
  17 + RELEVANCE_LV0,
  18 + RELEVANCE_LV1,
  19 + RELEVANCE_LV2,
  20 + RELEVANCE_LV3,
21 21 RELEVANCE_NON_IRRELEVANT,
22 22 VALID_LABELS,
23 23 )
... ... @@ -39,10 +39,10 @@ __all__ = [
39 39 "EvalStore",
40 40 "PROJECT_ROOT",
41 41 "QueryBuildResult",
42   - "RELEVANCE_EXACT",
43   - "RELEVANCE_HIGH",
44   - "RELEVANCE_IRRELEVANT",
45   - "RELEVANCE_LOW",
  42 + "RELEVANCE_LV0",
  43 + "RELEVANCE_LV1",
  44 + "RELEVANCE_LV2",
  45 + "RELEVANCE_LV3",
46 46 "RELEVANCE_NON_IRRELEVANT",
47 47 "SearchEvaluationFramework",
48 48 "VALID_LABELS",
... ...
scripts/evaluation/eval_framework/clients.py
... ... @@ -157,6 +157,7 @@ class SearchServiceClient:
157 157 return self._request_json("GET", path, timeout=timeout)
158 158  
159 159 def search(self, query: str, size: int, from_: int = 0, language: str = "en", *, debug: bool = False) -> Dict[str, Any]:
  160 + request_id = uuid.uuid4().hex[:8]
160 161 payload: Dict[str, Any] = {
161 162 "query": query,
162 163 "size": size,
... ... @@ -165,13 +166,19 @@ class SearchServiceClient:
165 166 }
166 167 if debug:
167 168 payload["debug"] = True
168   - return self._request_json(
  169 + response = self._request_json(
169 170 "POST",
170 171 "/search/",
171 172 timeout=120,
172   - headers={"Content-Type": "application/json", "X-Tenant-ID": self.tenant_id},
  173 + headers={
  174 + "Content-Type": "application/json",
  175 + "X-Tenant-ID": self.tenant_id,
  176 + "X-Request-ID": request_id,
  177 + },
173 178 json_payload=payload,
174 179 )
  180 + response["_eval_request_id"] = request_id
  181 + return response
175 182  
176 183  
177 184 class RerankServiceClient:
... ...
scripts/evaluation/eval_framework/constants.py
... ... @@ -7,24 +7,24 @@ _SCRIPTS_EVAL_DIR = _PKG_DIR.parent
7 7 PROJECT_ROOT = _SCRIPTS_EVAL_DIR.parents[1]
8 8  
9 9 # Canonical English labels (must match LLM prompt output in prompts._CLASSIFY_TEMPLATE_EN)
10   -RELEVANCE_EXACT = "Fully Relevant"
11   -RELEVANCE_HIGH = "Mostly Relevant"
12   -RELEVANCE_LOW = "Weakly Relevant"
13   -RELEVANCE_IRRELEVANT = "Irrelevant"
  10 +RELEVANCE_LV3 = "Fully Relevant"
  11 +RELEVANCE_LV2 = "Mostly Relevant"
  12 +RELEVANCE_LV1 = "Weakly Relevant"
  13 +RELEVANCE_LV0 = "Irrelevant"
14 14  
15   -VALID_LABELS = frozenset({RELEVANCE_EXACT, RELEVANCE_HIGH, RELEVANCE_LOW, RELEVANCE_IRRELEVANT})
  15 +VALID_LABELS = frozenset({RELEVANCE_LV3, RELEVANCE_LV2, RELEVANCE_LV1, RELEVANCE_LV0})
16 16  
17 17 # Useful label sets for binary diagnostic slices layered on top of graded ranking metrics.
18   -RELEVANCE_NON_IRRELEVANT = frozenset({RELEVANCE_EXACT, RELEVANCE_HIGH, RELEVANCE_LOW})
19   -RELEVANCE_STRONG = frozenset({RELEVANCE_EXACT, RELEVANCE_HIGH})
  18 +RELEVANCE_NON_IRRELEVANT = frozenset({RELEVANCE_LV3, RELEVANCE_LV2, RELEVANCE_LV1})
  19 +RELEVANCE_STRONG = frozenset({RELEVANCE_LV3, RELEVANCE_LV2})
20 20  
21 21 # Graded relevance for ranking evaluation.
22 22 # We use rel grades 3/2/1/0 and gain = 2^rel - 1, which is standard for NDCG-style metrics.
23 23 RELEVANCE_GRADE_MAP = {
24   - RELEVANCE_EXACT: 3,
25   - RELEVANCE_HIGH: 2,
26   - RELEVANCE_LOW: 1,
27   - RELEVANCE_IRRELEVANT: 0,
  24 + RELEVANCE_LV3: 3,
  25 + RELEVANCE_LV2: 2,
  26 + RELEVANCE_LV1: 1,
  27 + RELEVANCE_LV0: 0,
28 28 }
29 29 # Standard gain formula: 2^rel - 1
30 30 # But since annotation quality is not especially precise, we deliberately narrow the gap between exact and high
... ... @@ -35,11 +35,12 @@ RELEVANCE_GAIN_MAP = {
35 35 }
36 36  
37 37 # P(stop | relevance) for ERR (Expected Reciprocal Rank); cascade model (Chapelle et al., 2009).
  38 +# p(t) = (2^t - 1) / 2^{max_grade}
38 39 STOP_PROB_MAP = {
39   - RELEVANCE_EXACT: 0.99,
40   - RELEVANCE_HIGH: 0.8,
41   - RELEVANCE_LOW: 0.1,
42   - RELEVANCE_IRRELEVANT: 0.0,
  40 + RELEVANCE_LV3: 0.875,
  41 + RELEVANCE_LV2: 0.375,
  42 + RELEVANCE_LV1: 0.125,
  43 + RELEVANCE_LV0: 0.0,
43 44 }
44 45  
45 46 DEFAULT_ARTIFACT_ROOT = PROJECT_ROOT / "artifacts" / "search_evaluation"
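The stop probabilities above follow `p(t) = (2^t - 1) / 2^max_grade` with `max_grade = 3`, i.e. 7/8, 3/8, 1/8, 0. Under the cascade user model, ERR can be sketched as follows (illustrative, not this module's code; model per Chapelle et al., 2009):

```python
def expected_reciprocal_rank(stop_probs):
    # stop_probs[i] = P(user is satisfied at rank i+1); the user scans top-down
    # and stops at the first satisfying result.
    score, p_reach = 0.0, 1.0
    for rank, p_stop in enumerate(stop_probs, start=1):
        score += p_reach * p_stop / rank
        p_reach *= 1.0 - p_stop
    return score

# Grade t -> stop probability, p(t) = (2**t - 1) / 2**3; matches STOP_PROB_MAP.
STOP = {t: (2 ** t - 1) / 2 ** 3 for t in range(4)}
```

A `Fully Relevant` hit at rank 1 dominates the score; lower ranks are discounted both by position and by the probability the user already stopped.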
... ... @@ -78,7 +79,7 @@ DEFAULT_REBUILD_MAX_LLM_BATCHES = 40
78 79 # A batch is "bad" when **both** hold (strict inequalities; see ``framework._annotate_rebuild_batches``):
79 80 # - irrelevant_ratio > DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO (default 93.9%),
80 81 # - (Irrelevant + Weakly Relevant) / n > DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO (default 95.9%).
81   -# ``irrelevant_ratio`` = Irrelevant count / n; weak relevance is ``RELEVANCE_LOW`` ("Weakly Relevant").
  82 +# ``irrelevant_ratio`` = Irrelevant count / n; weak relevance is ``RELEVANCE_LV1`` ("Weakly Relevant").
82 83 # Increment streak on consecutive bad batches; reset on any non-bad batch. Stop when streak
83 84 # reaches ``DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK`` (default 3).
84 85 DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO = 0.799
... ...
scripts/evaluation/eval_framework/framework.py
... ... @@ -25,14 +25,14 @@ from .constants import (
25 25 DEFAULT_RERANK_HIGH_SKIP_COUNT,
26 26 DEFAULT_RERANK_HIGH_THRESHOLD,
27 27 DEFAULT_SEARCH_RECALL_TOP_K,
28   - RELEVANCE_EXACT,
29 28 RELEVANCE_GAIN_MAP,
30   - RELEVANCE_HIGH,
31   - STOP_PROB_MAP,
32   - RELEVANCE_IRRELEVANT,
33   - RELEVANCE_LOW,
  29 + RELEVANCE_LV0,
  30 + RELEVANCE_LV1,
  31 + RELEVANCE_LV2,
  32 + RELEVANCE_LV3,
34 33 RELEVANCE_NON_IRRELEVANT,
35 34 VALID_LABELS,
  35 + STOP_PROB_MAP,
36 36 )
37 37 from .metrics import (
38 38 PRIMARY_METRIC_GRADE_NORMALIZER,
... ... @@ -96,6 +96,16 @@ def _zh_titles_from_debug_per_result(debug_info: Any) -&gt; Dict[str, str]:
96 96 return out
97 97  
98 98  
  99 +def _encode_label_sequence(items: Sequence[Dict[str, Any]], limit: int) -> str:
  100 + parts: List[str] = []
  101 + for item in items[:limit]:
  102 + rank = int(item.get("rank") or 0)
  103 + label = str(item.get("label") or "")
  104 + grade = RELEVANCE_GAIN_MAP.get(label)
  105 + parts.append(f"{rank}:L{grade}" if grade is not None else f"{rank}:?")
  106 + return " | ".join(parts)
  107 +
  108 +
99 109 class SearchEvaluationFramework:
100 110 def __init__(
101 111 self,
... ... @@ -168,7 +178,7 @@ class SearchEvaluationFramework:
168 178 ) -> Dict[str, Any]:
169 179 live = self.evaluate_live_query(query=query, top_k=top_k, auto_annotate=auto_annotate, language=language)
170 180 labels = [
171   - item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT
  181 + item["label"] if item["label"] in VALID_LABELS else RELEVANCE_LV0
172 182 for item in live["results"]
173 183 ]
174 184 return {
... ... @@ -432,7 +442,7 @@ class SearchEvaluationFramework:
432 442  
433 443 - ``#(Irrelevant)/n > irrelevant_stop_ratio`` (default 0.939), and
434 444 - ``( #(Irrelevant) + #(Weakly Relevant) ) / n > irrelevant_low_combined_stop_ratio``
435   - (default 0.959; weak relevance = ``RELEVANCE_LOW``).
  445 + (default 0.959; weak relevance = ``RELEVANCE_LV1``).
436 446  
437 447 Maintain a streak of consecutive *bad* batches; any non-bad batch resets the streak to 0.
438 448 Stop labeling when ``streak >= stop_streak`` (default 3) or when ``max_batches`` is reached
... ... @@ -474,9 +484,9 @@ class SearchEvaluationFramework:
474 484 time.sleep(0.1)
475 485  
476 486 n = len(batch_docs)
477   - exact_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_EXACT)
478   - irrel_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_IRRELEVANT)
479   - low_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_LOW)
  487 + exact_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_LV3)
  488 + irrel_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_LV0)
  489 + low_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_LV1)
480 490 exact_ratio = exact_n / n if n else 0.0
481 491 irrelevant_ratio = irrel_n / n if n else 0.0
482 492 low_ratio = low_n / n if n else 0.0
... ... @@ -633,7 +643,7 @@ class SearchEvaluationFramework:
633 643 )
634 644  
635 645 top100_labels = [
636   - item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT
  646 + item["label"] if item["label"] in VALID_LABELS else RELEVANCE_LV0
637 647 for item in search_labeled_results[:100]
638 648 ]
639 649 metrics = compute_query_metrics(top100_labels, ideal_labels=list(labels.values()))
... ... @@ -843,7 +853,7 @@ class SearchEvaluationFramework:
843 853 )
844 854  
845 855 top100_labels = [
846   - item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT
  856 + item["label"] if item["label"] in VALID_LABELS else RELEVANCE_LV0
847 857 for item in search_labeled_results[:100]
848 858 ]
849 859 metrics = compute_query_metrics(top100_labels, ideal_labels=list(labels.values()))
... ... @@ -920,16 +930,17 @@ class SearchEvaluationFramework:
920 930 "title_zh": title_zh if title_zh and title_zh != primary_title else "",
921 931 "image_url": doc.get("image_url"),
922 932 "label": label,
  933 + "relevance_score": doc.get("relevance_score"),
923 934 "option_values": list(compact_option_values(doc.get("skus") or [])),
924 935 "product": compact_product_payload(doc),
925 936 }
926 937 )
927 938 metric_labels = [
928   - item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT
  939 + item["label"] if item["label"] in VALID_LABELS else RELEVANCE_LV0
929 940 for item in labeled
930 941 ]
931 942 ideal_labels = [
932   - label if label in VALID_LABELS else RELEVANCE_IRRELEVANT
  943 + label if label in VALID_LABELS else RELEVANCE_LV0
933 944 for label in labels.values()
934 945 ]
935 946 label_stats = self.store.get_query_label_stats(self.tenant_id, query)
... ... @@ -960,10 +971,10 @@ class SearchEvaluationFramework:
960 971 }
961 972 )
962 973 label_order = {
963   - RELEVANCE_EXACT: 0,
964   - RELEVANCE_HIGH: 1,
965   - RELEVANCE_LOW: 2,
966   - RELEVANCE_IRRELEVANT: 3,
  974 + RELEVANCE_LV3: 0,
  975 + RELEVANCE_LV2: 1,
  976 + RELEVANCE_LV1: 2,
  977 + RELEVANCE_LV0: 3,
967 978 }
968 979 missing_relevant.sort(
969 980 key=lambda item: (
... ... @@ -989,6 +1000,7 @@ class SearchEvaluationFramework:
989 1000 "top_k": top_k,
990 1001 "metrics": compute_query_metrics(metric_labels, ideal_labels=ideal_labels),
991 1002 "metric_context": _metric_context_payload(),
  1003 + "request_id": str(search_payload.get("_eval_request_id") or ""),
992 1004 "results": labeled,
993 1005 "missing_relevant": missing_relevant,
994 1006 "label_stats": {
... ... @@ -996,9 +1008,9 @@ class SearchEvaluationFramework:
996 1008 "unlabeled_hits_treated_irrelevant": unlabeled_hits,
997 1009 "recalled_hits": len(labeled),
998 1010 "missing_relevant_count": len(missing_relevant),
999   - "missing_exact_count": sum(1 for item in missing_relevant if item["label"] == RELEVANCE_EXACT),
1000   - "missing_high_count": sum(1 for item in missing_relevant if item["label"] == RELEVANCE_HIGH),
1001   - "missing_low_count": sum(1 for item in missing_relevant if item["label"] == RELEVANCE_LOW),
  1011 + "missing_exact_count": sum(1 for item in missing_relevant if item["label"] == RELEVANCE_LV3),
  1012 + "missing_high_count": sum(1 for item in missing_relevant if item["label"] == RELEVANCE_LV2),
  1013 + "missing_low_count": sum(1 for item in missing_relevant if item["label"] == RELEVANCE_LV1),
1002 1014 },
1003 1015 "tips": tips,
1004 1016 "total": int(search_payload.get("total") or 0),
... ... @@ -1014,6 +1026,7 @@ class SearchEvaluationFramework:
1014 1026 force_refresh_labels: bool = False,
1015 1027 ) -> Dict[str, Any]:
1016 1028 per_query = []
  1029 + case_snapshot_top_n = min(max(int(top_k), 1), 20)
1017 1030 total_q = len(queries)
1018 1031 _log.info("[batch-eval] starting %s queries top_k=%s auto_annotate=%s", total_q, top_k, auto_annotate)
1019 1032 for q_index, query in enumerate(queries, start=1):
... ... @@ -1025,7 +1038,7 @@ class SearchEvaluationFramework:
1025 1038 force_refresh_labels=force_refresh_labels,
1026 1039 )
1027 1040 labels = [
1028   - item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT
  1041 + item["label"] if item["label"] in VALID_LABELS else RELEVANCE_LV0
1029 1042 for item in live["results"]
1030 1043 ]
1031 1044 per_query.append(
... ... @@ -1036,6 +1049,21 @@ class SearchEvaluationFramework:
1036 1049 "metrics": live["metrics"],
1037 1050 "distribution": label_distribution(labels),
1038 1051 "total": live["total"],
  1052 + "request_id": live.get("request_id") or "",
  1053 + "case_snapshot_top_n": case_snapshot_top_n,
  1054 + "top_label_sequence_top10": _encode_label_sequence(live["results"], 10),
  1055 + "top_label_sequence_top20": _encode_label_sequence(live["results"], case_snapshot_top_n),
  1056 + "top_results": [
  1057 + {
  1058 + "rank": int(item.get("rank") or 0),
  1059 + "spu_id": str(item.get("spu_id") or ""),
  1060 + "label": item.get("label"),
  1061 + "title": item.get("title"),
  1062 + "title_zh": item.get("title_zh"),
  1063 + "relevance_score": item.get("relevance_score"),
  1064 + }
  1065 + for item in live["results"][:case_snapshot_top_n]
  1066 + ],
1039 1067 }
1040 1068 )
1041 1069 m = live["metrics"]
... ... @@ -1055,10 +1083,10 @@ class SearchEvaluationFramework:
1055 1083 )
1056 1084 aggregate = aggregate_metrics([item["metrics"] for item in per_query])
1057 1085 aggregate_distribution = {
1058   - RELEVANCE_EXACT: sum(item["distribution"][RELEVANCE_EXACT] for item in per_query),
1059   - RELEVANCE_HIGH: sum(item["distribution"][RELEVANCE_HIGH] for item in per_query),
1060   - RELEVANCE_LOW: sum(item["distribution"][RELEVANCE_LOW] for item in per_query),
1061   - RELEVANCE_IRRELEVANT: sum(item["distribution"][RELEVANCE_IRRELEVANT] for item in per_query),
  1086 + RELEVANCE_LV3: sum(item["distribution"][RELEVANCE_LV3] for item in per_query),
  1087 + RELEVANCE_LV2: sum(item["distribution"][RELEVANCE_LV2] for item in per_query),
  1088 + RELEVANCE_LV1: sum(item["distribution"][RELEVANCE_LV1] for item in per_query),
  1089 + RELEVANCE_LV0: sum(item["distribution"][RELEVANCE_LV0] for item in per_query),
1062 1090 }
1063 1091 batch_id = f"batch_{utc_timestamp()}_{sha1_text(self.tenant_id + '|' + '|'.join(queries))[:10]}"
1064 1092 report_dir = ensure_dir(self.artifact_root / "batch_reports")
... ...
scripts/evaluation/eval_framework/metrics.py
... ... @@ -6,12 +6,12 @@ import math
6 6 from typing import Dict, Iterable, Sequence
7 7  
8 8 from .constants import (
9   - RELEVANCE_EXACT,
10 9 RELEVANCE_GAIN_MAP,
11 10 RELEVANCE_GRADE_MAP,
12   - RELEVANCE_HIGH,
13   - RELEVANCE_IRRELEVANT,
14   - RELEVANCE_LOW,
  11 + RELEVANCE_LV0,
  12 + RELEVANCE_LV1,
  13 + RELEVANCE_LV2,
  14 + RELEVANCE_LV3,
15 15 RELEVANCE_NON_IRRELEVANT,
16 16 RELEVANCE_STRONG,
17 17 STOP_PROB_MAP,
... ... @@ -33,7 +33,7 @@ PRIMARY_METRIC_GRADE_NORMALIZER = float(max(RELEVANCE_GRADE_MAP.values()) or 1.0
33 33 def _normalize_label(label: str) -> str:
34 34 if label in RELEVANCE_GRADE_MAP:
35 35 return label
36   - return RELEVANCE_IRRELEVANT
  36 + return RELEVANCE_LV0
37 37  
38 38  
39 39 def _gains_for_labels(labels: Sequence[str]) -> list[float]:
... ... @@ -135,7 +135,7 @@ def compute_query_metrics(
135 135 ideal = list(ideal_labels) if ideal_labels is not None else list(labels)
136 136 metrics: Dict[str, float] = {}
137 137  
138   - exact_hits = _binary_hits(labels, [RELEVANCE_EXACT])
  138 + exact_hits = _binary_hits(labels, [RELEVANCE_LV3])
139 139 strong_hits = _binary_hits(labels, RELEVANCE_STRONG)
140 140 useful_hits = _binary_hits(labels, RELEVANCE_NON_IRRELEVANT)
141 141  
... ... @@ -183,8 +183,8 @@ def aggregate_metrics(metric_items: Sequence[Dict[str, float]]) -&gt; Dict[str, flo
183 183  
184 184 def label_distribution(labels: Sequence[str]) -> Dict[str, int]:
185 185 return {
186   - RELEVANCE_EXACT: sum(1 for label in labels if label == RELEVANCE_EXACT),
187   - RELEVANCE_HIGH: sum(1 for label in labels if label == RELEVANCE_HIGH),
188   - RELEVANCE_LOW: sum(1 for label in labels if label == RELEVANCE_LOW),
189   - RELEVANCE_IRRELEVANT: sum(1 for label in labels if label == RELEVANCE_IRRELEVANT),
  186 + RELEVANCE_LV3: sum(1 for label in labels if label == RELEVANCE_LV3),
  187 + RELEVANCE_LV2: sum(1 for label in labels if label == RELEVANCE_LV2),
  188 + RELEVANCE_LV1: sum(1 for label in labels if label == RELEVANCE_LV1),
  189 + RELEVANCE_LV0: sum(1 for label in labels if label == RELEVANCE_LV0),
190 190 }
... ...
scripts/evaluation/eval_framework/reports.py
... ... @@ -4,7 +4,7 @@ from __future__ import annotations
4 4  
5 5 from typing import Any, Dict
6 6  
7   -from .constants import RELEVANCE_EXACT, RELEVANCE_HIGH, RELEVANCE_IRRELEVANT, RELEVANCE_LOW
  7 +from .constants import RELEVANCE_GAIN_MAP, RELEVANCE_LV0, RELEVANCE_LV1, RELEVANCE_LV2, RELEVANCE_LV3
8 8 from .metrics import PRIMARY_METRIC_KEYS
9 9  
10 10  
... ... @@ -25,6 +25,38 @@ def _append_metric_block(lines: list[str], metrics: Dict[str, Any]) -&gt; None:
25 25 lines.append(f"- {key}: {value}")
26 26  
27 27  
  28 +def _label_level_code(label: str) -> str:
  29 + grade = RELEVANCE_GAIN_MAP.get(label)
  30 + return f"L{grade}" if grade is not None else "?"
  31 +
  32 +
  33 +def _append_case_snapshot(lines: list[str], item: Dict[str, Any]) -> None:
  34 + request_id = str(item.get("request_id") or "").strip()
  35 + if request_id:
  36 + lines.append(f"- Request ID: `{request_id}`")
  37 + seq10 = str(item.get("top_label_sequence_top10") or "").strip()
  38 + if seq10:
  39 + lines.append(f"- Top-10 Labels: `{seq10}`")
  40 + seq20 = str(item.get("top_label_sequence_top20") or "").strip()
  41 + if seq20 and seq20 != seq10:
  42 + lines.append(f"- Top-20 Labels: `{seq20}`")
  43 + top_results = item.get("top_results") or []
  44 + if not top_results:
  45 + return
  46 + lines.append("- Case Snapshot:")
  47 + for result in top_results[:5]:
  48 + rank = int(result.get("rank") or 0)
  49 + label = _label_level_code(str(result.get("label") or ""))
  50 + spu_id = str(result.get("spu_id") or "")
  51 + title = str(result.get("title") or "")
  52 + title_zh = str(result.get("title_zh") or "")
  53 + relevance_score = result.get("relevance_score")
  54 + score_suffix = f" (rel={relevance_score})" if relevance_score not in (None, "") else ""
  55 + lines.append(f" - #{rank} [{label}] spu={spu_id} {title}{score_suffix}")
  56 + if title_zh:
  57 + lines.append(f" zh: {title_zh}")
  58 +
  59 +
28 60 def render_batch_report_markdown(payload: Dict[str, Any]) -> str:
29 61 lines = [
30 62 "# Search Batch Evaluation",
... ... @@ -56,10 +88,10 @@ def render_batch_report_markdown(payload: Dict[str, Any]) -&gt; str:
56 88 "",
57 89 "## Label Distribution",
58 90 "",
59   - f"- Fully Relevant: {distribution.get(RELEVANCE_EXACT, 0)}",
60   - f"- Mostly Relevant: {distribution.get(RELEVANCE_HIGH, 0)}",
61   - f"- Weakly Relevant: {distribution.get(RELEVANCE_LOW, 0)}",
62   - f"- Irrelevant: {distribution.get(RELEVANCE_IRRELEVANT, 0)}",
  91 + f"- Fully Relevant: {distribution.get(RELEVANCE_LV3, 0)}",
  92 + f"- Mostly Relevant: {distribution.get(RELEVANCE_LV2, 0)}",
  93 + f"- Weakly Relevant: {distribution.get(RELEVANCE_LV1, 0)}",
  94 + f"- Irrelevant: {distribution.get(RELEVANCE_LV0, 0)}",
63 95 ]
64 96 )
65 97 lines.extend(["", "## Per Query", ""])
... ... @@ -68,9 +100,10 @@ def render_batch_report_markdown(payload: Dict[str, Any]) -&gt; str:
68 100 lines.append("")
69 101 _append_metric_block(lines, item.get("metrics") or {})
70 102 distribution = item.get("distribution") or {}
71   - lines.append(f"- Fully Relevant: {distribution.get(RELEVANCE_EXACT, 0)}")
72   - lines.append(f"- Mostly Relevant: {distribution.get(RELEVANCE_HIGH, 0)}")
73   - lines.append(f"- Weakly Relevant: {distribution.get(RELEVANCE_LOW, 0)}")
74   - lines.append(f"- Irrelevant: {distribution.get(RELEVANCE_IRRELEVANT, 0)}")
  103 + lines.append(f"- Fully Relevant: {distribution.get(RELEVANCE_LV3, 0)}")
  104 + lines.append(f"- Mostly Relevant: {distribution.get(RELEVANCE_LV2, 0)}")
  105 + lines.append(f"- Weakly Relevant: {distribution.get(RELEVANCE_LV1, 0)}")
  106 + lines.append(f"- Irrelevant: {distribution.get(RELEVANCE_LV0, 0)}")
  107 + _append_case_snapshot(lines, item)
75 108 lines.append("")
76 109 return "\n".join(lines)
... ...
scripts/evaluation/eval_framework/static/eval_web.js
... ... @@ -190,7 +190,7 @@ async function loadQueries() {
190 190  
191 191 function historySummaryHtml(meta) {
192 192 const m = meta && meta.aggregate_metrics;
193   - const nq = (meta && meta.queries && meta.queries.length) || (meta && meta.per_query && meta.per_query.length) || null;
  193 + const nq = (meta && meta.query_count) || (meta && meta.queries && meta.queries.length) || (meta && meta.per_query && meta.per_query.length) || null;
194 194 const parts = [];
195 195 if (nq != null) parts.push(`<span>Queries</span> ${nq}`);
196 196 if (m && m["Primary_Metric_Score"] != null) parts.push(`<span>Primary</span> ${fmtNumber(m["Primary_Metric_Score"])}`);
... ...
scripts/evaluation/eval_framework/store.py
... ... @@ -23,6 +23,18 @@ class QueryBuildResult:
23 23 output_json_path: Path
24 24  
25 25  
  26 +def _compact_batch_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
  27 + return {
  28 + "batch_id": metadata.get("batch_id"),
  29 + "created_at": metadata.get("created_at"),
  30 + "tenant_id": metadata.get("tenant_id"),
  31 + "top_k": metadata.get("top_k"),
  32 + "query_count": len(metadata.get("queries") or []),
  33 + "aggregate_metrics": dict(metadata.get("aggregate_metrics") or {}),
  34 + "metric_context": dict(metadata.get("metric_context") or {}),
  35 + }
  36 +
  37 +
26 38 class EvalStore:
27 39 def __init__(self, db_path: Path):
28 40 self.db_path = db_path
... ... @@ -339,6 +351,7 @@ class EvalStore:
339 351 ).fetchall()
340 352 items: List[Dict[str, Any]] = []
341 353 for row in rows:
  354 + metadata = json.loads(row["metadata_json"])
342 355 items.append(
343 356 {
344 357 "batch_id": row["batch_id"],
... ... @@ -346,7 +359,7 @@ class EvalStore:
346 359 "output_json_path": row["output_json_path"],
347 360 "report_markdown_path": row["report_markdown_path"],
348 361 "config_snapshot_path": row["config_snapshot_path"],
349   - "metadata": json.loads(row["metadata_json"]),
  362 + "metadata": _compact_batch_metadata(metadata),
350 363 "created_at": row["created_at"],
351 364 }
352 365 )
... ...
scripts/evaluation/offline_ltr_fit.py
... ... @@ -23,11 +23,11 @@ if str(PROJECT_ROOT) not in sys.path:
23 23  
24 24 from scripts.evaluation.eval_framework.constants import (
25 25 DEFAULT_ARTIFACT_ROOT,
26   - RELEVANCE_EXACT,
27 26 RELEVANCE_GRADE_MAP,
28   - RELEVANCE_HIGH,
29   - RELEVANCE_IRRELEVANT,
30   - RELEVANCE_LOW,
  27 + RELEVANCE_LV0,
  28 + RELEVANCE_LV1,
  29 + RELEVANCE_LV2,
  30 + RELEVANCE_LV3,
31 31 )
32 32 from scripts.evaluation.eval_framework.metrics import aggregate_metrics, compute_query_metrics
33 33 from scripts.evaluation.eval_framework.store import EvalStore
... ... @@ -35,10 +35,10 @@ from scripts.evaluation.eval_framework.utils import ensure_dir, utc_timestamp
35 35  
36 36  
37 37 LABELS_BY_GRADE = {
38   - 3: RELEVANCE_EXACT,
39   - 2: RELEVANCE_HIGH,
40   - 1: RELEVANCE_LOW,
41   - 0: RELEVANCE_IRRELEVANT,
  38 + 3: RELEVANCE_LV3,
  39 + 2: RELEVANCE_LV2,
  40 + 1: RELEVANCE_LV1,
  41 + 0: RELEVANCE_LV0,
42 42 }
43 43  
44 44  
... ...
scripts/frontend/frontend_server.py 0 → 100755
... ... @@ -0,0 +1,278 @@
  1 +#!/usr/bin/env python3
  2 +"""
  3 +Simple HTTP server for saas-search frontend.
  4 +"""
  5 +
  6 +import http.server
  7 +import socketserver
  8 +import os
  9 +import sys
  10 +import logging
  11 +import time
  12 +import urllib.request
  13 +import urllib.error
  14 +from collections import defaultdict, deque
  15 +from pathlib import Path
  16 +from dotenv import load_dotenv
  17 +
  18 +# Load .env file
  19 +project_root = Path(__file__).resolve().parents[2]
  20 +load_dotenv(project_root / '.env')
  21 +
  22 +# Get API_BASE_URL from environment (not injected by default, so a stale .env cannot override the same-origin policy)
  23 +# window.API_BASE_URL is injected only when FRONTEND_INJECT_API_BASE_URL=1 is set explicitly.
  24 +API_BASE_URL = os.getenv('API_BASE_URL') or None
  25 +INJECT_API_BASE_URL = os.getenv('FRONTEND_INJECT_API_BASE_URL', '0') == '1'
  26 +# Backend proxy target for same-origin API forwarding
  27 +BACKEND_PROXY_URL = os.getenv('BACKEND_PROXY_URL', 'http://127.0.0.1:6002').rstrip('/')
  28 +
  29 +# Change to frontend directory
  30 +frontend_dir = os.path.join(project_root, 'frontend')
  31 +os.chdir(frontend_dir)
  32 +
  33 +# FRONTEND_PORT is the canonical config; keep PORT as a secondary fallback.
  34 +PORT = int(os.getenv('FRONTEND_PORT', os.getenv('PORT', 6003)))
  35 +
  36 +# Configure logging to suppress scanner noise
  37 +logging.basicConfig(level=logging.ERROR, format='%(asctime)s - %(levelname)s - %(message)s')
  38 +
  39 +class RateLimitingMixin:
  40 + """Mixin for rate limiting requests by IP address."""
  41 + request_counts = defaultdict(deque)
  42 + rate_limit = 100 # requests per minute
  43 + window = 60 # seconds
  44 +
  45 + @classmethod
  46 + def is_rate_limited(cls, ip):
  47 + now = time.time()
  48 +
  49 + # Clean old requests
  50 + while cls.request_counts[ip] and cls.request_counts[ip][0] < now - cls.window:
  51 + cls.request_counts[ip].popleft()
  52 +
  53 + # Check rate limit
  54 +        if len(cls.request_counts[ip]) >= cls.rate_limit:
  55 + return True
  56 +
  57 + cls.request_counts[ip].append(now)
  58 + return False
  59 +
  60 +class MyHTTPRequestHandler(http.server.SimpleHTTPRequestHandler, RateLimitingMixin):
  61 + """Custom request handler with CORS support and robust error handling."""
  62 +
  63 + _ALLOWED_CORS_HEADERS = "Content-Type, X-Tenant-ID, X-Request-ID, Referer"
  64 +
  65 + def _is_proxy_path(self, path: str) -> bool:
  66 + """Return True for API paths that should be forwarded to backend service."""
  67 + return path.startswith('/search/') or path.startswith('/admin/') or path.startswith('/indexer/')
  68 +
  69 + def _proxy_to_backend(self):
  70 + """Proxy current request to backend service on the GPU server."""
  71 + target_url = f"{BACKEND_PROXY_URL}{self.path}"
  72 + method = self.command.upper()
  73 +
  74 + try:
  75 + content_length = int(self.headers.get('Content-Length', '0'))
  76 + except ValueError:
  77 + content_length = 0
  78 + body = self.rfile.read(content_length) if content_length > 0 else None
  79 +
  80 + forward_headers = {}
  81 + for key, value in self.headers.items():
  82 + lk = key.lower()
  83 + if lk in ('host', 'content-length', 'connection'):
  84 + continue
  85 + forward_headers[key] = value
  86 +
  87 + req = urllib.request.Request(
  88 + target_url,
  89 + data=body,
  90 + headers=forward_headers,
  91 + method=method,
  92 + )
  93 +
  94 + try:
  95 + with urllib.request.urlopen(req, timeout=30) as resp:
  96 + resp_body = resp.read()
  97 + self.send_response(resp.getcode())
  98 + for header, value in resp.getheaders():
  99 + lh = header.lower()
  100 + if lh in ('transfer-encoding', 'connection', 'content-length'):
  101 + continue
  102 + self.send_header(header, value)
  103 + self.end_headers()
  104 + self.wfile.write(resp_body)
  105 + except urllib.error.HTTPError as e:
  106 + err_body = e.read() if hasattr(e, 'read') else b''
  107 + self.send_response(e.code)
  108 + if e.headers:
  109 + for header, value in e.headers.items():
  110 + lh = header.lower()
  111 + if lh in ('transfer-encoding', 'connection', 'content-length'):
  112 + continue
  113 + self.send_header(header, value)
  114 + self.end_headers()
  115 + if err_body:
  116 + self.wfile.write(err_body)
  117 + except Exception as e:
  118 + logging.error(f"Backend proxy error for {method} {self.path}: {e}")
  119 + self.send_response(502)
  120 + self.send_header('Content-Type', 'application/json; charset=utf-8')
  121 + self.end_headers()
  122 + self.wfile.write(b'{"error":"Bad Gateway: backend proxy failed"}')
  123 +
  124 + def do_GET(self):
  125 + """Handle GET requests with API config injection."""
  126 + path = self.path.split('?')[0]
  127 +
  128 + # Proxy API paths to backend first
  129 + if self._is_proxy_path(path):
  130 + self._proxy_to_backend()
  131 + return
  132 +
  133 + # Route / to index.html
  134 + if path == '/' or path == '':
  135 +            self.path = '/index.html' + ('?' + self.path.split('?', 1)[1] if '?' in self.path else '')
  136 +
  137 + # Inject API config for HTML files
  138 + if self.path.endswith('.html'):
  139 + self._serve_html_with_config()
  140 + else:
  141 + super().do_GET()
  142 +
  143 + def _serve_html_with_config(self):
  144 + """Serve HTML with optional API_BASE_URL injected."""
  145 + try:
  146 + file_path = self.path.lstrip('/')
  147 + if not os.path.exists(file_path):
  148 + self.send_error(404)
  149 + return
  150 +
  151 + with open(file_path, 'r', encoding='utf-8') as f:
  152 + html = f.read()
  153 +
  154 +        # API_BASE_URL is not injected by default, so a legacy .env (e.g. http://xx:6002) cannot override same-origin calls.
  155 +        # Inject only when FRONTEND_INJECT_API_BASE_URL=1 is set and API_BASE_URL is non-empty.
  156 + if INJECT_API_BASE_URL and API_BASE_URL:
  157 + config_script = f'<script>window.API_BASE_URL="{API_BASE_URL}";</script>\n '
  158 + html = html.replace('<script src="/static/js/app.js', config_script + '<script src="/static/js/app.js', 1)
  159 +
  160 + self.send_response(200)
  161 + self.send_header('Content-Type', 'text/html; charset=utf-8')
  162 + self.end_headers()
  163 + self.wfile.write(html.encode('utf-8'))
  164 + except Exception as e:
  165 + logging.error(f"Error serving HTML: {e}")
  166 + self.send_error(500)
  167 +
  168 + def do_POST(self):
  169 + """Handle POST requests. Proxy API requests to backend."""
  170 + path = self.path.split('?')[0]
  171 + if self._is_proxy_path(path):
  172 + self._proxy_to_backend()
  173 + return
  174 + self.send_error(405, "Method Not Allowed")
  175 +
  176 + def setup(self):
  177 + """Setup with error handling."""
  178 + try:
  179 + super().setup()
  180 + except Exception:
  181 + pass # Silently handle setup errors from scanners
  182 +
  183 + def handle_one_request(self):
  184 + """Handle single request with error catching."""
  185 + try:
  186 + # Check rate limiting
  187 + client_ip = self.client_address[0]
  188 + if self.is_rate_limited(client_ip):
  189 + logging.warning(f"Rate limiting IP: {client_ip}")
  190 + self.send_error(429, "Too Many Requests")
  191 + return
  192 +
  193 + super().handle_one_request()
  194 + except (ConnectionResetError, BrokenPipeError):
  195 + # Client disconnected prematurely - common with scanners
  196 + pass
  197 + except UnicodeDecodeError:
  198 + # Binary data received - not HTTP
  199 + pass
  200 + except Exception as e:
  201 + # Log unexpected errors but don't crash
  202 + logging.debug(f"Request handling error: {e}")
  203 +
  204 + def log_message(self, format, *args):
  205 + """Suppress logging for malformed requests from scanners."""
  206 + message = format % args
  207 + # Filter out scanner noise
  208 + noise_patterns = [
  209 + "code 400",
  210 + "Bad request",
  211 + "Bad request version",
  212 + "Bad HTTP/0.9 request type",
  213 + "Bad request syntax"
  214 + ]
  215 + if any(pattern in message for pattern in noise_patterns):
  216 + return
  217 + # Only log legitimate requests
  218 + if message and not message.startswith(" ") and len(message) > 10:
  219 + super().log_message(format, *args)
  220 +
  221 + def end_headers(self):
  222 + # Add CORS headers
  223 + self.send_header('Access-Control-Allow-Origin', '*')
  224 + self.send_header('Access-Control-Allow-Methods', 'GET, POST, OPTIONS')
  225 + self.send_header('Access-Control-Allow-Headers', self._ALLOWED_CORS_HEADERS)
  226 + # Add security headers
  227 + self.send_header('X-Content-Type-Options', 'nosniff')
  228 + self.send_header('X-Frame-Options', 'DENY')
  229 + self.send_header('X-XSS-Protection', '1; mode=block')
  230 + super().end_headers()
  231 +
  232 + def do_OPTIONS(self):
  233 + """Handle OPTIONS requests."""
  234 + try:
  235 + path = self.path.split('?')[0]
  236 + if self._is_proxy_path(path):
  237 + self.send_response(204)
  238 + self.end_headers()
  239 + return
  240 + self.send_response(200)
  241 + self.end_headers()
  242 + except Exception:
  243 + pass
  244 +
  245 +class ThreadedTCPServer(socketserver.ThreadingMixIn, socketserver.TCPServer):
  246 + """Threaded TCP server with better error handling."""
  247 + allow_reuse_address = True
  248 + daemon_threads = True
  249 +
  250 +if __name__ == '__main__':
  251 + # Check if port is already in use
  252 + import socket
  253 + sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  254 + try:
  255 + sock.bind(("", PORT))
  256 + sock.close()
  257 + except OSError:
  258 + print(f"ERROR: Port {PORT} is already in use.")
  259 +        print("Please stop the existing server or use a different port.")
  260 + print(f"To stop existing server: kill $(lsof -t -i:{PORT})")
  261 + sys.exit(1)
  262 +
  263 + # Create threaded server for better concurrency
  264 + with ThreadedTCPServer(("", PORT), MyHTTPRequestHandler) as httpd:
  265 + print(f"Frontend server started at http://localhost:{PORT}")
  266 + print(f"Serving files from: {os.getcwd()}")
  267 + print("\nPress Ctrl+C to stop the server")
  268 +
  269 + try:
  270 + httpd.serve_forever()
  271 + except KeyboardInterrupt:
  272 + print("\nShutting down server...")
  273 + httpd.shutdown()
  274 + print("Server stopped")
  275 + sys.exit(0)
  276 + except Exception as e:
  277 + print(f"Server error: {e}")
  278 + sys.exit(1)
... ...
scripts/frontend_server.py 100755 → 100644
1 1 #!/usr/bin/env python3
2   -"""
3   -Simple HTTP server for saas-search frontend.
4   -"""
  2 +"""Backward-compatible frontend server entrypoint."""
5 3  
6   -import http.server
7   -import socketserver
8   -import os
9   -import sys
10   -import logging
11   -import time
12   -import urllib.request
13   -import urllib.error
14   -from collections import defaultdict, deque
15   -from pathlib import Path
16   -from dotenv import load_dotenv
17   -
18   -# Load .env file
19   -project_root = Path(__file__).parent.parent
20   -load_dotenv(project_root / '.env')
21   -
22   -# Get API_BASE_URL from environment (not injected by default, so a stale .env cannot override the same-origin policy)
23   -# window.API_BASE_URL is injected only when FRONTEND_INJECT_API_BASE_URL=1 is set explicitly.
24   -API_BASE_URL = os.getenv('API_BASE_URL') or None
25   -INJECT_API_BASE_URL = os.getenv('FRONTEND_INJECT_API_BASE_URL', '0') == '1'
26   -# Backend proxy target for same-origin API forwarding
27   -BACKEND_PROXY_URL = os.getenv('BACKEND_PROXY_URL', 'http://127.0.0.1:6002').rstrip('/')
28   -
29   -# Change to frontend directory
30   -frontend_dir = os.path.join(os.path.dirname(__file__), '../frontend')
31   -os.chdir(frontend_dir)
32   -
33   -# FRONTEND_PORT is the canonical config; keep PORT as a secondary fallback.
34   -PORT = int(os.getenv('FRONTEND_PORT', os.getenv('PORT', 6003)))
35   -
36   -# Configure logging to suppress scanner noise
37   -logging.basicConfig(level=logging.ERROR, format='%(asctime)s - %(levelname)s - %(message)s')
38   -
39   -class RateLimitingMixin:
40   - """Mixin for rate limiting requests by IP address."""
41   - request_counts = defaultdict(deque)
42   - rate_limit = 100 # requests per minute
43   - window = 60 # seconds
44   -
45   - @classmethod
46   - def is_rate_limited(cls, ip):
47   - now = time.time()
48   -
49   - # Clean old requests
50   - while cls.request_counts[ip] and cls.request_counts[ip][0] < now - cls.window:
51   - cls.request_counts[ip].popleft()
52   -
53   - # Check rate limit
54   - if len(cls.request_counts[ip]) > cls.rate_limit:
55   - return True
56   -
57   - cls.request_counts[ip].append(now)
58   - return False
59   -
60   -class MyHTTPRequestHandler(http.server.SimpleHTTPRequestHandler, RateLimitingMixin):
61   - """Custom request handler with CORS support and robust error handling."""
62   -
63   - def _is_proxy_path(self, path: str) -> bool:
64   - """Return True for API paths that should be forwarded to backend service."""
65   - return path.startswith('/search/') or path.startswith('/admin/') or path.startswith('/indexer/')
66   -
67   - def _proxy_to_backend(self):
68   - """Proxy current request to backend service on the GPU server."""
69   - target_url = f"{BACKEND_PROXY_URL}{self.path}"
70   - method = self.command.upper()
71   -
72   - try:
73   - content_length = int(self.headers.get('Content-Length', '0'))
74   - except ValueError:
75   - content_length = 0
76   - body = self.rfile.read(content_length) if content_length > 0 else None
  4 +from __future__ import annotations
77 5  
78   - forward_headers = {}
79   - for key, value in self.headers.items():
80   - lk = key.lower()
81   - if lk in ('host', 'content-length', 'connection'):
82   - continue
83   - forward_headers[key] = value
84   -
85   - req = urllib.request.Request(
86   - target_url,
87   - data=body,
88   - headers=forward_headers,
89   - method=method,
90   - )
91   -
92   - try:
93   - with urllib.request.urlopen(req, timeout=30) as resp:
94   - resp_body = resp.read()
95   - self.send_response(resp.getcode())
96   - for header, value in resp.getheaders():
97   - lh = header.lower()
98   - if lh in ('transfer-encoding', 'connection', 'content-length'):
99   - continue
100   - self.send_header(header, value)
101   - self.end_headers()
102   - self.wfile.write(resp_body)
103   - except urllib.error.HTTPError as e:
104   - err_body = e.read() if hasattr(e, 'read') else b''
105   - self.send_response(e.code)
106   - if e.headers:
107   - for header, value in e.headers.items():
108   - lh = header.lower()
109   - if lh in ('transfer-encoding', 'connection', 'content-length'):
110   - continue
111   - self.send_header(header, value)
112   - self.end_headers()
113   - if err_body:
114   - self.wfile.write(err_body)
115   - except Exception as e:
116   - logging.error(f"Backend proxy error for {method} {self.path}: {e}")
117   - self.send_response(502)
118   - self.send_header('Content-Type', 'application/json; charset=utf-8')
119   - self.end_headers()
120   - self.wfile.write(b'{"error":"Bad Gateway: backend proxy failed"}')
121   -
122   - def do_GET(self):
123   - """Handle GET requests with API config injection."""
124   - path = self.path.split('?')[0]
125   -
126   - # Proxy API paths to backend first
127   - if self._is_proxy_path(path):
128   - self._proxy_to_backend()
129   - return
130   -
131   - # Route / to index.html
132   - if path == '/' or path == '':
133   - self.path = '/index.html' + (self.path.split('?', 1)[1] if '?' in self.path else '')
134   -
135   - # Inject API config for HTML files
136   - if self.path.endswith('.html'):
137   - self._serve_html_with_config()
138   - else:
139   - super().do_GET()
140   -
141   - def _serve_html_with_config(self):
142   - """Serve HTML with optional API_BASE_URL injected."""
143   - try:
144   - file_path = self.path.lstrip('/')
145   - if not os.path.exists(file_path):
146   - self.send_error(404)
147   - return
148   -
149   - with open(file_path, 'r', encoding='utf-8') as f:
150   - html = f.read()
151   -
152   -        # API_BASE_URL is not injected by default, so a legacy .env (e.g. http://xx:6002) cannot override same-origin calls.
153   -        # Inject only when FRONTEND_INJECT_API_BASE_URL=1 is set and API_BASE_URL is non-empty.
154   - if INJECT_API_BASE_URL and API_BASE_URL:
155   - config_script = f'<script>window.API_BASE_URL="{API_BASE_URL}";</script>\n '
156   - html = html.replace('<script src="/static/js/app.js', config_script + '<script src="/static/js/app.js', 1)
157   -
158   - self.send_response(200)
159   - self.send_header('Content-Type', 'text/html; charset=utf-8')
160   - self.end_headers()
161   - self.wfile.write(html.encode('utf-8'))
162   - except Exception as e:
163   - logging.error(f"Error serving HTML: {e}")
164   - self.send_error(500)
165   -
166   - def do_POST(self):
167   - """Handle POST requests. Proxy API requests to backend."""
168   - path = self.path.split('?')[0]
169   - if self._is_proxy_path(path):
170   - self._proxy_to_backend()
171   - return
172   - self.send_error(405, "Method Not Allowed")
173   -
174   - def setup(self):
175   - """Setup with error handling."""
176   - try:
177   - super().setup()
178   - except Exception:
179   - pass # Silently handle setup errors from scanners
180   -
181   - def handle_one_request(self):
182   - """Handle single request with error catching."""
183   - try:
184   - # Check rate limiting
185   - client_ip = self.client_address[0]
186   - if self.is_rate_limited(client_ip):
187   - logging.warning(f"Rate limiting IP: {client_ip}")
188   - self.send_error(429, "Too Many Requests")
189   - return
190   -
191   - super().handle_one_request()
192   - except (ConnectionResetError, BrokenPipeError):
193   - # Client disconnected prematurely - common with scanners
194   - pass
195   - except UnicodeDecodeError:
196   - # Binary data received - not HTTP
197   - pass
198   - except Exception as e:
199   - # Log unexpected errors but don't crash
200   - logging.debug(f"Request handling error: {e}")
201   -
202   - def log_message(self, format, *args):
203   - """Suppress logging for malformed requests from scanners."""
204   - message = format % args
205   - # Filter out scanner noise
206   - noise_patterns = [
207   - "code 400",
208   - "Bad request",
209   - "Bad request version",
210   - "Bad HTTP/0.9 request type",
211   - "Bad request syntax"
212   - ]
213   - if any(pattern in message for pattern in noise_patterns):
214   - return
215   - # Only log legitimate requests
216   - if message and not message.startswith(" ") and len(message) > 10:
217   - super().log_message(format, *args)
218   -
219   - def end_headers(self):
220   - # Add CORS headers
221   - self.send_header('Access-Control-Allow-Origin', '*')
222   - self.send_header('Access-Control-Allow-Methods', 'GET, POST, OPTIONS')
223   - self.send_header('Access-Control-Allow-Headers', 'Content-Type')
224   - # Add security headers
225   - self.send_header('X-Content-Type-Options', 'nosniff')
226   - self.send_header('X-Frame-Options', 'DENY')
227   - self.send_header('X-XSS-Protection', '1; mode=block')
228   - super().end_headers()
229   -
230   - def do_OPTIONS(self):
231   - """Handle OPTIONS requests."""
232   - try:
233   - path = self.path.split('?')[0]
234   - if self._is_proxy_path(path):
235   - self.send_response(204)
236   - self.end_headers()
237   - return
238   - self.send_response(200)
239   - self.end_headers()
240   - except Exception:
241   - pass
242   -
243   -class ThreadedTCPServer(socketserver.ThreadingMixIn, socketserver.TCPServer):
244   - """Threaded TCP server with better error handling."""
245   - allow_reuse_address = True
246   - daemon_threads = True
  6 +import runpy
  7 +from pathlib import Path
247 8  
248   -if __name__ == '__main__':
249   - # Check if port is already in use
250   - import socket
251   - sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
252   - try:
253   - sock.bind(("", PORT))
254   - sock.close()
255   - except OSError:
256   - print(f"ERROR: Port {PORT} is already in use.")
257   - print(f"Please stop the existing server or use a different port.")
258   - print(f"To stop existing server: kill $(lsof -t -i:{PORT})")
259   - sys.exit(1)
260   -
261   - # Create threaded server for better concurrency
262   - with ThreadedTCPServer(("", PORT), MyHTTPRequestHandler) as httpd:
263   - print(f"Frontend server started at http://localhost:{PORT}")
264   - print(f"Serving files from: {os.getcwd()}")
265   - print("\nPress Ctrl+C to stop the server")
266 9  
267   - try:
268   - httpd.serve_forever()
269   - except KeyboardInterrupt:
270   - print("\nShutting down server...")
271   - httpd.shutdown()
272   - print("Server stopped")
273   - sys.exit(0)
274   - except Exception as e:
275   - print(f"Server error: {e}")
276   - sys.exit(1)
  10 +if __name__ == "__main__":
  11 + target = Path(__file__).resolve().parent / "frontend" / "frontend_server.py"
  12 + runpy.run_path(str(target), run_name="__main__")
... ...
scripts/inspect/README.md 0 → 100644
... ... @@ -0,0 +1,10 @@
  1 +# Inspect Scripts
  2 +
  3 +These scripts are used for one-off diagnostics, index inspection, and data verification:
  4 +
  5 +- `check_data_source.py`
  6 +- `check_es_data.py`
  7 +- `check_index_mapping.py`
  8 +- `compare_index_mappings.py`
  9 +
  10 +They depend on a real DB / ES environment and are not part of CI tests or benchmarks.
... ...
scripts/check_data_source.py renamed to scripts/inspect/check_data_source.py
... ... @@ -14,8 +14,8 @@ import argparse
14 14 from pathlib import Path
15 15 from sqlalchemy import create_engine, text
16 16  
17   -# Add parent directory to path
18   -sys.path.insert(0, str(Path(__file__).parent.parent))
  17 +# Add repo root to path
  18 +sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
19 19  
20 20 from utils.db_connector import create_db_connection
21 21  
... ... @@ -298,4 +298,3 @@ def main():
298 298  
299 299 if __name__ == '__main__':
300 300 sys.exit(main())
301   -
... ...
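The path-bootstrap change above is mechanical but easy to get wrong: after moving a script from `scripts/` down into `scripts/inspect/`, the repo root sits two levels above the file's directory, which `Path.parents[2]` expresses directly (and `resolve()` normalizes symlinks and relative invocation paths first). A quick illustration with a hypothetical path:

```python
from pathlib import Path

# For a file at <root>/scripts/inspect/check_data_source.py, the repo root
# is parents[2] of the file path; parents[0] is the containing directory.
p = Path("/repo/scripts/inspect/check_data_source.py")
assert p.parents[0] == Path("/repo/scripts/inspect")
assert p.parents[1] == Path("/repo/scripts")
assert p.parents[2] == Path("/repo")

# The old one-level form would now point at scripts/, not the repo root:
assert p.parent.parent == Path("/repo/scripts")
```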
scripts/check_es_data.py renamed to scripts/inspect/check_es_data.py
... ... @@ -8,7 +8,7 @@ import os
8 8 import argparse
9 9 from pathlib import Path
10 10  
11   -sys.path.insert(0, str(Path(__file__).parent.parent))
  11 +sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
12 12  
13 13 from utils.es_client import ESClient
14 14  
... ... @@ -265,4 +265,3 @@ def main():
265 265  
266 266 if __name__ == '__main__':
267 267 sys.exit(main())
268   -
... ...
scripts/check_index_mapping.py renamed to scripts/inspect/check_index_mapping.py
... ... @@ -8,7 +8,7 @@ import sys
8 8 import json
9 9 from pathlib import Path
10 10  
11   -sys.path.insert(0, str(Path(__file__).parent.parent))
  11 +sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
12 12  
13 13 from utils.es_client import get_es_client_from_env
14 14 from indexer.mapping_generator import get_tenant_index_name
... ...
scripts/compare_index_mappings.py renamed to scripts/inspect/compare_index_mappings.py
... ... @@ -9,7 +9,7 @@ import json
9 9 from pathlib import Path
10 10 from typing import Dict, Any
11 11  
12   -sys.path.insert(0, str(Path(__file__).parent.parent))
  12 +sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
13 13  
14 14 from utils.es_client import get_es_client_from_env
15 15  
... ... @@ -186,4 +186,3 @@ def main():
186 186  
187 187 if __name__ == '__main__':
188 188 sys.exit(main())
189   -
... ...
scripts/temp_embed_tenant_image_urls.py renamed to scripts/maintenance/embed_tenant_image_urls.py
... ... @@ -5,7 +5,7 @@
5 5  
6 6 Usage:
7 7 source activate.sh # loads .env and provides ES_HOST / ES_USERNAME / ES_PASSWORD
8   - python scripts/temp_embed_tenant_image_urls.py
  8 + python scripts/maintenance/embed_tenant_image_urls.py
9 9  
10 10 If not sourced, the script will also try to load .env from the project root.
11 11 """
... ... @@ -30,7 +30,7 @@ from elasticsearch.helpers import scan
30 30 try:
31 31 from dotenv import load_dotenv
32 32  
33   - _ROOT = Path(__file__).resolve().parents[1]
  33 + _ROOT = Path(__file__).resolve().parents[2]
34 34 load_dotenv(_ROOT / ".env")
35 35 except ImportError:
36 36 pass
... ...
scripts/ops/README.md 0 → 100644
... ... @@ -0,0 +1,8 @@
  1 +# Ops Scripts
  2 +
  3 +These are helper scripts used during service orchestration:
  4 +
  5 +- `daily_log_router.sh`: routes logs into per-day files
  6 +- `wechat_alert.py`: sends monitoring alerts
  7 +
  8 +If other startup scripts reference these files, use the fixed paths here instead of copying out new duplicates of the same tools.
... ...
scripts/daily_log_router.sh renamed to scripts/ops/daily_log_router.sh
... ... @@ -3,7 +3,7 @@
3 3 # Route incoming log stream into per-day files.
4 4 #
5 5 # Usage:
6   -# command 2>&1 | ./scripts/daily_log_router.sh <service> <log_dir> [retention_days]
  6 +# command 2>&1 | ./scripts/ops/daily_log_router.sh <service> <log_dir> [retention_days]
7 7 #
8 8  
9 9 set -euo pipefail
... ...
scripts/wechat_alert.py renamed to scripts/ops/wechat_alert.py
... ... @@ -6,7 +6,7 @@ This module is intentionally small and focused so that Bash-based monitors
6 6 can invoke it without pulling in the full application stack.
7 7  
8 8 Usage example:
9   - python scripts/wechat_alert.py --service backend --level error --message "backend restarted"
  9 + python scripts/ops/wechat_alert.py --service backend --level error --message "backend restarted"
10 10 """
11 11  
12 12 import argparse
... ... @@ -101,4 +101,3 @@ def main(argv: list[str] | None = None) -> int:
101 101  
102 102 if __name__ == "__main__":
103 103 raise SystemExit(main())
104   -
... ...
scripts/monitor_eviction.py renamed to scripts/redis/monitor_eviction.py
... ... @@ -12,7 +12,7 @@ from pathlib import Path
12 12 from datetime import datetime
13 13  
14 14 # Add project root to path
15   -project_root = Path(__file__).parent.parent
  15 +project_root = Path(__file__).resolve().parents[2]
16 16 sys.path.insert(0, str(project_root))
17 17  
18 18 from config.env_config import REDIS_CONFIG
... ...
scripts/service_ctl.sh
... ... @@ -20,6 +20,7 @@ CORE_SERVICES=("backend" "indexer" "frontend" "eval-web")
20 20 OPTIONAL_SERVICES=("tei" "cnclip" "embedding" "embedding-image" "translator" "reranker")
21 21 FULL_SERVICES=("${OPTIONAL_SERVICES[@]}" "${CORE_SERVICES[@]}")
22 22 STOP_ORDER_SERVICES=("frontend" "eval-web" "indexer" "backend" "reranker" "translator" "embedding-image" "embedding" "cnclip" "tei")
  23 +declare -Ag SERVICE_ENABLED_CACHE=()
23 24  
24 25 all_services() {
25 26 echo "${FULL_SERVICES[@]}"
... ... @@ -33,6 +34,72 @@ config_python_bin() {
33 34 fi
34 35 }
35 36  
  37 +service_enabled_by_config() {
  38 + local service="$1"
  39 + case "${service}" in
  40 + reranker|reranker-fine|translator)
  41 + ;;
  42 + *)
  43 + return 0
  44 + ;;
  45 + esac
  46 +
  47 + if [ -n "${SERVICE_ENABLED_CACHE[${service}]+x}" ]; then
  48 + [ "${SERVICE_ENABLED_CACHE[${service}]}" = "1" ]
  49 + return
  50 + fi
  51 +
  52 + local pybin
  53 + pybin="$(config_python_bin)"
  54 +
  55 + local enabled
  56 + if ! enabled="$(
  57 + SERVICE_NAME="${service}" \
  58 + PYTHONPATH="${PROJECT_ROOT}${PYTHONPATH:+:${PYTHONPATH}}" \
  59 + "${pybin}" - <<'PY'
  60 +from config.loader import get_app_config
  61 +import os
  62 +
  63 +service = os.environ["SERVICE_NAME"]
  64 +cfg = get_app_config()
  65 +
  66 +enabled = True
  67 +if service == "reranker":
  68 + enabled = bool(cfg.search.rerank.enabled)
  69 +elif service == "reranker-fine":
  70 + enabled = bool(cfg.search.fine_rank.enabled)
  71 +elif service == "translator":
  72 + capabilities = dict(cfg.services.translation.capabilities or {})
  73 + enabled = any(bool((value or {}).get("enabled", True)) for value in capabilities.values())
  74 +
  75 +print("1" if enabled else "0")
  76 +PY
  77 + )"; then
  78 + echo "[warn] failed to read config state for ${service}; defaulting to enabled" >&2
  79 + enabled="1"
  80 + fi
  81 +
  82 + SERVICE_ENABLED_CACHE["${service}"]="${enabled}"
  83 + [ "${enabled}" = "1" ]
  84 +}
  85 +
  86 +filter_disabled_targets() {
  87 + local targets="$1"
  88 + local verbose="${2:-quiet}"
  89 + local out=""
  90 + local svc
  91 +
  92 + for svc in ${targets}; do
  93 + if service_enabled_by_config "${svc}"; then
  94 + out="${out} ${svc}"
  95 + elif [ "${verbose}" = "verbose" ]; then
  96 + echo "[skip] ${svc} disabled by config" >&2
  97 + fi
  98 + done
  99 +
  100 + echo "${out# }"
  101 +}
  102 +
36 103 reranker_instance_for_service() {
37 104 local service="$1"
38 105 case "${service}" in
... ... @@ -334,7 +401,7 @@ monitor_services() {
334 401 local fail_threshold="${MONITOR_FAIL_THRESHOLD:-3}"
335 402 local restart_cooldown_sec="${MONITOR_RESTART_COOLDOWN_SEC:-30}"
336 403 local max_restarts_per_hour="${MONITOR_MAX_RESTARTS_PER_HOUR:-6}"
337   - local wechat_alert_py="${PROJECT_ROOT}/scripts/wechat_alert.py"
  404 + local wechat_alert_py="${PROJECT_ROOT}/scripts/ops/wechat_alert.py"
338 405  
339 406 require_positive_int "MONITOR_INTERVAL_SEC" "${interval_sec}"
340 407 require_positive_int "MONITOR_FAIL_THRESHOLD" "${fail_threshold}"
... ... @@ -468,6 +535,16 @@ stop_monitor_daemon() {
468 535  
469 536 start_monitor_daemon() {
470 537 local targets="$1"
  538 + if [ -z "${targets}" ]; then
  539 + if is_monitor_daemon_running; then
  540 + echo "[info] no enabled services to monitor; stopping monitor daemon"
  541 + stop_monitor_daemon
  542 + else
  543 + echo "[info] no enabled services to monitor"
  544 + fi
  545 + return 0
  546 + fi
  547 +
471 548 local pf
472 549 pf="$(monitor_pid_file)"
473 550 local tf
... ... @@ -581,6 +658,10 @@ wait_for_startup_health() {
581 658 start_one() {
582 659 local service="$1"
583 660 cd "${PROJECT_ROOT}"
  661 + if ! service_enabled_by_config "${service}"; then
  662 + echo "[skip] ${service} disabled by config"
  663 + return 0
  664 + fi
584 665 local cmd
585 666 if ! cmd="$(service_start_cmd "${service}")"; then
586 667 echo "[error] unknown service: ${service}" >&2
... ... @@ -953,6 +1034,7 @@ main() {
953 1034  
954 1035 load_env_file "${PROJECT_ROOT}/.env"
955 1036 local targets=""
  1037 + local effective_targets=""
956 1038 local monitor_was_running=0
957 1039 local monitor_prev_targets=""
958 1040 local auto_monitor_on_start="${SERVICE_CTL_AUTO_MONITOR_ON_START:-1}"
... ... @@ -976,12 +1058,23 @@ main() {
976 1058 ;;
977 1059 esac
978 1060  
  1061 + effective_targets="${targets}"
  1062 + case "${action}" in
  1063 + up|start|restart|monitor|monitor-start)
  1064 + effective_targets="$(filter_disabled_targets "${targets}" "verbose")"
  1065 + ;;
  1066 + esac
  1067 +
979 1068 case "${action}" in
980 1069 up)
981   - for svc in ${targets}; do
  1070 + if [ -z "${effective_targets}" ]; then
  1071 + echo "[info] no enabled services in target set"
  1072 + exit 0
  1073 + fi
  1074 + for svc in ${effective_targets}; do
982 1075 start_one "${svc}"
983 1076 done
984   - start_monitor_daemon "${targets}"
  1077 + start_monitor_daemon "${effective_targets}"
985 1078 ;;
986 1079 down)
987 1080 stop_monitor_daemon
... ... @@ -990,11 +1083,15 @@ main() {
990 1083 done
991 1084 ;;
992 1085 start)
993   - for svc in ${targets}; do
  1086 + if [ -z "${effective_targets}" ]; then
  1087 + echo "[info] no enabled services in target set"
  1088 + exit 0
  1089 + fi
  1090 + for svc in ${effective_targets}; do
994 1091 start_one "${svc}"
995 1092 done
996 1093 if [ "${auto_monitor_on_start}" = "1" ]; then
997   - start_monitor_daemon "$(merge_targets "$(monitor_current_targets)" "${targets}")"
  1094 + start_monitor_daemon "$(merge_targets "$(monitor_current_targets)" "${effective_targets}")"
998 1095 fi
999 1096 ;;
1000 1097 stop)
... ... @@ -1025,16 +1122,17 @@ main() {
1025 1122 for svc in ${restart_stop_targets}; do
1026 1123 stop_one "${svc}"
1027 1124 done
1028   - for svc in ${targets}; do
  1125 + for svc in ${effective_targets}; do
1029 1126 start_one "${svc}"
1030 1127 done
1031 1128 if [ "${monitor_was_running}" -eq 1 ]; then
1032 1129 monitor_prev_targets="$(normalize_targets "${monitor_prev_targets}")"
  1130 + monitor_prev_targets="$(filter_disabled_targets "${monitor_prev_targets}" "quiet")"
1033 1131 monitor_prev_targets="$(apply_target_order monitor "${monitor_prev_targets}")"
1034   - [ -z "${monitor_prev_targets}" ] && monitor_prev_targets="${targets}"
  1132 + [ -z "${monitor_prev_targets}" ] && monitor_prev_targets="${effective_targets}"
1035 1133 start_monitor_daemon "${monitor_prev_targets}"
1036 1134 elif [ "${auto_monitor_on_start}" = "1" ]; then
1037   - start_monitor_daemon "$(merge_targets "$(monitor_current_targets)" "${targets}")"
  1135 + start_monitor_daemon "$(merge_targets "$(monitor_current_targets)" "${effective_targets}")"
1038 1136 fi
1039 1137 ;;
1040 1138 status)
... ... @@ -1044,10 +1142,14 @@ main() {
1044 1142 monitor_daemon_status
1045 1143 ;;
1046 1144 monitor)
1047   - monitor_services "${targets}"
  1145 + if [ -z "${effective_targets}" ]; then
  1146 + echo "[info] no enabled services in target set"
  1147 + exit 0
  1148 + fi
  1149 + monitor_services "${effective_targets}"
1048 1150 ;;
1049 1151 monitor-start)
1050   - start_monitor_daemon "${targets}"
  1152 + start_monitor_daemon "${effective_targets}"
1051 1153 ;;
1052 1154 monitor-stop)
1053 1155 stop_monitor_daemon
... ...
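The heredoc inside `service_enabled_by_config` boils down to a small decision function over the app config, with the result memoized in `SERVICE_ENABLED_CACHE` so the Python interpreter is spawned at most once per service. A rough sketch of the decision itself (the dict-shaped config is a stand-in for the real `config.loader` objects):

```python
# Mirrors the gating logic in the service_ctl.sh heredoc: only reranker,
# reranker-fine, and translator are config-gated; everything else is
# always considered enabled.
def service_enabled(service, cfg):
    if service == "reranker":
        return bool(cfg["search"]["rerank"]["enabled"])
    if service == "reranker-fine":
        return bool(cfg["search"]["fine_rank"]["enabled"])
    if service == "translator":
        capabilities = cfg["services"]["translation"].get("capabilities") or {}
        # translator stays up if any capability is enabled (default True)
        return any(bool((v or {}).get("enabled", True)) for v in capabilities.values())
    return True

cfg = {
    "search": {"rerank": {"enabled": False}, "fine_rank": {"enabled": True}},
    "services": {"translation": {"capabilities": {"zh-en": {"enabled": False}}}},
}
print(service_enabled("reranker", cfg), service_enabled("backend", cfg))  # → False True
```

Note the shell side deliberately fails open: if the config probe errors out, the service is treated as enabled rather than silently skipped.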
scripts/setup_translator_venv.sh
... ... @@ -8,8 +8,47 @@ PROJECT_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
8 8 cd "${PROJECT_ROOT}"
9 9  
10 10 VENV_DIR="${PROJECT_ROOT}/.venv-translator"
11   -PYTHON_BIN="${PYTHON_BIN:-python3}"
12 11 TMP_DIR="${TRANSLATOR_PIP_TMPDIR:-${PROJECT_ROOT}/.tmp/translator-pip}"
  12 +MIN_PYTHON_MAJOR=3
  13 +MIN_PYTHON_MINOR=10
  14 +
  15 +python_meets_minimum() {
  16 + local bin="$1"
  17 + "${bin}" - <<'PY' "${MIN_PYTHON_MAJOR}" "${MIN_PYTHON_MINOR}"
  18 +import sys
  19 +
  20 +required = tuple(int(value) for value in sys.argv[1:])
  21 +sys.exit(0 if sys.version_info[:2] >= required else 1)
  22 +PY
  23 +}
  24 +
  25 +discover_python_bin() {
  26 + local candidates=()
  27 +
  28 + if [[ -n "${PYTHON_BIN:-}" ]]; then
  29 + candidates+=("${PYTHON_BIN}")
  30 + fi
  31 + candidates+=("python3.12" "python3.11" "python3.10" "python3")
  32 +
  33 + local candidate
  34 + for candidate in "${candidates[@]}"; do
  35 + if ! command -v "${candidate}" >/dev/null 2>&1; then
  36 + continue
  37 + fi
  38 + if python_meets_minimum "${candidate}"; then
  39 + echo "${candidate}"
  40 + return 0
  41 + fi
  42 + done
  43 +
  44 + return 1
  45 +}
  46 +
  47 +if ! PYTHON_BIN="$(discover_python_bin)"; then
  48 + echo "ERROR: unable to find Python >= ${MIN_PYTHON_MAJOR}.${MIN_PYTHON_MINOR}." >&2
  49 + echo "Set PYTHON_BIN to a compatible interpreter and rerun." >&2
  50 + exit 1
  51 +fi
13 52  
14 53 if ! command -v "${PYTHON_BIN}" >/dev/null 2>&1; then
15 54 echo "ERROR: python not found: ${PYTHON_BIN}" >&2
... ... @@ -32,6 +71,7 @@ mkdir -p "${TMP_DIR}"
32 71 export TMPDIR="${TMP_DIR}"
33 72 PIP_ARGS=(--no-cache-dir)
34 73  
  74 +echo "Using Python=${PYTHON_BIN}"
35 75 echo "Using TMPDIR=${TMPDIR}"
36 76 "${VENV_DIR}/bin/python" -m pip install "${PIP_ARGS[@]}" --upgrade pip wheel
37 77 "${VENV_DIR}/bin/python" -m pip install "${PIP_ARGS[@]}" -r requirements_translator_service.txt
... ... @@ -39,5 +79,5 @@ echo "Using TMPDIR=${TMPDIR}"
39 79 echo
40 80 echo "Done."
41 81 echo "Translator venv: ${VENV_DIR}"
42   -echo "Download local models: ./.venv-translator/bin/python scripts/download_translation_models.py --all-local"
  82 +echo "Download local models: ./.venv-translator/bin/python scripts/translation/download_translation_models.py --all-local"
43 83 echo "Start service: ./scripts/start_translator.sh"
... ...
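`python_meets_minimum` passes the required major/minor as script arguments and lets each candidate interpreter compare its own `sys.version_info`, exiting 0 or 1. The comparison relies on Python's tuple ordering, which is component-wise, so 3.9 correctly sorts below 3.10. The core check in isolation:

```python
# Same comparison the heredoc performs inside each candidate interpreter:
# tuple ordering is component-wise, so (3, 9) < (3, 10) numerically,
# not lexically as a string comparison ("3.9" > "3.10") would be.
def meets_minimum(version_info, major=3, minor=10):
    return tuple(version_info[:2]) >= (major, minor)

assert meets_minimum((3, 12, 1))
assert meets_minimum((3, 10, 0))
assert not meets_minimum((3, 9, 18))
```

Delegating the check to the candidate binary also sidesteps parsing `python3 --version` output, which varies across builds.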
scripts/start_cnclip_service.sh
... ... @@ -61,7 +61,7 @@ LOG_DIR="${PROJECT_ROOT}/logs"
61 61 PID_FILE="${LOG_DIR}/cnclip.pid"
62 62 LOG_LINK="${LOG_DIR}/cnclip.log"
63 63 LOG_FILE="${LOG_DIR}/cnclip-$(date +%F).log"
64   -LOG_ROUTER_SCRIPT="${PROJECT_ROOT}/scripts/daily_log_router.sh"
  64 +LOG_ROUTER_SCRIPT="${PROJECT_ROOT}/scripts/ops/daily_log_router.sh"
65 65  
66 66 # Help message
67 67 show_help() {
... ...
scripts/start_frontend.sh
... ... @@ -27,4 +27,4 @@ echo -e " ${GREEN}http://localhost:${API_PORT}${NC}"
27 27 echo ""
28 28  
29 29 export FRONTEND_PORT API_PORT PORT
30   -exec python scripts/frontend_server.py
  30 +exec python scripts/frontend/frontend_server.py
... ...
scripts/translation/download_translation_models.py 0 → 100755
... ... @@ -0,0 +1,100 @@
  1 +#!/usr/bin/env python3
  2 +"""Download local translation models declared in services.translation.capabilities."""
  3 +
  4 +from __future__ import annotations
  5 +
  6 +import argparse
  7 +import os
  8 +from pathlib import Path
  9 +import sys
  10 +from typing import Iterable
  11 +
  12 +from huggingface_hub import snapshot_download
  13 +
  14 +PROJECT_ROOT = Path(__file__).resolve().parents[2]
  15 +if str(PROJECT_ROOT) not in sys.path:
  16 + sys.path.insert(0, str(PROJECT_ROOT))
  17 +os.environ.setdefault("HF_HUB_DISABLE_XET", "1")
  18 +
  19 +from config.services_config import get_translation_config
  20 +from translation.ct2_conversion import convert_transformers_model
  21 +
  22 +
  23 +LOCAL_BACKENDS = {"local_nllb", "local_marian"}
  24 +
  25 +
  26 +def iter_local_capabilities(selected: set[str] | None = None) -> Iterable[tuple[str, dict]]:
  27 + cfg = get_translation_config()
  28 + capabilities = cfg.get("capabilities", {}) if isinstance(cfg, dict) else {}
  29 + for name, capability in capabilities.items():
  30 + backend = str(capability.get("backend") or "").strip().lower()
  31 + if backend not in LOCAL_BACKENDS:
  32 + continue
  33 + if selected and name not in selected:
  34 + continue
  35 + yield name, capability
  36 +
  37 +
  38 +def _compute_ct2_output_dir(capability: dict) -> Path:
  39 + custom = str(capability.get("ct2_model_dir") or "").strip()
  40 + if custom:
  41 + return Path(custom).expanduser()
  42 + model_dir = Path(str(capability.get("model_dir") or "")).expanduser()
  43 + compute_type = str(capability.get("ct2_compute_type") or capability.get("torch_dtype") or "default").strip().lower()
  44 + normalized = compute_type.replace("_", "-")
  45 + return model_dir / f"ctranslate2-{normalized}"
  46 +
  47 +
  48 +def convert_to_ctranslate2(name: str, capability: dict) -> None:
  49 + model_id = str(capability.get("model_id") or "").strip()
  50 + model_dir = Path(str(capability.get("model_dir") or "")).expanduser()
  51 + model_source = str(model_dir if model_dir.exists() else model_id)
  52 + output_dir = _compute_ct2_output_dir(capability)
  53 + if (output_dir / "model.bin").exists():
  54 + print(f"[skip-convert] {name} -> {output_dir}")
  55 + return
  56 + quantization = str(
  57 + capability.get("ct2_conversion_quantization")
  58 + or capability.get("ct2_compute_type")
  59 + or capability.get("torch_dtype")
  60 + or "default"
  61 + ).strip()
  62 + output_dir.parent.mkdir(parents=True, exist_ok=True)
  63 + print(f"[convert] {name} -> {output_dir} ({quantization})")
  64 + convert_transformers_model(model_source, str(output_dir), quantization)
  65 + print(f"[converted] {name}")
  66 +
  67 +
  68 +def main() -> None:
  69 + parser = argparse.ArgumentParser(description="Download local translation models")
  70 + parser.add_argument("--all-local", action="store_true", help="Download all configured local translation models")
  71 + parser.add_argument("--models", nargs="*", default=[], help="Specific capability names to download")
  72 + parser.add_argument(
  73 + "--convert-ctranslate2",
  74 + action="store_true",
  75 + help="Also convert the downloaded Hugging Face models into CTranslate2 format",
  76 + )
  77 + args = parser.parse_args()
  78 +
  79 + selected = {item.strip().lower() for item in args.models if item.strip()} or None
  80 + if not args.all_local and not selected:
  81 + parser.error("pass --all-local or --models <name> ...")
  82 +
  83 + for name, capability in iter_local_capabilities(selected):
  84 + model_id = str(capability.get("model_id") or "").strip()
  85 + model_dir = Path(str(capability.get("model_dir") or "")).expanduser()
  86 + if not model_id or not model_dir:
  87 + raise ValueError(f"Capability '{name}' must define model_id and model_dir")
  88 + model_dir.parent.mkdir(parents=True, exist_ok=True)
  89 + print(f"[download] {name} -> {model_dir} ({model_id})")
  90 + snapshot_download(
  91 + repo_id=model_id,
  92 + local_dir=str(model_dir),
  93 + )
  94 + print(f"[done] {name}")
  95 + if args.convert_ctranslate2:
  96 + convert_to_ctranslate2(name, capability)
  97 +
  98 +
  99 +if __name__ == "__main__":
  100 + main()
... ...
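`_compute_ct2_output_dir` encodes a few precedence rules worth spelling out: an explicit `ct2_model_dir` wins outright; otherwise the CTranslate2 output lands under `model_dir` in a `ctranslate2-<compute-type>` subdirectory, with `ct2_compute_type` falling back to `torch_dtype` and underscores normalized to hyphens. A standalone restatement (the capability dicts are illustrative):

```python
from pathlib import Path

# Restatement of _compute_ct2_output_dir's precedence:
# 1) explicit ct2_model_dir wins;
# 2) else model_dir / "ctranslate2-<compute-type>" with "_" -> "-".
def ct2_output_dir(capability):
    custom = str(capability.get("ct2_model_dir") or "").strip()
    if custom:
        return Path(custom).expanduser()
    model_dir = Path(str(capability.get("model_dir") or "")).expanduser()
    compute_type = str(
        capability.get("ct2_compute_type")
        or capability.get("torch_dtype")
        or "default"
    ).strip().lower()
    return model_dir / f"ctranslate2-{compute_type.replace('_', '-')}"

print(ct2_output_dir({"model_dir": "/models/nllb", "ct2_compute_type": "int8_float16"}))
```

The `model.bin` existence check in `convert_to_ctranslate2` then makes re-runs idempotent: already-converted models are skipped.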
search/es_query_builder.py
... ... @@ -8,6 +8,7 @@ Simplified architecture:
8 8 - function_score wrapper for boosting fields
9 9 """
10 10  
  11 +from dataclasses import dataclass
11 12 from typing import Dict, Any, List, Optional, Tuple
12 13  
13 14 import numpy as np
... ... @@ -114,6 +115,171 @@ class ESQueryBuilder:
114 115 self.phrase_match_tie_breaker = float(phrase_match_tie_breaker)
115 116 self.phrase_match_boost = float(phrase_match_boost)
116 117  
  118 + @dataclass(frozen=True)
  119 + class KNNClausePlan:
  120 + field: str
  121 + boost: float
  122 + k: Optional[int] = None
  123 + num_candidates: Optional[int] = None
  124 + nested_path: Optional[str] = None
  125 +
  126 + @staticmethod
  127 + def _vector_to_list(vector: Any) -> List[float]:
  128 + if vector is None:
  129 + return []
  130 + if hasattr(vector, "tolist"):
  131 + values = vector.tolist()
  132 + else:
  133 + values = list(vector)
  134 + return [float(v) for v in values]
  135 +
  136 + @staticmethod
  137 + def _query_token_count(parsed_query: Optional[Any]) -> int:
  138 + if parsed_query is None:
  139 + return 0
  140 + query_tokens = getattr(parsed_query, "query_tokens", None) or []
  141 + return len(query_tokens)
  142 +
  143 + def get_text_knn_plan(self, parsed_query: Optional[Any] = None) -> Optional[KNNClausePlan]:
  144 + if not self.text_embedding_field:
  145 + return None
  146 + boost = self.knn_text_boost
  147 + final_knn_k = self.knn_text_k
  148 + final_knn_num_candidates = self.knn_text_num_candidates
  149 + if self._query_token_count(parsed_query) >= 5:
  150 + final_knn_k = self.knn_text_k_long
  151 + final_knn_num_candidates = self.knn_text_num_candidates_long
  152 + boost = self.knn_text_boost * 1.4
  153 + return self.KNNClausePlan(
  154 + field=str(self.text_embedding_field),
  155 + boost=float(boost),
  156 + k=int(final_knn_k),
  157 + num_candidates=int(final_knn_num_candidates),
  158 + )
  159 +
  160 + def get_image_knn_plan(self) -> Optional[KNNClausePlan]:
  161 + if not self.image_embedding_field:
  162 + return None
  163 + nested_path, _, _ = str(self.image_embedding_field).rpartition(".")
  164 + return self.KNNClausePlan(
  165 + field=str(self.image_embedding_field),
  166 + boost=float(self.knn_image_boost),
  167 + k=int(self.knn_image_k),
  168 + num_candidates=int(self.knn_image_num_candidates),
  169 + nested_path=nested_path or None,
  170 + )
  171 +
  172 + def build_text_knn_clause(
  173 + self,
  174 + query_vector: Any,
  175 + *,
  176 + parsed_query: Optional[Any] = None,
  177 + query_name: str = "knn_query",
  178 + ) -> Optional[Dict[str, Any]]:
  179 + plan = self.get_text_knn_plan(parsed_query)
  180 + if plan is None or query_vector is None:
  181 + return None
  182 + return {
  183 + "knn": {
  184 + "field": plan.field,
  185 + "query_vector": self._vector_to_list(query_vector),
  186 + "k": plan.k,
  187 + "num_candidates": plan.num_candidates,
  188 + "boost": plan.boost,
  189 + "_name": query_name,
  190 + }
  191 + }
  192 +
  193 + def build_image_knn_clause(
  194 + self,
  195 + image_query_vector: Any,
  196 + *,
  197 + query_name: str = "image_knn_query",
  198 + ) -> Optional[Dict[str, Any]]:
  199 + plan = self.get_image_knn_plan()
  200 + if plan is None or image_query_vector is None:
  201 + return None
  202 + image_knn_query = {
  203 + "field": plan.field,
  204 + "query_vector": self._vector_to_list(image_query_vector),
  205 + "k": plan.k,
  206 + "num_candidates": plan.num_candidates,
  207 + "boost": plan.boost,
  208 + }
  209 + if plan.nested_path:
  210 + return {
  211 + "nested": {
  212 + "path": plan.nested_path,
  213 + "_name": query_name,
  214 + "query": {"knn": image_knn_query},
  215 + "score_mode": "max",
  216 + }
  217 + }
  218 + return {
  219 + "knn": {
  220 + **image_knn_query,
  221 + "_name": query_name,
  222 + }
  223 + }
  224 +
  225 + def build_exact_text_knn_rescore_clause(
  226 + self,
  227 + query_vector: Any,
  228 + *,
  229 + parsed_query: Optional[Any] = None,
  230 + query_name: str = "exact_text_knn_query",
  231 + ) -> Optional[Dict[str, Any]]:
  232 + plan = self.get_text_knn_plan(parsed_query)
  233 + if plan is None or query_vector is None:
  234 + return None
  235 + return {
  236 + "script_score": {
  237 + "_name": query_name,
  238 + "query": {"exists": {"field": plan.field}},
  239 + "script": {
  240 + "source": (
  241 + f"((dotProduct(params.query_vector, '{plan.field}') + 1.0) / 2.0) * params.boost"
  242 + ),
  243 + "params": {
  244 + "query_vector": self._vector_to_list(query_vector),
  245 + "boost": float(plan.boost),
  246 + },
  247 + },
  248 + }
  249 + }
  250 +
  251 + def build_exact_image_knn_rescore_clause(
  252 + self,
  253 + image_query_vector: Any,
  254 + *,
  255 + query_name: str = "exact_image_knn_query",
  256 + ) -> Optional[Dict[str, Any]]:
  257 + plan = self.get_image_knn_plan()
  258 + if plan is None or image_query_vector is None:
  259 + return None
  260 + script_score_query = {
  261 + "query": {"exists": {"field": plan.field}},
  262 + "script": {
  263 + "source": (
  264 + f"((dotProduct(params.query_vector, '{plan.field}') + 1.0) / 2.0) * params.boost"
  265 + ),
  266 + "params": {
  267 + "query_vector": self._vector_to_list(image_query_vector),
  268 + "boost": float(plan.boost),
  269 + },
  270 + },
  271 + }
  272 + if plan.nested_path:
  273 + return {
  274 + "nested": {
  275 + "path": plan.nested_path,
  276 + "_name": query_name,
  277 + "score_mode": "max",
  278 + "query": {"script_score": script_score_query},
  279 + }
  280 + }
  281 + return {"script_score": {"_name": query_name, **script_score_query}}
  282 +
117 283 def _apply_source_filter(self, es_query: Dict[str, Any]) -> None:
118 284 """
119 285 Apply tri-state _source semantics:
... ... @@ -250,52 +416,21 @@ class ESQueryBuilder:
250 416 # 3. Add KNN search clauses alongside lexical clauses under the same bool.should
251 417 # Text KNN: k / num_candidates from config; long queries use *_long and higher boost
252 418 if has_embedding:
253   - text_knn_boost = self.knn_text_boost
254   - final_knn_k = self.knn_text_k
255   - final_knn_num_candidates = self.knn_text_num_candidates
256   - if parsed_query:
257   - query_tokens = getattr(parsed_query, 'query_tokens', None) or []
258   - token_count = len(query_tokens)
259   - if token_count >= 5:
260   - final_knn_k = self.knn_text_k_long
261   - final_knn_num_candidates = self.knn_text_num_candidates_long
262   - text_knn_boost = self.knn_text_boost * 1.4
263   - recall_clauses.append({
264   - "knn": {
265   - "field": self.text_embedding_field,
266   - "query_vector": query_vector.tolist(),
267   - "k": final_knn_k,
268   - "num_candidates": final_knn_num_candidates,
269   - "boost": text_knn_boost,
270   - "_name": "knn_query",
271   - }
272   - })
  419 + text_knn_clause = self.build_text_knn_clause(
  420 + query_vector,
  421 + parsed_query=parsed_query,
  422 + query_name="knn_query",
  423 + )
  424 + if text_knn_clause:
  425 + recall_clauses.append(text_knn_clause)
273 426  
274 427 if has_image_embedding:
275   - nested_path, _, _ = str(self.image_embedding_field).rpartition(".")
276   - image_knn_query = {
277   - "field": self.image_embedding_field,
278   - "query_vector": image_query_vector.tolist(),
279   - "k": self.knn_image_k,
280   - "num_candidates": self.knn_image_num_candidates,
281   - "boost": self.knn_image_boost,
282   - }
283   - if nested_path:
284   - recall_clauses.append({
285   - "nested": {
286   - "path": nested_path,
287   - "_name": "image_knn_query",
288   - "query": {"knn": image_knn_query},
289   - "score_mode": "max",
290   - }
291   - })
292   - else:
293   - recall_clauses.append({
294   - "knn": {
295   - **image_knn_query,
296   - "_name": "image_knn_query",
297   - }
298   - })
  428 + image_knn_clause = self.build_image_knn_clause(
  429 + image_query_vector,
  430 + query_name="image_knn_query",
  431 + )
  432 + if image_knn_clause:
  433 + recall_clauses.append(image_knn_clause)
299 434  
300 435 # 4. Build main query structure: filters and recall
301 436 if recall_clauses:
... ...
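The exact-rescore clauses above score documents with `((dotProduct(q, field) + 1.0) / 2.0) * boost`. For unit-normalized embeddings the dot product equals cosine similarity in [-1, 1], so this affine map shifts it into [0, 1] and satisfies Elasticsearch's requirement that `script_score` never go negative. The arithmetic, spelled out:

```python
# Python equivalent of the Painless rescore script:
# map dot-product similarity from [-1, 1] into [0, 1], then apply boost.
def exact_knn_score(query_vec, doc_vec, boost):
    dot = sum(q * d for q, d in zip(query_vec, doc_vec))
    return ((dot + 1.0) / 2.0) * boost

assert exact_knn_score([1.0, 0.0], [1.0, 0.0], boost=2.0) == 2.0   # identical vectors
assert exact_knn_score([1.0, 0.0], [-1.0, 0.0], boost=2.0) == 0.0  # opposite vectors
assert exact_knn_score([1.0, 0.0], [0.0, 1.0], boost=2.0) == 1.0   # orthogonal vectors
```

This is also why the rerank side can treat `exact_*_knn_query` and the approximate `knn_query` scores as comparable: both end up in a boost-scaled [0, boost] range.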
search/rerank_client.py
... ... @@ -153,12 +153,59 @@ def _extract_named_query_score(matched_queries: Any, name: str) -> float:
153 153 return 0.0
154 154  
155 155  
  156 +def _resolve_named_query_score(
  157 + matched_queries: Any,
  158 + *,
  159 + preferred_names: List[str],
  160 + fallback_names: List[str],
  161 +) -> Tuple[float, Optional[str], float, Optional[str]]:
  162 + preferred_score = 0.0
  163 + preferred_name: Optional[str] = None
  164 + for name in preferred_names:
  165 + score = _extract_named_query_score(matched_queries, name)
  166 + if score > 0.0:
  167 + preferred_score = score
  168 + preferred_name = name
  169 + break
  170 +
  171 + fallback_score = 0.0
  172 + fallback_name: Optional[str] = None
  173 + for name in fallback_names:
  174 + score = _extract_named_query_score(matched_queries, name)
  175 + if score > 0.0:
  176 + fallback_score = score
  177 + fallback_name = name
  178 + break
  179 +
  180 + if preferred_name is None and preferred_names:
  181 + preferred_name = preferred_names[0]
  182 + preferred_score = _extract_named_query_score(matched_queries, preferred_name)
  183 + if fallback_name is None and fallback_names:
  184 + fallback_name = fallback_names[0]
  185 + fallback_score = _extract_named_query_score(matched_queries, fallback_name)
  186 + if preferred_score > 0.0:
  187 + return preferred_score, preferred_name, fallback_score, fallback_name
  188 + return fallback_score, fallback_name, preferred_score, preferred_name
  189 +
  190 +
156 191 def _collect_knn_score_components(
157 192 matched_queries: Any,
158 193 fusion: RerankFusionConfig,
159 194 ) -> Dict[str, float]:
160   - text_knn_score = _extract_named_query_score(matched_queries, "knn_query")
161   - image_knn_score = _extract_named_query_score(matched_queries, "image_knn_query")
  195 + text_knn_score, text_knn_source, _, _ = _resolve_named_query_score(
  196 + matched_queries,
  197 + preferred_names=["exact_text_knn_query"],
  198 + fallback_names=["knn_query"],
  199 + )
  200 + image_knn_score, image_knn_source, _, _ = _resolve_named_query_score(
  201 + matched_queries,
  202 + preferred_names=["exact_image_knn_query"],
  203 + fallback_names=["image_knn_query"],
  204 + )
  205 + exact_text_knn_score = _extract_named_query_score(matched_queries, "exact_text_knn_query")
  206 + exact_image_knn_score = _extract_named_query_score(matched_queries, "exact_image_knn_query")
  207 + approx_text_knn_score = _extract_named_query_score(matched_queries, "knn_query")
  208 + approx_image_knn_score = _extract_named_query_score(matched_queries, "image_knn_query")
162 209  
163 210 weighted_text_knn_score = text_knn_score * float(fusion.knn_text_weight)
164 211 weighted_image_knn_score = image_knn_score * float(fusion.knn_image_weight)
... ... @@ -171,6 +218,14 @@ def _collect_knn_score_components(
171 218 return {
172 219 "text_knn_score": text_knn_score,
173 220 "image_knn_score": image_knn_score,
  221 + "exact_text_knn_score": exact_text_knn_score,
  222 + "exact_image_knn_score": exact_image_knn_score,
  223 + "approx_text_knn_score": approx_text_knn_score,
  224 + "approx_image_knn_score": approx_image_knn_score,
  225 + "text_knn_source": text_knn_source,
  226 + "image_knn_source": image_knn_source,
  227 + "approx_text_knn_source": "knn_query",
  228 + "approx_image_knn_source": "image_knn_query",
174 229 "weighted_text_knn_score": weighted_text_knn_score,
175 230 "weighted_image_knn_score": weighted_image_knn_score,
176 231 "primary_knn_score": primary_knn_score,
... ... @@ -322,6 +377,10 @@ def _build_ltr_feature_block(
322 377 "text_support_score": float(text_components["support_text_score"]),
323 378 "text_knn_score": text_knn_score,
324 379 "image_knn_score": image_knn_score,
  380 + "exact_text_knn_score": float(knn_components["exact_text_knn_score"]),
  381 + "exact_image_knn_score": float(knn_components["exact_image_knn_score"]),
  382 + "approx_text_knn_score": float(knn_components["approx_text_knn_score"]),
  383 + "approx_image_knn_score": float(knn_components["approx_image_knn_score"]),
325 384 "knn_primary_score": float(knn_components["primary_knn_score"]),
326 385 "knn_support_score": float(knn_components["support_knn_score"]),
327 386 "has_text_match": source_score > 0.0,
... ... @@ -337,12 +396,50 @@ def _build_ltr_feature_block(
337 396 }
338 397  
339 398  
  399 +def _maybe_append_weighted_knn_terms(
  400 + *,
  401 + term_rows: List[Dict[str, Any]],
  402 + fusion: CoarseRankFusionConfig | RerankFusionConfig,
  403 + knn_components: Optional[Dict[str, Any]],
  404 +) -> None:
  405 + if not knn_components:
  406 + return
  407 +
  408 + weighted_text_knn_score = _to_score(knn_components.get("weighted_text_knn_score"))
  409 + weighted_image_knn_score = _to_score(knn_components.get("weighted_image_knn_score"))
  410 +
  411 + if float(getattr(fusion, "knn_text_exponent", 0.0)) != 0.0:
  412 + text_bias = float(getattr(fusion, "knn_text_bias", fusion.knn_bias))
  413 + term_rows.append(
  414 + {
  415 + "name": "weighted_text_knn_score",
  416 + "raw_score": weighted_text_knn_score,
  417 + "bias": text_bias,
  418 + "exponent": float(fusion.knn_text_exponent),
  419 + "factor": (max(weighted_text_knn_score, 0.0) + text_bias) ** float(fusion.knn_text_exponent),
  420 + }
  421 + )
  422 + if float(getattr(fusion, "knn_image_exponent", 0.0)) != 0.0:
  423 + image_bias = float(getattr(fusion, "knn_image_bias", fusion.knn_bias))
  424 + term_rows.append(
  425 + {
  426 + "name": "weighted_image_knn_score",
  427 + "raw_score": weighted_image_knn_score,
  428 + "bias": image_bias,
  429 + "exponent": float(fusion.knn_image_exponent),
  430 + "factor": (max(weighted_image_knn_score, 0.0) + image_bias)
  431 + ** float(fusion.knn_image_exponent),
  432 + }
  433 + )
  434 +
  435 +
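Each per-modality term appended above contributes a factor of the form `(max(score, 0) + bias) ** exponent`, the same shape as the other fusion terms. An illustrative helper (not from this repo) showing why an exponent of `0.0` lets `_maybe_append_weighted_knn_terms` skip the term entirely:

```python
# Illustrative factor helper; name and values are made up for the sketch.
def knn_factor(score: float, bias: float, exponent: float) -> float:
    # Negative or missing similarities are clamped so the factor stays non-negative.
    return (max(score, 0.0) + bias) ** exponent

# exponent == 0.0 yields exactly 1.0 (multiplicatively neutral), which is why
# the helper above appends no term in that case.
assert knn_factor(0.7, 0.0, 0.0) == 1.0
assert knn_factor(-0.3, 0.5, 1.0) == 0.5  # clamped to (0.0 + 0.5) ** 1.0
assert knn_factor(0.9, 0.1, 1.0) > knn_factor(0.2, 0.1, 1.0)
```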
340 436 def _compute_multiplicative_fusion(
341 437 *,
342 438 es_score: float,
343 439 text_score: float,
344 440 knn_score: float,
345 441 fusion: RerankFusionConfig,
  442 + knn_components: Optional[Dict[str, Any]] = None,
346 443 rerank_score: Optional[float] = None,
347 444 fine_score: Optional[float] = None,
348 445 style_boost: float = 1.0,
... ... @@ -368,6 +465,7 @@ def _compute_multiplicative_fusion(
368 465 _add_term("fine_score", fine_score, fusion.fine_bias, fusion.fine_exponent)
369 466 _add_term("text_score", text_score, fusion.text_bias, fusion.text_exponent)
370 467 _add_term("knn_score", knn_score, fusion.knn_bias, fusion.knn_exponent)
  468 + _maybe_append_weighted_knn_terms(term_rows=term_rows, fusion=fusion, knn_components=knn_components)
371 469  
372 470 fused = 1.0
373 471 factors: Dict[str, float] = {}
... ... @@ -391,12 +489,30 @@ def _multiply_coarse_fusion_factors(
391 489 es_score: float,
392 490 text_score: float,
393 491 knn_score: float,
  492 + knn_components: Dict[str, Any],
394 493 fusion: CoarseRankFusionConfig,
395   -) -> Tuple[float, float, float, float]:
  494 +) -> Tuple[float, float, float, float, float, float]:
396 495 es_factor = (max(es_score, 0.0) + fusion.es_bias) ** fusion.es_exponent
397 496 text_factor = (max(text_score, 0.0) + fusion.text_bias) ** fusion.text_exponent
398 497 knn_factor = (max(knn_score, 0.0) + fusion.knn_bias) ** fusion.knn_exponent
399   - return es_factor, text_factor, knn_factor, es_factor * text_factor * knn_factor
  498 + text_knn_bias = float(getattr(fusion, "knn_text_bias", fusion.knn_bias))
  499 + image_knn_bias = float(getattr(fusion, "knn_image_bias", fusion.knn_bias))
  500 + text_knn_factor = (
  501 + (max(_to_score(knn_components.get("weighted_text_knn_score")), 0.0) + text_knn_bias)
  502 + ** float(getattr(fusion, "knn_text_exponent", 0.0))
  503 + )
  504 + image_knn_factor = (
  505 + (max(_to_score(knn_components.get("weighted_image_knn_score")), 0.0) + image_knn_bias)
  506 + ** float(getattr(fusion, "knn_image_exponent", 0.0))
  507 + )
  508 + return (
  509 + es_factor,
  510 + text_factor,
  511 + knn_factor,
  512 + text_knn_factor,
  513 + image_knn_factor,
  514 + es_factor * text_factor * knn_factor * text_knn_factor * image_knn_factor,
  515 + )
400 516  
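The coarse score is now the product of five factors. A numeric sketch with made-up biases and exponents (not this service's configuration) showing that the two new per-modality factors stay neutral until their exponents are configured non-zero:

```python
# Made-up inputs; mirrors the arithmetic of _multiply_coarse_fusion_factors.
es_factor = (max(12.0, 0.0) + 1.0) ** 0.5        # ES signal, dampened
text_factor = (max(0.8, 0.0) + 1.0) ** 1.0       # text relevance
knn_factor = (max(0.6, 0.0) + 1.0) ** 1.0        # combined kNN signal
text_knn_factor = (max(0.5, 0.0) + 0.0) ** 0.0   # exponent 0 -> neutral 1.0
image_knn_factor = (max(0.4, 0.0) + 0.0) ** 0.0  # exponent 0 -> neutral 1.0
coarse_score = es_factor * text_factor * knn_factor * text_knn_factor * image_knn_factor

assert text_knn_factor == 1.0 and image_knn_factor == 1.0
# With neutral per-modality factors, the product reduces to the original three terms.
assert abs(coarse_score - es_factor * text_factor * knn_factor) < 1e-12
```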
401 517  
402 518 def _has_selected_sku(hit: Dict[str, Any]) -> bool:
... ... @@ -422,10 +538,18 @@ def coarse_resort_hits(
422 538 knn_components = signal_bundle["knn_components"]
423 539 text_score = signal_bundle["text_score"]
424 540 knn_score = signal_bundle["knn_score"]
425   - es_factor, text_factor, knn_factor, coarse_score = _multiply_coarse_fusion_factors(
  541 + (
  542 + es_factor,
  543 + text_factor,
  544 + knn_factor,
  545 + text_knn_factor,
  546 + image_knn_factor,
  547 + coarse_score,
  548 + ) = _multiply_coarse_fusion_factors(
426 549 es_score=es_score,
427 550 text_score=text_score,
428 551 knn_score=knn_score,
  552 + knn_components=knn_components,
429 553 fusion=f,
430 554 )
431 555  
... ... @@ -433,6 +557,8 @@ def coarse_resort_hits(
433 557 hit["_knn_score"] = knn_score
434 558 hit["_text_knn_score"] = knn_components["text_knn_score"]
435 559 hit["_image_knn_score"] = knn_components["image_knn_score"]
  560 + hit["_exact_text_knn_score"] = knn_components["exact_text_knn_score"]
  561 + hit["_exact_image_knn_score"] = knn_components["exact_image_knn_score"]
436 562 hit["_coarse_score"] = coarse_score
437 563  
438 564 if debug:
... ... @@ -460,6 +586,12 @@ def coarse_resort_hits(
460 586 ),
461 587 "text_knn_score": knn_components["text_knn_score"],
462 588 "image_knn_score": knn_components["image_knn_score"],
  589 + "exact_text_knn_score": knn_components["exact_text_knn_score"],
  590 + "exact_image_knn_score": knn_components["exact_image_knn_score"],
  591 + "approx_text_knn_score": knn_components["approx_text_knn_score"],
  592 + "approx_image_knn_score": knn_components["approx_image_knn_score"],
  593 + "text_knn_source": knn_components["text_knn_source"],
  594 + "image_knn_source": knn_components["image_knn_source"],
463 595 "weighted_text_knn_score": knn_components["weighted_text_knn_score"],
464 596 "weighted_image_knn_score": knn_components["weighted_image_knn_score"],
465 597 "knn_primary_score": knn_components["primary_knn_score"],
... ... @@ -468,6 +600,8 @@ def coarse_resort_hits(
468 600 "coarse_es_factor": es_factor,
469 601 "coarse_text_factor": text_factor,
470 602 "coarse_knn_factor": knn_factor,
  603 + "coarse_text_knn_factor": text_knn_factor,
  604 + "coarse_image_knn_factor": image_knn_factor,
471 605 "coarse_score": coarse_score,
472 606 "matched_queries": matched_queries,
473 607 "ltr_features": ltr_features,
... ... @@ -509,7 +643,7 @@ def fuse_scores_and_resort(
509 643 - _rerank_score: score returned by the rerank service
510 644 - _fused_score: fused score
511 645 - _text_score: text relevance score (prefers the base_query score from named queries)
512   - - _knn_score: KNN score (prefers the knn_query score from named queries)
  646 + - _knn_score: KNN score (prefers exact named queries, falling back to ANN named queries when missing)
513 647  
514 648 Args:
515 649 es_hits: ES hits 列表(会被原地修改)
... ... @@ -545,6 +679,7 @@ def fuse_scores_and_resort(
545 679 text_score=text_score,
546 680 knn_score=knn_score,
547 681 fusion=f,
  682 + knn_components=knn_components,
548 683 style_boost=style_boost,
549 684 )
550 685 fused = fusion_result["score"]
... ... @@ -557,6 +692,8 @@ def fuse_scores_and_resort(
557 692 hit["_knn_score"] = knn_score
558 693 hit["_text_knn_score"] = knn_components["text_knn_score"]
559 694 hit["_image_knn_score"] = knn_components["image_knn_score"]
  695 + hit["_exact_text_knn_score"] = knn_components["exact_text_knn_score"]
  696 + hit["_exact_image_knn_score"] = knn_components["exact_image_knn_score"]
560 697 hit["_fused_score"] = fused
561 698 hit["_style_intent_selected_sku_boost"] = style_boost
562 699  
... ... @@ -589,6 +726,12 @@ def fuse_scores_and_resort(
589 726 "text_support_score": text_components["support_text_score"],
590 727 "text_knn_score": knn_components["text_knn_score"],
591 728 "image_knn_score": knn_components["image_knn_score"],
  729 + "exact_text_knn_score": knn_components["exact_text_knn_score"],
  730 + "exact_image_knn_score": knn_components["exact_image_knn_score"],
  731 + "approx_text_knn_score": knn_components["approx_text_knn_score"],
  732 + "approx_image_knn_score": knn_components["approx_image_knn_score"],
  733 + "text_knn_source": knn_components["text_knn_source"],
  734 + "image_knn_source": knn_components["image_knn_source"],
592 735 "weighted_text_knn_score": knn_components["weighted_text_knn_score"],
593 736 "weighted_image_knn_score": knn_components["weighted_image_knn_score"],
594 737 "knn_primary_score": knn_components["primary_knn_score"],
... ... @@ -603,6 +746,8 @@ def fuse_scores_and_resort(
603 746 "es_factor": fusion_result["factors"].get("es_score"),
604 747 "text_factor": fusion_result["factors"].get("text_score"),
605 748 "knn_factor": fusion_result["factors"].get("knn_score"),
  749 + "text_knn_factor": fusion_result["factors"].get("weighted_text_knn_score"),
  750 + "image_knn_factor": fusion_result["factors"].get("weighted_image_knn_score"),
606 751 "style_intent_selected_sku": sku_selected,
607 752 "style_intent_selected_sku_boost": style_boost,
608 753 "matched_queries": signal_bundle["matched_queries"],
... ... @@ -735,6 +880,7 @@ def run_lightweight_rerank(
735 880 text_score=text_score,
736 881 knn_score=knn_score,
737 882 fusion=f,
  883 + knn_components=signal_bundle["knn_components"],
738 884 style_boost=style_boost,
739 885 )
740 886  
... ... @@ -744,6 +890,8 @@ def run_lightweight_rerank(
744 890 hit["_knn_score"] = knn_score
745 891 hit["_text_knn_score"] = signal_bundle["knn_components"]["text_knn_score"]
746 892 hit["_image_knn_score"] = signal_bundle["knn_components"]["image_knn_score"]
  893 + hit["_exact_text_knn_score"] = signal_bundle["knn_components"]["exact_text_knn_score"]
  894 + hit["_exact_image_knn_score"] = signal_bundle["knn_components"]["exact_image_knn_score"]
747 895 hit["_style_intent_selected_sku_boost"] = style_boost
748 896  
749 897 if debug:
... ... @@ -769,6 +917,8 @@ def run_lightweight_rerank(
769 917 "es_factor": fusion_result["factors"].get("es_score"),
770 918 "text_factor": fusion_result["factors"].get("text_score"),
771 919 "knn_factor": fusion_result["factors"].get("knn_score"),
  920 + "text_knn_factor": fusion_result["factors"].get("weighted_text_knn_score"),
  921 + "image_knn_factor": fusion_result["factors"].get("weighted_image_knn_score"),
772 922 "style_intent_selected_sku": sku_selected,
773 923 "style_intent_selected_sku_boost": style_boost,
774 924 "ltr_features": ltr_features,
... ...
search/searcher.py
... ... @@ -236,6 +236,81 @@ class Searcher:
236 236 return
237 237 es_query["_source"] = {"includes": self.source_fields}
238 238  
  239 + def _resolve_exact_knn_rescore_window(self) -> int:
  240 + configured = int(self.config.rerank.exact_knn_rescore_window)
  241 + if configured > 0:
  242 + return configured
  243 + return int(self.config.rerank.rerank_window)
  244 +
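`_resolve_exact_knn_rescore_window` above encodes a simple fallback rule; a standalone sketch of it (the helper name here is illustrative, the config field names are the repo's):

```python
# Illustrative mirror of _resolve_exact_knn_rescore_window.
def resolve_window(exact_knn_rescore_window: int, rerank_window: int) -> int:
    # A positive explicit window wins; otherwise reuse the rerank window.
    return exact_knn_rescore_window if exact_knn_rescore_window > 0 else rerank_window

assert resolve_window(200, 64) == 200
assert resolve_window(0, 64) == 64
assert resolve_window(-1, 64) == 64
```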
  245 + def _build_exact_knn_rescore(
  246 + self,
  247 + *,
  248 + query_vector: Any,
  249 + image_query_vector: Any,
  250 + parsed_query: Optional[ParsedQuery] = None,
  251 + ) -> Optional[Dict[str, Any]]:
  252 + clauses: List[Dict[str, Any]] = []
  253 +
  254 + text_clause = self.query_builder.build_exact_text_knn_rescore_clause(
  255 + query_vector,
  256 + parsed_query=parsed_query,
  257 + query_name="exact_text_knn_query",
  258 + )
  259 + if text_clause:
  260 + clauses.append(text_clause)
  261 +
  262 + image_clause = self.query_builder.build_exact_image_knn_rescore_clause(
  263 + image_query_vector,
  264 + query_name="exact_image_knn_query",
  265 + )
  266 + if image_clause:
  267 + clauses.append(image_clause)
  268 +
  269 + if not clauses:
  270 + return None
  271 +
  272 + return {
  273 + "window_size": self._resolve_exact_knn_rescore_window(),
  274 + "query": {
  275 + # Phase 1: only compute exact vector scores and expose them in matched_queries.
  276 + "score_mode": "total",
  277 + "query_weight": 1.0,
  278 + "rescore_query_weight": 0.0,
  279 + "rescore_query": {
  280 + "bool": {
  281 + "should": clauses,
  282 + "minimum_should_match": 1,
  283 + }
  284 + },
  285 + },
  286 + }
  287 +
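The `query_weight=1.0` / `rescore_query_weight=0.0` pair above is what makes this a score-only probe: with `score_mode: "total"`, Elasticsearch combines the two scores additively, so the exact-kNN clauses get scored (and surfaced via `matched_queries`) without moving the ranking. A sketch of that arithmetic:

```python
# Sketch of Elasticsearch rescore combination with score_mode="total".
def rescored(first_pass: float, rescore_score: float,
             query_weight: float = 1.0, rescore_query_weight: float = 0.0) -> float:
    return first_pass * query_weight + rescore_score * rescore_query_weight

# Phase 1: the first-pass score survives unchanged.
assert rescored(3.2, 0.91) == 3.2
# Flipping the weights would make the exact-kNN scores authoritative instead.
assert rescored(3.2, 0.91, query_weight=0.0, rescore_query_weight=1.0) == 0.91
```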
  288 + def _attach_exact_knn_rescore(
  289 + self,
  290 + es_query: Dict[str, Any],
  291 + *,
  292 + in_rank_window: bool,
  293 + query_vector: Any,
  294 + image_query_vector: Any,
  295 + parsed_query: Optional[ParsedQuery] = None,
  296 + ) -> None:
  297 + if not in_rank_window or not self.config.rerank.exact_knn_rescore_enabled:
  298 + return
  299 + rescore = self._build_exact_knn_rescore(
  300 + query_vector=query_vector,
  301 + image_query_vector=image_query_vector,
  302 + parsed_query=parsed_query,
  303 + )
  304 + if not rescore:
  305 + return
  306 + existing = es_query.get("rescore")
  307 + if existing is None:
  308 + es_query["rescore"] = rescore
  309 + elif isinstance(existing, list):
  310 + es_query["rescore"] = [*existing, rescore]
  311 + else:
  312 + es_query["rescore"] = [existing, rescore]
  313 +
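The merge at the end of `_attach_exact_knn_rescore` handles the fact that Elasticsearch accepts `rescore` as either a single object or a list. A standalone mirror of that logic: the first attach keeps a bare object, and any later attach promotes it to a list.

```python
# Standalone mirror of the rescore merge logic above (payloads are made up).
def attach_rescore(es_query: dict, rescore: dict) -> None:
    existing = es_query.get("rescore")
    if existing is None:
        es_query["rescore"] = rescore
    elif isinstance(existing, list):
        es_query["rescore"] = [*existing, rescore]
    else:
        es_query["rescore"] = [existing, rescore]

q: dict = {}
attach_rescore(q, {"window_size": 50})
assert q["rescore"] == {"window_size": 50}
attach_rescore(q, {"window_size": 100})
assert q["rescore"] == [{"window_size": 50}, {"window_size": 100}]
```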
239 314 def _resolve_rerank_source_filter(
240 315 self,
241 316 doc_template: str,
... ... @@ -401,7 +476,9 @@ class Searcher:
401 476 language: Response / field selection language hint (e.g. zh, en)
402 477 sku_filter_dimension: SKU grouping dimensions for per-SPU variant pick
403 478 enable_rerank: If None, use ``config.rerank.enabled``; if set, overrides
404   - whether the rerank provider is invoked (subject to rerank window).
  479 + whether the final rerank provider is invoked (subject to the rank window).
  480 + When false, the ranking pipeline still runs and the rerank stage becomes
  481 + a pass-through.
405 482 rerank_query_template: Override for rerank query text template; None uses
406 483 ``config.rerank.rerank_query_template`` (e.g. ``"{query}"``).
407 484 rerank_doc_template: Override for per-hit document text passed to rerank;
... ... @@ -430,15 +507,16 @@ class Searcher:
430 507 # Rerank toggle precedence: an explicit request value overrides the server-side config (enabled by default)
431 508 rerank_enabled_by_config = bool(rc.enabled)
432 509 do_rerank = rerank_enabled_by_config if enable_rerank is None else bool(enable_rerank)
  510 + fine_enabled = bool(fine_cfg.enabled)
433 511 rerank_window = rc.rerank_window
434 512 coarse_input_window = max(rerank_window, int(coarse_cfg.input_window))
435 513 coarse_output_window = max(rerank_window, int(coarse_cfg.output_window))
436 514 fine_input_window = max(rerank_window, int(fine_cfg.input_window))
437 515 fine_output_window = max(rerank_window, int(fine_cfg.output_window))
438   - # If rerank is enabled and the requested range is within the window: fetch the top rerank_window docs from ES, rerank, then paginate by from/size; otherwise skip rerank and query ES with the original from/size
439   - in_rerank_window = do_rerank and (from_ + size) <= rerank_window
440   - es_fetch_from = 0 if in_rerank_window else from_
441   - es_fetch_size = coarse_input_window if in_rerank_window else size
  516 + # The multi-stage ranking window is independent of the final rerank toggle: the coarse/fine pipeline still runs even when the final rerank is disabled.
  517 + in_rank_window = (from_ + size) <= rerank_window
  518 + es_fetch_from = 0 if in_rank_window else from_
  519 + es_fetch_size = coarse_input_window if in_rank_window else size
442 520  
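The paging rule above can be read as: inside the rank window, fetch the whole candidate pool from offset 0 and apply `from_`/`size` only after multi-stage ranking; outside it, let ES page as usual. A standalone sketch (helper name is illustrative):

```python
# Illustrative mirror of the in_rank_window / es_fetch_* computation above.
def resolve_fetch(from_: int, size: int, rerank_window: int,
                  coarse_input_window: int) -> tuple:
    in_rank_window = (from_ + size) <= rerank_window
    es_fetch_from = 0 if in_rank_window else from_
    es_fetch_size = coarse_input_window if in_rank_window else size
    return in_rank_window, es_fetch_from, es_fetch_size

# First page inside the window: pull the full coarse candidate pool.
assert resolve_fetch(0, 20, 100, 300) == (True, 0, 300)
# Deep page outside the window: defer paging to ES unchanged.
assert resolve_fetch(120, 20, 100, 300) == (False, 120, 20)
```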
443 521 es_score_normalization_factor: Optional[float] = None
444 522 initial_ranks_by_doc: Dict[str, int] = {}
... ... @@ -455,7 +533,8 @@ class Searcher:
455 533 context.logger.info(
456 534 f"Search request started | query: '{query}' | params: size={size}, from_={from_}, "
457 535 f"enable_rerank(request)={enable_rerank}, enable_rerank(config)={rerank_enabled_by_config}, "
458   - f"enable_rerank(effective)={do_rerank}, in_rerank_window={in_rerank_window}, "
  536 + f"fine_enabled(config)={fine_enabled}, "
  537 + f"enable_rerank(effective)={do_rerank}, in_rank_window={in_rank_window}, "
459 538 f"es_fetch=({es_fetch_from},{es_fetch_size}) | "
460 539 f"index_languages={index_langs} | "
461 540 f"enable_translation={enable_translation}, enable_embedding={enable_embedding}, min_score={min_score}",
... ... @@ -468,8 +547,9 @@ class Searcher:
468 547 'from_': from_,
469 548 'es_fetch_from': es_fetch_from,
470 549 'es_fetch_size': es_fetch_size,
471   - 'in_rerank_window': in_rerank_window,
  550 + 'in_rank_window': in_rank_window,
472 551 'rerank_enabled_by_config': rerank_enabled_by_config,
  552 + 'fine_enabled': fine_enabled,
473 553 'enable_rerank_request': enable_rerank,
474 554 'rerank_query_template': effective_query_template,
475 555 'rerank_doc_template': effective_doc_template,
... ... @@ -494,6 +574,7 @@ class Searcher:
494 574 context.metadata['feature_flags'] = {
495 575 'translation_enabled': enable_translation,
496 576 'embedding_enabled': enable_embedding,
  577 + 'fine_enabled': fine_enabled,
497 578 'rerank_enabled': do_rerank,
498 579 'style_intent_enabled': bool(self.style_intent_registry.enabled),
499 580 }
... ... @@ -526,7 +607,7 @@ class Searcher:
526 607 f"Language: {parsed_query.detected_language} | "
527 608 f"Keywords: {parsed_query.keywords_queries} | "
528 609 f"Text vector: {'yes' if parsed_query.query_vector is not None else 'no'} | "
529   - f"Image vector: {'yes' if getattr(parsed_query, 'image_query_vector', None) is not None else 'no'}",
  610 + f"Image vector: {'yes' if parsed_query.image_query_vector is not None else 'no'}",
530 611 extra={'reqid': context.reqid, 'uid': context.uid}
531 612 )
532 613 except Exception as e:
... ... @@ -545,17 +626,16 @@ class Searcher:
545 626 # Generate tenant-specific index name
546 627 index_name = get_tenant_index_name(tenant_id)
547 628 # index_name = "search_products"
548   -
  629 +
549 630 # No longer need to add tenant_id to filters since each tenant has its own index
  631 + image_query_vector = None
  632 + if enable_embedding:
  633 + image_query_vector = parsed_query.image_query_vector
550 634  
551 635 es_query = self.query_builder.build_query(
552 636 query_text=parsed_query.rewritten_query or parsed_query.query_normalized,
553 637 query_vector=parsed_query.query_vector if enable_embedding else None,
554   - image_query_vector=(
555   - getattr(parsed_query, "image_query_vector", None)
556   - if enable_embedding
557   - else None
558   - ),
  638 + image_query_vector=image_query_vector,
559 639 filters=filters,
560 640 range_filters=range_filters,
561 641 facet_configs=facets,
... ... @@ -563,11 +643,18 @@ class Searcher:
563 643 from_=es_fetch_from,
564 644 enable_knn=enable_embedding and (
565 645 parsed_query.query_vector is not None
566   - or getattr(parsed_query, "image_query_vector", None) is not None
  646 + or image_query_vector is not None
567 647 ),
568 648 min_score=min_score,
569 649 parsed_query=parsed_query,
570 650 )
  651 + self._attach_exact_knn_rescore(
  652 + es_query,
  653 + in_rank_window=in_rank_window,
  654 + query_vector=parsed_query.query_vector if enable_embedding else None,
  655 + image_query_vector=image_query_vector,
  656 + parsed_query=parsed_query,
  657 + )
571 658  
572 659 # Add facets for faceted search
573 660 if facets:
... ... @@ -587,8 +674,7 @@ class Searcher:
587 674  
588 675 # In multi-stage rank window, first pass only needs score signals for coarse rank.
589 676 es_query_for_fetch = es_query
590   - rerank_prefetch_source = None
591   - if in_rerank_window:
  677 + if in_rank_window:
592 678 es_query_for_fetch = dict(es_query)
593 679 es_query_for_fetch["_source"] = False
594 680  
... ... @@ -597,31 +683,28 @@ class Searcher:
597 683  
598 684 # Store ES query in context
599 685 context.store_intermediate_result('es_query', es_query)
600   - if in_rerank_window and rerank_prefetch_source is not None:
601   - context.store_intermediate_result('es_query_rerank_prefetch_source', rerank_prefetch_source)
602 686 # Serialize ES query to compute a compact size + stable digest for correlation
603 687 es_query_compact = json.dumps(es_query_for_fetch, ensure_ascii=False, separators=(",", ":"))
604 688 es_query_digest = hashlib.sha256(es_query_compact.encode("utf-8")).hexdigest()[:16]
605 689 knn_enabled = bool(enable_embedding and (
606 690 parsed_query.query_vector is not None
607   - or getattr(parsed_query, "image_query_vector", None) is not None
  691 + or image_query_vector is not None
608 692 ))
609 693 vector_dims = int(len(parsed_query.query_vector)) if parsed_query.query_vector is not None else 0
610 694 image_vector_dims = (
611   - int(len(parsed_query.image_query_vector))
612   - if getattr(parsed_query, "image_query_vector", None) is not None
  695 + int(len(image_query_vector))
  696 + if image_query_vector is not None
613 697 else 0
614 698 )
615 699  
616 700 context.logger.info(
617   - "ES query built | size: %s chars | digest: %s | KNN: %s | vector_dims: %s | image_vector_dims: %s | facets: %s | rerank_prefetch_source: %s",
  701 + "ES query built | size: %s chars | digest: %s | KNN: %s | vector_dims: %s | image_vector_dims: %s | facets: %s",
618 702 len(es_query_compact),
619 703 es_query_digest,
620 704 "yes" if knn_enabled else "no",
621 705 vector_dims,
622 706 image_vector_dims,
623 707 "yes" if facets else "no",
624   - rerank_prefetch_source,
625 708 extra={'reqid': context.reqid, 'uid': context.uid}
626 709 )
627 710 _log_backend_verbose({
... ... @@ -656,7 +739,7 @@ class Searcher:
656 739 body=body_for_es,
657 740 size=es_fetch_size,
658 741 from_=es_fetch_from,
659   - include_named_queries_score=bool(do_rerank and in_rerank_window),
  742 + include_named_queries_score=bool(in_rank_window),
660 743 )
661 744  
662 745 # Store ES response in context
... ... @@ -698,10 +781,177 @@ class Searcher:
698 781 context.end_stage(RequestContextStage.ELASTICSEARCH_SEARCH_PRIMARY)
699 782  
700 783 style_intent_decisions: Dict[str, SkuSelectionDecision] = {}
701   - if do_rerank and in_rerank_window:
  784 + if in_rank_window:
702 785 from dataclasses import asdict
703 786 from config.services_config import get_rerank_backend_config, get_rerank_service_url
704 787 from .rerank_client import coarse_resort_hits, run_lightweight_rerank, run_rerank
  788 + coarse_fusion_debug = asdict(coarse_cfg.fusion)
  789 + stage_fusion_debug = asdict(rc.fusion)
  790 +
  791 + def _rank_map(stage_hits: List[Dict[str, Any]]) -> Dict[str, int]:
  792 + return {
  793 + str(hit.get("_id")): rank
  794 + for rank, hit in enumerate(stage_hits, 1)
  795 + if hit.get("_id") is not None
  796 + }
  797 +
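What `_rank_map` above produces: a 1-based doc-id to rank mapping that skips hits without an `_id` while preserving their positions. A quick sketch with made-up hits:

```python
# Sample hits are made up; the comprehension matches _rank_map above.
hits = [{"_id": "a"}, {"_id": None}, {"_id": "b"}]
ranks = {
    str(hit.get("_id")): rank
    for rank, hit in enumerate(hits, 1)
    if hit.get("_id") is not None
}
# The id-less hit keeps its slot (rank 2) but is omitted from the map.
assert ranks == {"a": 1, "b": 3}
```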
  798 + def _stage_debug_info(
  799 + *,
  800 + enabled: bool,
  801 + applied: bool,
  802 + skipped_reason: Optional[str],
  803 + service_profile: Optional[str],
  804 + query_template: str,
  805 + doc_template: str,
  806 + docs_in: int,
  807 + docs_out: int,
  808 + top_n: int,
  809 + meta: Optional[Dict[str, Any]] = None,
  810 + backend: Optional[str] = None,
  811 + backend_model_name: Optional[str] = None,
  812 + service_url: Optional[str] = None,
  813 + model: Optional[str] = None,
  814 + fusion: Optional[Dict[str, Any]] = None,
  815 + ) -> Dict[str, Any]:
  816 + return {
  817 + "enabled": enabled,
  818 + "applied": applied,
  819 + "passthrough": not applied,
  820 + "skipped_reason": skipped_reason,
  821 + "service_profile": service_profile,
  822 + "service_url": service_url,
  823 + "backend": backend,
  824 + "model": model,
  825 + "backend_model_name": backend_model_name,
  826 + "query_template": query_template,
  827 + "doc_template": doc_template,
  828 + "query_text": str(query_template).format_map({"query": rerank_query}),
  829 + "docs_in": docs_in,
  830 + "docs_out": docs_out,
  831 + "top_n": top_n,
  832 + "meta": meta,
  833 + "fusion": fusion,
  834 + }
  835 +
  836 + def _run_optional_stage(
  837 + *,
  838 + stage: RequestContextStage,
  839 + stage_label: str,
  840 + enabled: bool,
  841 + stage_hits: List[Dict[str, Any]],
  842 + input_limit: int,
  843 + output_limit: int,
  844 + service_profile: Optional[str],
  845 + query_template: str,
  846 + doc_template: str,
  847 + top_n: int,
  848 + debug_key: Optional[str],
  849 + runner,
  850 + ) -> tuple[List[Dict[str, Any]], Dict[str, int], Optional[Dict[str, Any]]]:
  851 + context.start_stage(stage)
  852 + try:
  853 + input_hits = list(stage_hits[:input_limit])
  854 + output_hits = list(stage_hits[:output_limit])
  855 + applied = False
  856 + skip_reason: Optional[str] = None
  857 + meta: Optional[Dict[str, Any]] = None
  858 + debug_rows: Optional[List[Dict[str, Any]]] = None
  859 +
  860 + if enabled and input_hits:
  861 + output_hits_candidate, applied, meta, debug_rows = runner(input_hits)
  862 + if applied:
  863 + output_hits = list((output_hits_candidate or input_hits)[:output_limit])
  864 + else:
  865 + skip_reason = "service_returned_none"
  866 + else:
  867 + skip_reason = "disabled" if not enabled else "no_hits"
  868 +
  869 + ranks = _rank_map(output_hits) if debug else {}
  870 + stage_info = None
  871 + if debug:
  872 + if applied:
  873 + backend_name, backend_cfg = get_rerank_backend_config(service_profile)
  874 + stage_info = _stage_debug_info(
  875 + enabled=True,
  876 + applied=True,
  877 + skipped_reason=None,
  878 + service_profile=service_profile,
  879 + service_url=get_rerank_service_url(profile=service_profile),
  880 + backend=backend_name,
  881 + backend_model_name=backend_cfg.get("model_name"),
  882 + model=meta.get("model") if isinstance(meta, dict) else None,
  883 + query_template=query_template,
  884 + doc_template=doc_template,
  885 + docs_in=len(input_hits),
  886 + docs_out=len(output_hits),
  887 + top_n=top_n,
  888 + meta=meta,
  889 + fusion=stage_fusion_debug,
  890 + )
  891 + if debug_key is not None and debug_rows is not None:
  892 + context.store_intermediate_result(debug_key, debug_rows)
  893 + else:
  894 + stage_info = _stage_debug_info(
  895 + enabled=enabled,
  896 + applied=False,
  897 + skipped_reason=skip_reason,
  898 + service_profile=service_profile,
  899 + query_template=query_template,
  900 + doc_template=doc_template,
  901 + docs_in=len(input_hits),
  902 + docs_out=len(output_hits),
  903 + top_n=top_n,
  904 + fusion=stage_fusion_debug,
  905 + )
  906 +
  907 + if applied:
  908 + context.logger.info(
  909 + "%s completed | docs=%s | top_n=%s | meta=%s",
  910 + stage_label,
  911 + len(output_hits),
  912 + top_n,
  913 + meta,
  914 + extra={'reqid': context.reqid, 'uid': context.uid}
  915 + )
  916 + else:
  917 + context.logger.info(
  918 + "%s passthrough | reason=%s | docs=%s | top_n=%s",
  919 + stage_label,
  920 + skip_reason,
  921 + len(output_hits),
  922 + top_n,
  923 + extra={'reqid': context.reqid, 'uid': context.uid}
  924 + )
  925 + return output_hits, ranks, stage_info
  926 + except Exception as e:
  927 + output_hits = list(stage_hits[:output_limit])
  928 + ranks = _rank_map(output_hits) if debug else {}
  929 + stage_info = None
  930 + if debug:
  931 + stage_info = _stage_debug_info(
  932 + enabled=enabled,
  933 + applied=False,
  934 + skipped_reason="error",
  935 + service_profile=service_profile,
  936 + query_template=query_template,
  937 + doc_template=doc_template,
  938 + docs_in=min(len(stage_hits), input_limit),
  939 + docs_out=len(output_hits),
  940 + top_n=top_n,
  941 + meta={"error": str(e)},
  942 + fusion=stage_fusion_debug,
  943 + )
  944 + context.add_warning(f"{stage_label} failed: {e}")
  945 + context.logger.warning(
  946 + "Call to %s service failed | error: %s",
  947 + stage_label,
  948 + e,
  949 + extra={'reqid': context.reqid, 'uid': context.uid},
  950 + exc_info=True,
  951 + )
  952 + return output_hits, ranks, stage_info
  953 + finally:
  954 + context.end_stage(stage)
705 955  
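The contract of `_run_optional_stage` above is passthrough-safe: on skip or error, the stage degrades to truncating its input to the output window. A simplified skeleton without the logging/debug plumbing (names are illustrative, and the runner here returns only `(hits, applied)`):

```python
# Simplified, illustrative skeleton of the stage wrapper's control flow.
def run_optional_stage(stage_hits, enabled, input_limit, output_limit, runner):
    fallback = list(stage_hits[:output_limit])
    if not enabled or not stage_hits:
        return fallback, False          # disabled or nothing to rank
    try:
        candidate, applied = runner(list(stage_hits[:input_limit]))
    except Exception:
        return fallback, False          # service failure -> passthrough
    if not applied:
        return fallback, False          # service returned nothing usable
    return list((candidate or stage_hits)[:output_limit]), True

hits = [1, 2, 3, 4, 5]
assert run_optional_stage(hits, False, 5, 3, lambda h: (h, True)) == ([1, 2, 3], False)
assert run_optional_stage(hits, True, 5, 3, lambda h: (list(reversed(h)), True)) == ([5, 4, 3], True)
assert run_optional_stage(hits, True, 5, 3, lambda h: 1 / 0) == ([1, 2, 3], False)
```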
706 956 rerank_query = parsed_query.text_for_rerank() if parsed_query else query
707 957 hits = es_response.get("hits", {}).get("hits") or []
... ... @@ -716,17 +966,12 @@ class Searcher:
716 966 hits = hits[:coarse_output_window]
717 967 es_response.setdefault("hits", {})["hits"] = hits
718 968 if debug:
719   - coarse_ranks_by_doc = {
720   - str(hit.get("_id")): rank
721   - for rank, hit in enumerate(hits, 1)
722   - if hit.get("_id") is not None
  969 + coarse_ranks_by_doc = _rank_map(hits)
  970 + coarse_debug_info = {
  971 + "docs_in": es_fetch_size,
  972 + "docs_out": len(hits),
  973 + "fusion": coarse_fusion_debug,
723 974 }
724   - if debug:
725   - coarse_debug_info = {
726   - "docs_in": es_fetch_size,
727   - "docs_out": len(hits),
728   - "fusion": asdict(coarse_cfg.fusion),
729   - }
730 975 context.store_intermediate_result("coarse_rank_scores", coarse_debug)
731 976 context.logger.info(
732 977 "Coarse rank completed | docs_in=%s | docs_out=%s",
... ... @@ -777,72 +1022,42 @@ class Searcher:
777 1022 extra={'reqid': context.reqid, 'uid': context.uid}
778 1023 )
779 1024  
780   - fine_scores: Optional[List[float]] = None
781   - hits = es_response.get("hits", {}).get("hits") or []
782   - if fine_cfg.enabled and hits:
783   - context.start_stage(RequestContextStage.FINE_RANKING)
784   - try:
785   - fine_scores, fine_meta, fine_debug_rows = run_lightweight_rerank(
786   - query=rerank_query,
787   - es_hits=hits[:fine_input_window],
788   - language=language,
789   - timeout_sec=fine_cfg.timeout_sec,
790   - rerank_query_template=fine_query_template,
791   - rerank_doc_template=fine_doc_template,
792   - top_n=fine_output_window,
793   - debug=debug,
794   - fusion=rc.fusion,
795   - style_intent_selected_sku_boost=self.config.query_config.style_intent_selected_sku_boost,
796   - service_profile=fine_cfg.service_profile,
797   - )
798   - if fine_scores is not None:
799   - hits = hits[:fine_output_window]
800   - es_response["hits"]["hits"] = hits
801   - if debug:
802   - fine_ranks_by_doc = {
803   - str(hit.get("_id")): rank
804   - for rank, hit in enumerate(hits, 1)
805   - if hit.get("_id") is not None
806   - }
807   - fine_backend_name, fine_backend_cfg = get_rerank_backend_config(fine_cfg.service_profile)
808   - fine_debug_info = {
809   - "service_profile": fine_cfg.service_profile,
810   - "service_url": get_rerank_service_url(profile=fine_cfg.service_profile),
811   - "backend": fine_backend_name,
812   - "model": fine_meta.get("model") if isinstance(fine_meta, dict) else None,
813   - "backend_model_name": fine_backend_cfg.get("model_name"),
814   - "query_template": fine_query_template,
815   - "doc_template": fine_doc_template,
816   - "query_text": str(fine_query_template).format_map({"query": rerank_query}),
817   - "docs_in": min(len(fine_scores), fine_input_window),
818   - "docs_out": len(hits),
819   - "top_n": fine_output_window,
820   - "meta": fine_meta,
821   - "fusion": asdict(rc.fusion),
822   - }
823   - context.store_intermediate_result("fine_rank_scores", fine_debug_rows)
824   - context.logger.info(
825   - "Fine rank completed | docs=%s | top_n=%s | meta=%s",
826   - len(hits),
827   - fine_output_window,
828   - fine_meta,
829   - extra={'reqid': context.reqid, 'uid': context.uid}
830   - )
831   - except Exception as e:
832   - context.add_warning(f"Fine rerank failed: {e}")
833   - context.logger.warning(
834   - f"Call to fine rerank service failed | error: {e}",
835   - extra={'reqid': context.reqid, 'uid': context.uid},
836   - exc_info=True,
837   - )
838   - finally:
839   - context.end_stage(RequestContextStage.FINE_RANKING)
  1025 + def _run_fine_stage(stage_input: List[Dict[str, Any]]):
  1026 + fine_scores, fine_meta, fine_debug_rows = run_lightweight_rerank(
  1027 + query=rerank_query,
  1028 + es_hits=stage_input,
  1029 + language=language,
  1030 + timeout_sec=fine_cfg.timeout_sec,
  1031 + rerank_query_template=fine_query_template,
  1032 + rerank_doc_template=fine_doc_template,
  1033 + top_n=fine_output_window,
  1034 + debug=debug,
  1035 + fusion=rc.fusion,
  1036 + style_intent_selected_sku_boost=self.config.query_config.style_intent_selected_sku_boost,
  1037 + service_profile=fine_cfg.service_profile,
  1038 + )
  1039 + return stage_input, fine_scores is not None, fine_meta, fine_debug_rows
  1040 +
  1041 + hits, fine_ranks_by_doc, fine_debug_info = _run_optional_stage(
  1042 + stage=RequestContextStage.FINE_RANKING,
  1043 + stage_label="fine rank",
  1044 + enabled=fine_enabled,
  1045 + stage_hits=es_response.get("hits", {}).get("hits") or [],
  1046 + input_limit=fine_input_window,
  1047 + output_limit=fine_output_window,
  1048 + service_profile=fine_cfg.service_profile,
  1049 + query_template=fine_query_template,
  1050 + doc_template=fine_doc_template,
  1051 + top_n=fine_output_window,
  1052 + debug_key="fine_rank_scores",
  1053 + runner=_run_fine_stage,
  1054 + )
  1055 + es_response["hits"]["hits"] = hits
840 1056  
841   - context.start_stage(RequestContextStage.RERANKING)
842   - try:
843   - final_hits = es_response.get("hits", {}).get("hits") or []
844   - final_input = final_hits[:rerank_window]
845   - es_response["hits"]["hits"] = final_input
  1057 + def _run_rerank_stage(stage_input: List[Dict[str, Any]]):
  1058 + nonlocal es_response
  1059 +
  1060 + es_response["hits"]["hits"] = stage_input
846 1061 es_response, rerank_meta, fused_debug = run_rerank(
847 1062 query=rerank_query,
848 1063 es_response=es_response,
... ... @@ -858,48 +1073,31 @@ class Searcher:
858 1073 service_profile=rc.service_profile,
859 1074 style_intent_selected_sku_boost=self.config.query_config.style_intent_selected_sku_boost,
860 1075 )
861   -
862   - if rerank_meta is not None:
863   - if debug:
864   - rerank_ranks_by_doc = {
865   - str(hit.get("_id")): rank
866   - for rank, hit in enumerate(es_response.get("hits", {}).get("hits") or [], 1)
867   - if hit.get("_id") is not None
868   - }
869   - rerank_backend_name, rerank_backend_cfg = get_rerank_backend_config(rc.service_profile)
870   - rerank_debug_info = {
871   - "service_profile": rc.service_profile,
872   - "service_url": get_rerank_service_url(profile=rc.service_profile),
873   - "backend": rerank_backend_name,
874   - "model": rerank_meta.get("model") if isinstance(rerank_meta, dict) else None,
875   - "backend_model_name": rerank_backend_cfg.get("model_name"),
876   - "query_template": effective_query_template,
877   - "doc_template": effective_doc_template,
878   - "query_text": str(effective_query_template).format_map({"query": rerank_query}),
879   - "docs_in": len(final_input),
880   - "docs_out": len(es_response.get("hits", {}).get("hits") or []),
881   - "top_n": from_ + size,
882   - "meta": rerank_meta,
883   - "fusion": asdict(rc.fusion),
884   - }
885   - context.store_intermediate_result("rerank_scores", fused_debug)
886   - context.logger.info(
886   - f"Rerank done | docs={len(es_response.get('hits', {}).get('hits') or [])} | "
888   - f"top_n={from_ + size} | meta={rerank_meta}",
889   - extra={'reqid': context.reqid, 'uid': context.uid}
890   - )
891   - except Exception as e:
892   - context.add_warning(f"Rerank failed: {e}")
893   - context.logger.warning(
893   - f"Rerank service call failed | error: {e}",
895   - extra={'reqid': context.reqid, 'uid': context.uid},
896   - exc_info=True,
  1076 + return (
  1077 + es_response.get("hits", {}).get("hits") or [],
  1078 + rerank_meta is not None,
  1079 + rerank_meta,
  1080 + fused_debug,
897 1081 )
898   - finally:
899   - context.end_stage(RequestContextStage.RERANKING)
900 1082  
901   - # When this request falls within the rerank window: the top rerank_window hits are already multi-stage ranked; slice by the requested from/size for pagination
902   - if in_rerank_window:
  1083 + hits, rerank_ranks_by_doc, rerank_debug_info = _run_optional_stage(
  1084 + stage=RequestContextStage.RERANKING,
  1085 + stage_label="reranking",
  1086 + enabled=do_rerank,
  1087 + stage_hits=es_response.get("hits", {}).get("hits") or [],
  1088 + input_limit=rerank_window,
  1089 + output_limit=rerank_window,
  1090 + service_profile=rc.service_profile,
  1091 + query_template=effective_query_template,
  1092 + doc_template=effective_doc_template,
  1093 + top_n=from_ + size,
  1094 + debug_key="rerank_scores",
  1095 + runner=_run_rerank_stage,
  1096 + )
  1097 + es_response["hits"]["hits"] = hits
  1098 +
  1099 + # When this request falls within the ranking window: the top rerank_window hits are already multi-stage ranked; slice by the requested from/size for pagination
  1100 + if in_rank_window:
903 1101 hits = es_response.get("hits", {}).get("hits") or []
904 1102 sliced = hits[from_ : from_ + size]
905 1103 es_response.setdefault("hits", {})["hits"] = sliced
... ... @@ -961,12 +1159,12 @@ class Searcher:
961 1159 context.end_stage(RequestContextStage.ELASTICSEARCH_PAGE_FILL)
962 1160  
963 1161 context.logger.info(
964   - f"Rerank pagination slice | from={from_}, size={size}, returned {len(sliced)} hits",
  1162 + f"Ranking-window pagination slice | from={from_}, size={size}, returned {len(sliced)} hits",
965 1163 extra={'reqid': context.reqid, 'uid': context.uid}
966 1164 )
967 1165  
968 1166 # Outside the rank window: style intent is applied before result_processing so it can be timed separately and dovetails with the ES recall stage
969   - if self._has_style_intent(parsed_query) and not in_rerank_window:
  1167 + if self._has_style_intent(parsed_query) and not in_rank_window:
970 1168 es_hits_pre = es_response.get("hits", {}).get("hits") or []
971 1169 style_intent_decisions = self._apply_style_intent_to_hits(
972 1170 es_hits_pre,
... ... @@ -1259,7 +1457,7 @@ class Searcher:
1259 1457 # Collect debug information if requested
1260 1458 debug_info = None
1261 1459 if debug:
1262   - query_tokens = getattr(parsed_query, "query_tokens", []) if parsed_query else []
  1460 + query_tokens = parsed_query.query_tokens if parsed_query else []
1263 1461 token_count = len(query_tokens)
1264 1462 text_knn_is_long = token_count >= 5
1265 1463 text_knn_k = self.query_builder.knn_text_k_long if text_knn_is_long else self.query_builder.knn_text_k
... ... @@ -1279,7 +1477,7 @@ class Searcher:
1279 1477 "translations": context.query_analysis.translations,
1280 1478 "keywords_queries": context.query_analysis.keywords_queries,
1281 1479 "has_vector": context.query_analysis.query_vector is not None,
1282   - "has_image_vector": getattr(parsed_query, "image_query_vector", None) is not None,
  1480 + "has_image_vector": parsed_query.image_query_vector is not None,
1283 1481 "query_tokens": query_tokens,
1284 1482 "intent_detection": context.get_intermediate_result("style_intent_profile"),
1285 1483 },
... ... @@ -1298,9 +1496,10 @@ class Searcher:
1298 1496 },
1299 1497 "image_knn": {
1300 1498 "enabled": bool(
1301   - enable_embedding
  1499 + self.image_embedding_field
  1500 + and enable_embedding
1302 1501 and parsed_query
1303   - and getattr(parsed_query, "image_query_vector", None) is not None
  1502 + and image_query_vector is not None
1304 1503 ),
1305 1504 "k": self.query_builder.knn_image_k,
1306 1505 "num_candidates": self.query_builder.knn_image_num_candidates,
... ... @@ -1311,9 +1510,14 @@ class Searcher:
1311 1510 "es_query_context": {
1312 1511 "es_fetch_from": es_fetch_from,
1313 1512 "es_fetch_size": es_fetch_size,
1314   - "in_rerank_window": in_rerank_window,
1315   - "rerank_prefetch_source": context.get_intermediate_result('es_query_rerank_prefetch_source'),
1316   - "include_named_queries_score": bool(do_rerank and in_rerank_window),
  1513 + "in_rank_window": in_rank_window,
  1514 + "include_named_queries_score": bool(in_rank_window),
  1515 + "exact_knn_rescore_enabled": bool(rc.exact_knn_rescore_enabled and in_rank_window),
  1516 + "exact_knn_rescore_window": (
  1517 + self._resolve_exact_knn_rescore_window()
  1518 + if rc.exact_knn_rescore_enabled and in_rank_window
  1519 + else None
  1520 + ),
1317 1521 },
1318 1522 "es_response": {
1319 1523 "took_ms": es_response.get('took', 0),
... ... @@ -1369,10 +1573,10 @@ class Searcher:
1369 1573 "retrieval_plan": debug_info["retrieval_plan"],
1370 1574 "ranking_windows": {
1371 1575 "es_fetch_size": es_fetch_size,
1372   - "coarse_output_window": coarse_output_window if do_rerank and in_rerank_window else None,
1373   - "fine_input_window": fine_input_window if do_rerank and in_rerank_window else None,
1374   - "fine_output_window": fine_output_window if do_rerank and in_rerank_window else None,
1375   - "rerank_window": rerank_window if do_rerank and in_rerank_window else None,
  1576 + "coarse_output_window": coarse_output_window if in_rank_window else None,
  1577 + "fine_input_window": fine_input_window if in_rank_window else None,
  1578 + "fine_output_window": fine_output_window if in_rank_window else None,
  1579 + "rerank_window": rerank_window if in_rank_window else None,
1376 1580 "page_from": from_,
1377 1581 "page_size": size,
1378 1582 },
... ...
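
The hunks above fold the per-stage `start_stage` / `try` / `except` / `finally` scaffolding (previously duplicated for fine ranking and reranking) into a shared `_run_optional_stage` helper driven by per-stage closures. A minimal sketch of that pattern, with hypothetical names and a simplified signature (the real helper additionally handles stage timing, debug capture, and logging):

```python
from typing import Any, Callable, Dict, List, Tuple

Hits = List[Dict[str, Any]]

def run_optional_stage(
    enabled: bool,
    stage_hits: Hits,
    input_limit: int,
    runner: Callable[[Hits], Tuple[Hits, bool]],
) -> Hits:
    """Run one optional ranking stage; keep the original order on any failure."""
    if not enabled:
        return stage_hits
    stage_input = stage_hits[:input_limit]
    try:
        ranked, ok = runner(stage_input)
        # A stage that produced no meta/scores is treated as a no-op.
        return ranked if ok else stage_hits
    except Exception:
        # An optional stage must never fail the whole request.
        return stage_hits
```

Each stage then reduces to a closure passed as `runner`, mirroring `_run_fine_stage` and `_run_rerank_stage` in the diff.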
suggestion/builder.py
... ... @@ -366,7 +366,8 @@ class SuggestionIndexBuilder:
366 366  
367 367 index_name = get_tenant_index_name(tenant_id)
368 368 search_after: Optional[List[Any]] = None
369   -
  369 + logger.debug("Using index %s for tenant %s", index_name, tenant_id)
  370 + total_processed = 0
370 371 while True:
371 372 body: Dict[str, Any] = {
372 373 "size": batch_size,
... ... @@ -385,10 +386,13 @@ class SuggestionIndexBuilder:
385 386 if not hits:
386 387 break
387 388 for hit in hits:
  389 + total_processed += 1
388 390 yield hit
389 391 search_after = hits[-1].get("sort")
390 392 if len(hits) < batch_size:
391 393 break
  394 + logger.debug("Processed %s products in total for tenant %s", total_processed, tenant_id)
  395 +
392 396  
393 397 def _iter_query_log_rows(
394 398 self,
... ...
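
`_iter_products` above streams the entire tenant index with `search_after` rather than from/size paging, which Elasticsearch caps at 10k hits by default. The loop shape, reduced to its essentials (assuming any client object that exposes an ES-style `search(index=..., body=...)` method):

```python
from typing import Any, Dict, Iterator, List, Optional

def iter_all_hits(client: Any, index_name: str, batch_size: int = 500) -> Iterator[Dict[str, Any]]:
    """Yield every hit in an index via search_after (no 10k from/size ceiling)."""
    search_after: Optional[List[Any]] = None
    while True:
        body: Dict[str, Any] = {
            "size": batch_size,
            # The sort must be a deterministic total order, e.g. a unique keyword field.
            "sort": [{"id.keyword": {"order": "asc", "missing": "_last"}}],
            "query": {"match_all": {}},
        }
        if search_after is not None:
            body["search_after"] = search_after
        hits = client.search(index=index_name, body=body).get("hits", {}).get("hits", []) or []
        if not hits:
            break
        yield from hits
        search_after = hits[-1]["sort"]  # resume point for the next page
        if len(hits) < batch_size:
            break
```

The final `len(hits) < batch_size` check only saves one empty round trip; the `if not hits` guard alone would also terminate correctly.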
suggestion/builder.py.bak 0 → 100644
... ... @@ -0,0 +1,1014 @@
  1 +"""
  2 +Suggestion index builder (Phase 2).
  3 +
  4 +Capabilities:
  5 +- Full rebuild to versioned index
  6 +- Atomic alias publish
  7 +- Incremental update from query logs with watermark
  8 +"""
  9 +
  10 +import json
  11 +import logging
  12 +import math
  13 +import re
  14 +import unicodedata
  15 +from dataclasses import dataclass, field
  16 +from datetime import datetime, timedelta, timezone
  17 +from typing import Any, Dict, Iterator, List, Optional, Tuple
  18 +
  19 +from sqlalchemy import text
  20 +
  21 +from config.loader import get_app_config
  22 +from config.tenant_config_loader import get_tenant_config_loader
  23 +from query.query_parser import detect_text_language_for_suggestions
  24 +from suggestion.mapping import build_suggestion_mapping
  25 +from utils.es_client import ESClient
  26 +
  27 +logger = logging.getLogger(__name__)
  28 +
  29 +
  30 +def _index_prefix() -> str:
  31 + return get_app_config().runtime.index_namespace or ""
  32 +
  33 +
  34 +def get_suggestion_alias_name(tenant_id: str) -> str:
  35 + """Read alias for suggestion index (single source of truth)."""
  36 + return f"{_index_prefix()}search_suggestions_tenant_{tenant_id}_current"
  37 +
  38 +
  39 +def get_suggestion_versioned_index_name(tenant_id: str, build_at: Optional[datetime] = None) -> str:
  40 + """Versioned suggestion index name."""
  41 + ts = (build_at or datetime.now(timezone.utc)).strftime("%Y%m%d%H%M%S%f")
  42 + return f"{_index_prefix()}search_suggestions_tenant_{tenant_id}_v{ts}"
  43 +
  44 +
  45 +def get_suggestion_versioned_index_pattern(tenant_id: str) -> str:
  46 + return f"{_index_prefix()}search_suggestions_tenant_{tenant_id}_v*"
  47 +
  48 +
  49 +def get_suggestion_meta_index_name() -> str:
  50 + return f"{_index_prefix()}search_suggestions_meta"
  51 +
  52 +
  53 +@dataclass
  54 +class SuggestionCandidate:
  55 + text: str
  56 + text_norm: str
  57 + lang: str
  58 + sources: set = field(default_factory=set)
  59 + title_spu_ids: set = field(default_factory=set)
  60 + qanchor_spu_ids: set = field(default_factory=set)
  61 + tag_spu_ids: set = field(default_factory=set)
  62 + query_count_7d: int = 0
  63 + query_count_30d: int = 0
  64 + lang_confidence: float = 1.0
  65 + lang_source: str = "default"
  66 + lang_conflict: bool = False
  67 +
  68 + def add_product(self, source: str, spu_id: str) -> None:
  69 + self.sources.add(source)
  70 + if source == "title":
  71 + self.title_spu_ids.add(spu_id)
  72 + elif source == "qanchor":
  73 + self.qanchor_spu_ids.add(spu_id)
  74 + elif source == "tag":
  75 + self.tag_spu_ids.add(spu_id)
  76 +
  77 + def add_query_log(self, is_7d: bool) -> None:
  78 + self.sources.add("query_log")
  79 + self.query_count_30d += 1
  80 + if is_7d:
  81 + self.query_count_7d += 1
  82 +
  83 +
  84 +@dataclass
  85 +class QueryDelta:
  86 + tenant_id: str
  87 + lang: str
  88 + text: str
  89 + text_norm: str
  90 + delta_7d: int = 0
  91 + delta_30d: int = 0
  92 + lang_confidence: float = 1.0
  93 + lang_source: str = "default"
  94 + lang_conflict: bool = False
  95 +
  96 +
  97 +class SuggestionIndexBuilder:
  98 + """Build and update suggestion index."""
  99 +
  100 + def __init__(self, es_client: ESClient, db_engine: Any):
  101 + self.es_client = es_client
  102 + self.db_engine = db_engine
  103 +
  104 + def _format_allocation_failure(self, index_name: str) -> str:
  105 + health = self.es_client.wait_for_index_ready(index_name=index_name, timeout="5s")
  106 + explain = self.es_client.get_allocation_explain(index_name=index_name)
  107 +
  108 + parts = [
  109 + f"Suggestion index '{index_name}' was created but is not allocatable/readable yet",
  110 + f"health_status={health.get('status')}",
  111 + f"timed_out={health.get('timed_out')}",
  112 + ]
  113 + if health.get("error"):
  114 + parts.append(f"health_error={health['error']}")
  115 +
  116 + if explain:
  117 + unassigned = explain.get("unassigned_info") or {}
  118 + if unassigned.get("reason"):
  119 + parts.append(f"unassigned_reason={unassigned['reason']}")
  120 + if unassigned.get("last_allocation_status"):
  121 + parts.append(f"last_allocation_status={unassigned['last_allocation_status']}")
  122 +
  123 + for node in explain.get("node_allocation_decisions") or []:
  124 + node_name = node.get("node_name") or node.get("node_id") or "unknown-node"
  125 + for decider in node.get("deciders") or []:
  126 + if decider.get("decision") == "NO":
  127 + parts.append(
  128 + f"{node_name}:{decider.get('decider')}={decider.get('explanation')}"
  129 + )
  130 + return "; ".join(parts)
  131 +
  132 + return "; ".join(parts)
  133 +
  134 + def _create_fresh_versioned_index(
  135 + self,
  136 + tenant_id: str,
  137 + mapping: Dict[str, Any],
  138 + max_attempts: int = 5,
  139 + ) -> str:
  140 + for attempt in range(1, max_attempts + 1):
  141 + index_name = get_suggestion_versioned_index_name(tenant_id)
  142 + if self.es_client.index_exists(index_name):
  143 + logger.warning(
  144 + "Suggestion index name collision before create for tenant=%s index=%s attempt=%s/%s",
  145 + tenant_id,
  146 + index_name,
  147 + attempt,
  148 + max_attempts,
  149 + )
  150 + continue
  151 +
  152 + if self.es_client.create_index(index_name, mapping):
  153 + return index_name
  154 +
  155 + if self.es_client.index_exists(index_name):
  156 + logger.warning(
  157 + "Suggestion index name collision during create for tenant=%s index=%s attempt=%s/%s",
  158 + tenant_id,
  159 + index_name,
  160 + attempt,
  161 + max_attempts,
  162 + )
  163 + continue
  164 +
  165 + raise RuntimeError(f"Failed to create suggestion index: {index_name}")
  166 +
  167 + raise RuntimeError(
  168 + f"Failed to allocate a unique suggestion index name for tenant={tenant_id} after {max_attempts} attempts"
  169 + )
  170 +
  171 + def _ensure_new_index_ready(self, index_name: str) -> None:
  172 + health = self.es_client.wait_for_index_ready(index_name=index_name, timeout="5s")
  173 + if health.get("ok"):
  174 + return
  175 + raise RuntimeError(self._format_allocation_failure(index_name))
  176 +
  177 + @staticmethod
  178 + def _to_utc(dt: Any) -> Optional[datetime]:
  179 + if dt is None:
  180 + return None
  181 + if isinstance(dt, datetime):
  182 + if dt.tzinfo is None:
  183 + return dt.replace(tzinfo=timezone.utc)
  184 + return dt.astimezone(timezone.utc)
  185 + return None
  186 +
  187 + @staticmethod
  188 + def _normalize_text(value: str) -> str:
  189 + text_value = unicodedata.normalize("NFKC", (value or "")).strip().lower()
  190 + text_value = re.sub(r"\s+", " ", text_value)
  191 + return text_value
  192 +
  193 + @staticmethod
  194 + def _prepare_title_for_suggest(title: str, max_len: int = 120) -> str:
  195 + """
  196 + Keep title-derived suggestions concise:
  197 + - keep raw title when short enough
  198 + - for long titles, keep the leading phrase before common separators
  199 + - fallback to hard truncate
  200 + """
  201 + raw = str(title or "").strip()
  202 + if not raw:
  203 + return ""
  204 + if len(raw) <= max_len:
  205 + return raw
  206 +
  207 + head = re.split(r"[,,;;|/\\\\((\\[【]", raw, maxsplit=1)[0].strip()
  208 + if 1 < len(head) <= max_len:
  209 + return head
  210 +
  211 + truncated = raw[:max_len].rstrip(" ,,;;|/\\\\-—–()()[]【】")
  212 + return truncated or raw[:max_len]
  213 +
  214 + @staticmethod
  215 + def _split_qanchors(value: Any) -> List[str]:
  216 + if value is None:
  217 + return []
  218 + if isinstance(value, list):
  219 + return [str(x).strip() for x in value if str(x).strip()]
  220 + raw = str(value).strip()
  221 + if not raw:
  222 + return []
  223 + parts = re.split(r"[,、,;|/\n\t]+", raw)
  224 + out = [p.strip() for p in parts if p and p.strip()]
  225 + if not out:
  226 + return [raw]
  227 + return out
  228 +
  229 + @staticmethod
  230 + def _iter_product_tags(raw: Any) -> List[str]:
  231 + if raw is None:
  232 + return []
  233 + if isinstance(raw, list):
  234 + return [str(x).strip() for x in raw if str(x).strip()]
  235 + s = str(raw).strip()
  236 + if not s:
  237 + return []
  238 + parts = re.split(r"[,、,;|/\n\t]+", s)
  239 + out = [p.strip() for p in parts if p and p.strip()]
  240 + return out if out else [s]
  241 +
  242 + def _iter_multilang_product_tags(
  243 + self,
  244 + raw: Any,
  245 + index_languages: List[str],
  246 + primary_language: str,
  247 + ) -> List[Tuple[str, str]]:
  248 + if isinstance(raw, dict):
  249 + pairs: List[Tuple[str, str]] = []
  250 + for lang in index_languages:
  251 + for tag in self._iter_product_tags(raw.get(lang)):
  252 + pairs.append((lang, tag))
  253 + return pairs
  254 +
  255 + pairs = []
  256 + for tag in self._iter_product_tags(raw):
  257 + tag_lang, _, _ = detect_text_language_for_suggestions(
  258 + tag,
  259 + index_languages=index_languages,
  260 + primary_language=primary_language,
  261 + )
  262 + pairs.append((tag_lang, tag))
  263 + return pairs
  264 +
  265 + @staticmethod
  266 + def _looks_noise(text_value: str) -> bool:
  267 + if not text_value:
  268 + return True
  269 + if len(text_value) > 120:
  270 + return True
  271 + if re.fullmatch(r"[\W_]+", text_value):
  272 + return True
  273 + return False
  274 +
  275 + @staticmethod
  276 + def _normalize_lang(lang: Optional[str]) -> Optional[str]:
  277 + if not lang:
  278 + return None
  279 + token = str(lang).strip().lower().replace("-", "_")
  280 + if not token:
  281 + return None
  282 + if token in {"zh_tw", "pt_br"}:
  283 + return token
  284 + return token.split("_")[0]
  285 +
  286 + @staticmethod
  287 + def _parse_request_params_language(raw: Any) -> Optional[str]:
  288 + if raw is None:
  289 + return None
  290 + if isinstance(raw, dict):
  291 + return raw.get("language")
  292 + text_raw = str(raw).strip()
  293 + if not text_raw:
  294 + return None
  295 + try:
  296 + obj = json.loads(text_raw)
  297 + if isinstance(obj, dict):
  298 + return obj.get("language")
  299 + except Exception:
  300 + return None
  301 + return None
  302 +
  303 + def _resolve_query_language(
  304 + self,
  305 + query: str,
  306 + log_language: Optional[str],
  307 + request_params: Any,
  308 + index_languages: List[str],
  309 + primary_language: str,
  310 + ) -> Tuple[str, float, str, bool]:
  311 + """Resolve lang with priority: log field > request_params > script/model."""
  312 + langs_set = set(index_languages or [])
  313 + primary = self._normalize_lang(primary_language) or "en"
  314 + if primary not in langs_set and langs_set:
  315 + primary = index_languages[0]
  316 +
  317 + log_lang = self._normalize_lang(log_language)
  318 + req_lang = self._normalize_lang(self._parse_request_params_language(request_params))
  319 + conflict = bool(log_lang and req_lang and log_lang != req_lang)
  320 +
  321 + if log_lang and (not langs_set or log_lang in langs_set):
  322 + return log_lang, 1.0, "log_field", conflict
  323 +
  324 + if req_lang and (not langs_set or req_lang in langs_set):
  325 + return req_lang, 1.0, "request_params", conflict
  326 +
  327 + det_lang, conf, det_source = detect_text_language_for_suggestions(
  328 + query,
  329 + index_languages=index_languages,
  330 + primary_language=primary,
  331 + )
  332 + if det_lang and (not langs_set or det_lang in langs_set):
  333 + return det_lang, conf, det_source, conflict
  334 +
  335 + return primary, 0.3, "default", conflict
  336 +
  337 + @staticmethod
  338 + def _compute_rank_score(
  339 + query_count_30d: int,
  340 + query_count_7d: int,
  341 + qanchor_doc_count: int,
  342 + title_doc_count: int,
  343 + tag_doc_count: int = 0,
  344 + ) -> float:
  345 + return (
  346 + 1.8 * math.log1p(max(query_count_30d, 0))
  347 + + 1.2 * math.log1p(max(query_count_7d, 0))
  348 + + 1.0 * math.log1p(max(qanchor_doc_count, 0))
  349 + + 0.85 * math.log1p(max(tag_doc_count, 0))
  350 + + 0.6 * math.log1p(max(title_doc_count, 0))
  351 + )
  352 +
  353 + @classmethod
  354 + def _compute_rank_score_from_candidate(cls, c: SuggestionCandidate) -> float:
  355 + return cls._compute_rank_score(
  356 + query_count_30d=c.query_count_30d,
  357 + query_count_7d=c.query_count_7d,
  358 + qanchor_doc_count=len(c.qanchor_spu_ids),
  359 + title_doc_count=len(c.title_spu_ids),
  360 + tag_doc_count=len(c.tag_spu_ids),
  361 + )
  362 +
  363 + def _iter_products(self, tenant_id: str, batch_size: int = 500) -> Iterator[Dict[str, Any]]:
  364 + """Stream product docs from tenant index using search_after."""
  365 + from indexer.mapping_generator import get_tenant_index_name
  366 +
  367 + index_name = get_tenant_index_name(tenant_id)
  368 + search_after: Optional[List[Any]] = None
  369 +
  370 + while True:
  371 + body: Dict[str, Any] = {
  372 + "size": batch_size,
  373 + "_source": ["id", "spu_id", "title", "qanchors", "enriched_tags"],
  374 + "sort": [
  375 + {"spu_id": {"order": "asc", "missing": "_last"}},
  376 + {"id.keyword": {"order": "asc", "missing": "_last"}},
  377 + ],
  378 + "query": {"match_all": {}},
  379 + }
  380 + if search_after is not None:
  381 + body["search_after"] = search_after
  382 +
  383 + resp = self.es_client.client.search(index=index_name, body=body)
  384 + hits = resp.get("hits", {}).get("hits", []) or []
  385 + if not hits:
  386 + break
  387 + for hit in hits:
  388 + yield hit
  389 + search_after = hits[-1].get("sort")
  390 + if len(hits) < batch_size:
  391 + break
  392 +
  393 + def _iter_query_log_rows(
  394 + self,
  395 + tenant_id: str,
  396 + since: datetime,
  397 + until: datetime,
  398 + fetch_size: int = 2000,
  399 + ) -> Iterator[Any]:
  400 + """Stream search logs from MySQL with bounded time range."""
  401 + query_sql = text(
  402 + """
  403 + SELECT query, language, request_params, create_time
  404 + FROM shoplazza_search_log
  405 + WHERE tenant_id = :tenant_id
  406 + AND deleted = 0
  407 + AND query IS NOT NULL
  408 + AND query <> ''
  409 + AND create_time >= :since_time
  410 + AND create_time < :until_time
  411 + ORDER BY create_time ASC
  412 + """
  413 + )
  414 +
  415 + with self.db_engine.connect().execution_options(stream_results=True) as conn:
  416 + result = conn.execute(
  417 + query_sql,
  418 + {
  419 + "tenant_id": int(tenant_id),
  420 + "since_time": since,
  421 + "until_time": until,
  422 + },
  423 + )
  424 + while True:
  425 + rows = result.fetchmany(fetch_size)
  426 + if not rows:
  427 + break
  428 + for row in rows:
  429 + yield row
  430 +
  431 + def _ensure_meta_index(self) -> str:
  432 + meta_index = get_suggestion_meta_index_name()
  433 + if self.es_client.index_exists(meta_index):
  434 + return meta_index
  435 + body = {
  436 + "settings": {
  437 + "number_of_shards": 1,
  438 + "number_of_replicas": 0,
  439 + "refresh_interval": "1s",
  440 + },
  441 + "mappings": {
  442 + "properties": {
  443 + "tenant_id": {"type": "keyword"},
  444 + "active_alias": {"type": "keyword"},
  445 + "active_index": {"type": "keyword"},
  446 + "last_full_build_at": {"type": "date"},
  447 + "last_incremental_build_at": {"type": "date"},
  448 + "last_incremental_watermark": {"type": "date"},
  449 + "updated_at": {"type": "date"},
  450 + }
  451 + },
  452 + }
  453 + if not self.es_client.create_index(meta_index, body):
  454 + raise RuntimeError(f"Failed to create suggestion meta index: {meta_index}")
  455 + return meta_index
  456 +
  457 + def _get_meta(self, tenant_id: str) -> Dict[str, Any]:
  458 + meta_index = self._ensure_meta_index()
  459 + try:
  460 + resp = self.es_client.client.get(index=meta_index, id=str(tenant_id))
  461 + return resp.get("_source", {}) or {}
  462 + except Exception:
  463 + return {}
  464 +
  465 + def _upsert_meta(self, tenant_id: str, patch: Dict[str, Any]) -> None:
  466 + meta_index = self._ensure_meta_index()
  467 + current = self._get_meta(tenant_id)
  468 + now_iso = datetime.now(timezone.utc).isoformat()
  469 + merged = {
  470 + "tenant_id": str(tenant_id),
  471 + **current,
  472 + **patch,
  473 + "updated_at": now_iso,
  474 + }
  475 + self.es_client.client.index(index=meta_index, id=str(tenant_id), document=merged, refresh="wait_for")
  476 +
  477 + def _cleanup_old_versions(self, tenant_id: str, keep_versions: int, protected_indices: Optional[List[str]] = None) -> List[str]:
  478 + if keep_versions < 1:
  479 + keep_versions = 1
  480 + protected = set(protected_indices or [])
  481 + pattern = get_suggestion_versioned_index_pattern(tenant_id)
  482 + all_indices = self.es_client.list_indices(pattern)
  483 + if len(all_indices) <= keep_versions:
  484 + return []
  485 +
  486 + # Names are timestamp-ordered by suffix; keep newest N.
  487 + kept = set(sorted(all_indices)[-keep_versions:])
  488 + dropped: List[str] = []
  489 + for idx in sorted(all_indices):
  490 + if idx in kept or idx in protected:
  491 + continue
  492 + if self.es_client.delete_index(idx):
  493 + dropped.append(idx)
  494 + return dropped
  495 +
  496 + def _publish_alias(self, tenant_id: str, index_name: str, keep_versions: int = 2) -> Dict[str, Any]:
  497 + alias_name = get_suggestion_alias_name(tenant_id)
  498 + current_indices = self.es_client.get_alias_indices(alias_name)
  499 +
  500 + actions: List[Dict[str, Any]] = []
  501 + for idx in current_indices:
  502 + actions.append({"remove": {"index": idx, "alias": alias_name}})
  503 + actions.append({"add": {"index": index_name, "alias": alias_name}})
  504 +
  505 + if not self.es_client.update_aliases(actions):
  506 + raise RuntimeError(f"Failed to publish alias {alias_name} -> {index_name}")
  507 +
  508 + dropped = self._cleanup_old_versions(
  509 + tenant_id=tenant_id,
  510 + keep_versions=keep_versions,
  511 + protected_indices=[index_name],
  512 + )
  513 +
  514 + self._upsert_meta(
  515 + tenant_id,
  516 + {
  517 + "active_alias": alias_name,
  518 + "active_index": index_name,
  519 + },
  520 + )
  521 +
  522 + return {
  523 + "alias": alias_name,
  524 + "previous_indices": current_indices,
  525 + "current_index": index_name,
  526 + "dropped_old_indices": dropped,
  527 + }
  528 +
  529 + def _resolve_incremental_target_index(self, tenant_id: str) -> Optional[str]:
  530 + """Resolve active suggestion index for incremental updates (alias only)."""
  531 + alias_name = get_suggestion_alias_name(tenant_id)
  532 + aliased = self.es_client.get_alias_indices(alias_name)
  533 + if aliased:
  534 + # alias should map to one index in this design
  535 + return sorted(aliased)[-1]
  536 + return None
  537 +
  538 + def _build_full_candidates(
  539 + self,
  540 + tenant_id: str,
  541 + index_languages: List[str],
  542 + primary_language: str,
  543 + days: int,
  544 + batch_size: int,
  545 + min_query_len: int,
  546 + ) -> Dict[Tuple[str, str], SuggestionCandidate]:
  547 + key_to_candidate: Dict[Tuple[str, str], SuggestionCandidate] = {}
  548 +
  549 + # Step 1: product title/qanchors
  550 + for hit in self._iter_products(tenant_id, batch_size=batch_size):
  551 + src = hit.get("_source", {}) or {}
  552 + product_id = str(src.get("spu_id") or src.get("id") or hit.get("_id") or "")
  553 + if not product_id:
  554 + continue
  555 + title_obj = src.get("title") or {}
  556 + qanchor_obj = src.get("qanchors") or {}
  557 +
  558 + for lang in index_languages:
  559 + title = ""
  560 + if isinstance(title_obj, dict):
  561 + title = self._prepare_title_for_suggest(title_obj.get(lang) or "")
  562 + if title:
  563 + text_norm = self._normalize_text(title)
  564 + if not self._looks_noise(text_norm):
  565 + key = (lang, text_norm)
  566 + c = key_to_candidate.get(key)
  567 + if c is None:
  568 + c = SuggestionCandidate(text=title, text_norm=text_norm, lang=lang)
  569 + key_to_candidate[key] = c
  570 + c.add_product("title", spu_id=product_id)
  571 +
  572 + q_raw = None
  573 + if isinstance(qanchor_obj, dict):
  574 + q_raw = qanchor_obj.get(lang)
  575 + for q_text in self._split_qanchors(q_raw):
  576 + text_norm = self._normalize_text(q_text)
  577 + if self._looks_noise(text_norm):
  578 + continue
  579 + key = (lang, text_norm)
  580 + c = key_to_candidate.get(key)
  581 + if c is None:
  582 + c = SuggestionCandidate(text=q_text, text_norm=text_norm, lang=lang)
  583 + key_to_candidate[key] = c
  584 + c.add_product("qanchor", spu_id=product_id)
  585 +
  586 + for tag_lang, tag in self._iter_multilang_product_tags(
  587 + src.get("enriched_tags"),
  588 + index_languages=index_languages,
  589 + primary_language=primary_language,
  590 + ):
  591 + text_norm = self._normalize_text(tag)
  592 + if self._looks_noise(text_norm):
  593 + continue
  594 + key = (tag_lang, text_norm)
  595 + c = key_to_candidate.get(key)
  596 + if c is None:
  597 + c = SuggestionCandidate(text=tag, text_norm=text_norm, lang=tag_lang)
  598 + key_to_candidate[key] = c
  599 + c.add_product("tag", spu_id=product_id)
  600 +
  601 + # Step 2: query logs
  602 + now = datetime.now(timezone.utc)
  603 + since = now - timedelta(days=days)
  604 + since_7d = now - timedelta(days=7)
  605 +
  606 + for row in self._iter_query_log_rows(tenant_id=tenant_id, since=since, until=now):
  607 + q = str(row.query or "").strip()
  608 + if len(q) < min_query_len:
  609 + continue
  610 +
  611 + lang, conf, source, conflict = self._resolve_query_language(
  612 + query=q,
  613 + log_language=getattr(row, "language", None),
  614 + request_params=getattr(row, "request_params", None),
  615 + index_languages=index_languages,
  616 + primary_language=primary_language,
  617 + )
  618 + text_norm = self._normalize_text(q)
  619 + if self._looks_noise(text_norm):
  620 + continue
  621 +
  622 + key = (lang, text_norm)
  623 + c = key_to_candidate.get(key)
  624 + if c is None:
  625 + c = SuggestionCandidate(text=q, text_norm=text_norm, lang=lang)
  626 + key_to_candidate[key] = c
  627 +
  628 + c.lang_confidence = max(c.lang_confidence, conf)
  629 + c.lang_source = source if c.lang_source == "default" else c.lang_source
  630 + c.lang_conflict = c.lang_conflict or conflict
  631 +
  632 + created_at = self._to_utc(getattr(row, "create_time", None))
  633 + is_7d = bool(created_at and created_at >= since_7d)
  634 + c.add_query_log(is_7d=is_7d)
  635 +
  636 + return key_to_candidate
  637 +
  638 + def _candidate_to_doc(self, tenant_id: str, c: SuggestionCandidate, now_iso: str) -> Dict[str, Any]:
  639 + rank_score = self._compute_rank_score_from_candidate(c)
  640 + completion_obj = {c.lang: {"input": [c.text], "weight": int(max(rank_score, 1.0) * 100)}}
  641 + sat_obj = {c.lang: c.text}
  642 + return {
  643 + "_id": f"{tenant_id}|{c.lang}|{c.text_norm}",
  644 + "tenant_id": str(tenant_id),
  645 + "lang": c.lang,
  646 + "text": c.text,
  647 + "text_norm": c.text_norm,
  648 + "sources": sorted(c.sources),
  649 + "title_doc_count": len(c.title_spu_ids),
  650 + "qanchor_doc_count": len(c.qanchor_spu_ids),
  651 + "tag_doc_count": len(c.tag_spu_ids),
  652 + "query_count_7d": c.query_count_7d,
  653 + "query_count_30d": c.query_count_30d,
  654 + "rank_score": float(rank_score),
  655 + "lang_confidence": float(c.lang_confidence),
  656 + "lang_source": c.lang_source,
  657 + "lang_conflict": bool(c.lang_conflict),
  658 + "status": 1,
  659 + "updated_at": now_iso,
  660 + "completion": completion_obj,
  661 + "sat": sat_obj,
  662 + }
  663 +
  664 + def rebuild_tenant_index(
  665 + self,
  666 + tenant_id: str,
  667 + days: int = 365,
  668 + batch_size: int = 500,
  669 + min_query_len: int = 1,
  670 + publish_alias: bool = True,
  671 + keep_versions: int = 2,
  672 + ) -> Dict[str, Any]:
  673 + """
  674 + Full rebuild.
  675 +
  676 + Phase 2 default behavior:
  677 + - write to a fresh versioned index
  678 + - atomically publish the tenant alias
  679 + """
  680 + tenant_loader = get_tenant_config_loader()
  681 + tenant_cfg = tenant_loader.get_tenant_config(tenant_id)
  682 + index_languages: List[str] = tenant_cfg.get("index_languages") or ["en", "zh"]
  683 + primary_language: str = tenant_cfg.get("primary_language") or "en"
  684 +
  685 + alias_publish: Optional[Dict[str, Any]] = None
  686 + index_name: Optional[str] = None
  687 + try:
  688 + mapping = build_suggestion_mapping(index_languages=index_languages)
  689 + index_name = self._create_fresh_versioned_index(
  690 + tenant_id=tenant_id,
  691 + mapping=mapping,
  692 + )
  693 + self._ensure_new_index_ready(index_name)
  694 +
  695 + key_to_candidate = self._build_full_candidates(
  696 + tenant_id=tenant_id,
  697 + index_languages=index_languages,
  698 + primary_language=primary_language,
  699 + days=days,
  700 + batch_size=batch_size,
  701 + min_query_len=min_query_len,
  702 + )
  703 +
  704 + now_iso = datetime.now(timezone.utc).isoformat()
  705 + docs = [self._candidate_to_doc(tenant_id, c, now_iso) for c in key_to_candidate.values()]
  706 +
  707 + if docs:
  708 + bulk_result = self.es_client.bulk_index(index_name=index_name, docs=docs)
  709 + self.es_client.refresh(index_name)
  710 + else:
  711 + bulk_result = {"success": 0, "failed": 0, "errors": []}
  712 +
  713 + if publish_alias:
  714 + alias_publish = self._publish_alias(
  715 + tenant_id=tenant_id,
  716 + index_name=index_name,
  717 + keep_versions=keep_versions,
  718 + )
  719 +
  720 + now_utc = datetime.now(timezone.utc).isoformat()
  721 + meta_patch: Dict[str, Any] = {
  722 + "last_full_build_at": now_utc,
  723 + "last_incremental_watermark": now_utc,
  724 + }
  725 + if publish_alias:
  726 + meta_patch["active_index"] = index_name
  727 + meta_patch["active_alias"] = get_suggestion_alias_name(tenant_id)
  728 + self._upsert_meta(tenant_id, meta_patch)
  729 +
  730 + return {
  731 + "mode": "full",
  732 + "tenant_id": str(tenant_id),
  733 + "index_name": index_name,
  734 + "alias_published": bool(alias_publish),
  735 + "alias_publish": alias_publish,
  736 + "total_candidates": len(key_to_candidate),
  737 + "indexed_docs": len(docs),
  738 + "bulk_result": bulk_result,
  739 + }
  740 + except Exception:
  741 + if index_name and not alias_publish:
  742 + self.es_client.delete_index(index_name)
  743 + raise
  744 +
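`_publish_alias` itself is outside this hunk; assuming it follows the standard Elasticsearch atomic alias-swap pattern (a single `indices.update_aliases` call carrying paired remove/add actions), the request body could be assembled as below. The function name is ours, purely for illustration:

```python
from typing import Any, Dict, List

def build_alias_swap_actions(alias: str, new_index: str, current_indices: List[str]) -> Dict[str, Any]:
    # All actions in one update_aliases request are applied atomically,
    # so readers never observe the alias pointing at zero (or two) indices.
    actions: List[Dict[str, Any]] = [
        {"remove": {"index": idx, "alias": alias}}
        for idx in current_indices
        if idx != new_index
    ]
    actions.append({"add": {"index": new_index, "alias": alias}})
    return {"actions": actions}
```

Pairing the swap with `keep_versions` pruning afterwards (as `rebuild_tenant_index` does) keeps one or two previous versions around for fast rollback.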
  745 + def _build_incremental_deltas(
  746 + self,
  747 + tenant_id: str,
  748 + index_languages: List[str],
  749 + primary_language: str,
  750 + since: datetime,
  751 + until: datetime,
  752 + min_query_len: int,
  753 + ) -> Dict[Tuple[str, str], QueryDelta]:
  754 + now = datetime.now(timezone.utc)
  755 + since_7d = now - timedelta(days=7)
  756 + deltas: Dict[Tuple[str, str], QueryDelta] = {}
  757 +
  758 + for row in self._iter_query_log_rows(tenant_id=tenant_id, since=since, until=until):
  759 + q = str(row.query or "").strip()
  760 + if len(q) < min_query_len:
  761 + continue
  762 +
  763 + lang, conf, source, conflict = self._resolve_query_language(
  764 + query=q,
  765 + log_language=getattr(row, "language", None),
  766 + request_params=getattr(row, "request_params", None),
  767 + index_languages=index_languages,
  768 + primary_language=primary_language,
  769 + )
  770 + text_norm = self._normalize_text(q)
  771 + if self._looks_noise(text_norm):
  772 + continue
  773 +
  774 + key = (lang, text_norm)
  775 + item = deltas.get(key)
  776 + if item is None:
  777 + item = QueryDelta(
  778 + tenant_id=str(tenant_id),
  779 + lang=lang,
  780 + text=q,
  781 + text_norm=text_norm,
  782 + lang_confidence=conf,
  783 + lang_source=source,
  784 + lang_conflict=conflict,
  785 + )
  786 + deltas[key] = item
  787 +
  788 + created_at = self._to_utc(getattr(row, "create_time", None))
  789 + item.delta_30d += 1
  790 + if created_at and created_at >= since_7d:
  791 + item.delta_7d += 1
  792 +
  793 + if conf > item.lang_confidence:
  794 + item.lang_confidence = conf
  795 + item.lang_source = source
  796 + item.lang_conflict = item.lang_conflict or conflict
  797 +
  798 + return deltas
  799 +
  800 + def _delta_to_upsert_doc(self, delta: QueryDelta, now_iso: str) -> Dict[str, Any]:
  801 + rank_score = self._compute_rank_score(
  802 + query_count_30d=delta.delta_30d,
  803 + query_count_7d=delta.delta_7d,
  804 + qanchor_doc_count=0,
  805 + title_doc_count=0,
  806 + tag_doc_count=0,
  807 + )
  808 + return {
  809 + "tenant_id": delta.tenant_id,
  810 + "lang": delta.lang,
  811 + "text": delta.text,
  812 + "text_norm": delta.text_norm,
  813 + "sources": ["query_log"],
  814 + "title_doc_count": 0,
  815 + "qanchor_doc_count": 0,
  816 + "tag_doc_count": 0,
  817 + "query_count_7d": delta.delta_7d,
  818 + "query_count_30d": delta.delta_30d,
  819 + "rank_score": float(rank_score),
  820 + "lang_confidence": float(delta.lang_confidence),
  821 + "lang_source": delta.lang_source,
  822 + "lang_conflict": bool(delta.lang_conflict),
  823 + "status": 1,
  824 + "updated_at": now_iso,
  825 + "completion": {
  826 + delta.lang: {
  827 + "input": [delta.text],
  828 + "weight": int(max(rank_score, 1.0) * 100),
  829 + }
  830 + },
  831 + "sat": {delta.lang: delta.text},
  832 + }
  833 +
  834 + @staticmethod
  835 + def _build_incremental_update_script() -> str:
  836 + return """
  837 + if (ctx._source == null || ctx._source.isEmpty()) {
  838 + ctx._source = params.upsert;
  839 + return;
  840 + }
  841 +
  842 + if (ctx._source.query_count_30d == null) { ctx._source.query_count_30d = 0; }
  843 + if (ctx._source.query_count_7d == null) { ctx._source.query_count_7d = 0; }
  844 + if (ctx._source.qanchor_doc_count == null) { ctx._source.qanchor_doc_count = 0; }
  845 + if (ctx._source.title_doc_count == null) { ctx._source.title_doc_count = 0; }
  846 + if (ctx._source.tag_doc_count == null) { ctx._source.tag_doc_count = 0; }
  847 +
  848 + ctx._source.query_count_30d += params.delta_30d;
  849 + ctx._source.query_count_7d += params.delta_7d;
  850 +
  851 + if (ctx._source.sources == null) { ctx._source.sources = new ArrayList(); }
  852 + if (!ctx._source.sources.contains('query_log')) { ctx._source.sources.add('query_log'); }
  853 +
  854 + if (ctx._source.lang_conflict == null) { ctx._source.lang_conflict = false; }
  855 + ctx._source.lang_conflict = ctx._source.lang_conflict || params.lang_conflict;
  856 +
  857 + if (ctx._source.lang_confidence == null || params.lang_confidence > ctx._source.lang_confidence) {
  858 + ctx._source.lang_confidence = params.lang_confidence;
  859 + ctx._source.lang_source = params.lang_source;
  860 + }
  861 +
  862 + int q30 = ctx._source.query_count_30d;
  863 + int q7 = ctx._source.query_count_7d;
  864 + int qa = ctx._source.qanchor_doc_count;
  865 + int td = ctx._source.title_doc_count;
  866 + int tg = ctx._source.tag_doc_count;
  867 +
  868 + double score = 1.8 * Math.log(1 + q30)
  869 + + 1.2 * Math.log(1 + q7)
  870 + + 1.0 * Math.log(1 + qa)
  871 + + 0.85 * Math.log(1 + tg)
  872 + + 0.6 * Math.log(1 + td);
  873 + ctx._source.rank_score = score;
  874 + ctx._source.status = 1;
  875 + ctx._source.updated_at = params.now_iso;
  876 + ctx._source.text = params.text;
  877 + ctx._source.lang = params.lang;
  878 + ctx._source.text_norm = params.text_norm;
  879 +
  880 + if (ctx._source.completion == null) { ctx._source.completion = new HashMap(); }
  881 + Map c = new HashMap();
  882 + c.put('input', params.completion_input);
  883 + c.put('weight', params.completion_weight);
  884 + ctx._source.completion.put(params.lang, c);
  885 +
  886 + if (ctx._source.sat == null) { ctx._source.sat = new HashMap(); }
  887 + ctx._source.sat.put(params.lang, params.text);
  888 + """
  889 +
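For reference, the rank score that the Painless script above recomputes on every upsert can be mirrored in plain Python. `compute_rank_score` and `completion_weight` below are illustrative names, not part of the module:

```python
import math

def compute_rank_score(q30: int, q7: int, qanchor: int = 0, title: int = 0, tag: int = 0) -> float:
    # Same weights as the Painless update script:
    # 1.8*ln(1+q30) + 1.2*ln(1+q7) + 1.0*ln(1+qanchor) + 0.85*ln(1+tag) + 0.6*ln(1+title)
    return (
        1.8 * math.log(1 + q30)
        + 1.2 * math.log(1 + q7)
        + 1.0 * math.log(1 + qanchor)
        + 0.85 * math.log(1 + tag)
        + 0.6 * math.log(1 + title)
    )

def completion_weight(rank_score: float) -> int:
    # Completion suggester weight: floor at 1.0, then scale to an int,
    # matching int(max(rank_score, 1.0) * 100) in _candidate_to_doc.
    return int(max(rank_score, 1.0) * 100)
```

Because every signal passes through log(1 + x), moving from 0 to 1 recent queries shifts the score far more than moving from 100 to 101, which keeps head terms from drowning out the long tail.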
  890 + def _build_incremental_actions(self, target_index: str, deltas: Dict[Tuple[str, str], QueryDelta]) -> List[Dict[str, Any]]:
  891 + now_iso = datetime.now(timezone.utc).isoformat()
  892 + script_source = self._build_incremental_update_script()
  893 + actions: List[Dict[str, Any]] = []
  894 +
  895 + for delta in deltas.values():
  896 + upsert_doc = self._delta_to_upsert_doc(delta=delta, now_iso=now_iso)
  897 + upsert_rank = float(upsert_doc.get("rank_score") or 0.0)
  898 + action = {
  899 + "_op_type": "update",
  900 + "_index": target_index,
  901 + "_id": f"{delta.tenant_id}|{delta.lang}|{delta.text_norm}",
  902 + "scripted_upsert": True,
  903 + "script": {
  904 + "lang": "painless",
  905 + "source": script_source,
  906 + "params": {
  907 + "delta_30d": int(delta.delta_30d),
  908 + "delta_7d": int(delta.delta_7d),
  909 + "lang_confidence": float(delta.lang_confidence),
  910 + "lang_source": delta.lang_source,
  911 + "lang_conflict": bool(delta.lang_conflict),
  912 + "now_iso": now_iso,
  913 + "lang": delta.lang,
  914 + "text": delta.text,
  915 + "text_norm": delta.text_norm,
  916 + "completion_input": [delta.text],
  917 + "completion_weight": int(max(upsert_rank, 1.0) * 100),
  918 + "upsert": upsert_doc,
  919 + },
  920 + },
  921 + "upsert": upsert_doc,
  922 + }
  923 + actions.append(action)
  924 +
  925 + return actions
  926 +
  927 + def incremental_update_tenant_index(
  928 + self,
  929 + tenant_id: str,
  930 + min_query_len: int = 1,
  931 + fallback_days: int = 7,
  932 + overlap_minutes: int = 30,
  933 + bootstrap_if_missing: bool = True,
  934 + bootstrap_days: int = 30,
  935 + batch_size: int = 500,
  936 + ) -> Dict[str, Any]:
  937 + tenant_loader = get_tenant_config_loader()
  938 + tenant_cfg = tenant_loader.get_tenant_config(tenant_id)
  939 + index_languages: List[str] = tenant_cfg.get("index_languages") or ["en", "zh"]
  940 + primary_language: str = tenant_cfg.get("primary_language") or "en"
  941 +
  942 + target_index = self._resolve_incremental_target_index(tenant_id)
  943 + if not target_index:
  944 + if not bootstrap_if_missing:
  945 + raise RuntimeError(
  946 + f"No active suggestion index for tenant={tenant_id}. "
  947 + "Run full rebuild first or enable bootstrap_if_missing."
  948 + )
  949 + full_result = self.rebuild_tenant_index(
  950 + tenant_id=tenant_id,
  951 + days=bootstrap_days,
  952 + batch_size=batch_size,
  953 + min_query_len=min_query_len,
  954 + publish_alias=True
  955 + )
  956 + return {
  957 + "mode": "incremental",
  958 + "tenant_id": str(tenant_id),
  959 + "bootstrapped": True,
  960 + "bootstrap_result": full_result,
  961 + }
  962 +
  963 + meta = self._get_meta(tenant_id)
  964 + watermark_raw = meta.get("last_incremental_watermark") or meta.get("last_full_build_at")
  965 + now = datetime.now(timezone.utc)
  966 + default_since = now - timedelta(days=fallback_days)
  967 + since = None
  968 + if isinstance(watermark_raw, str) and watermark_raw.strip():
  969 + try:
  970 + since = self._to_utc(datetime.fromisoformat(watermark_raw.replace("Z", "+00:00")))
  971 + except Exception:
  972 + since = None
  973 + if since is None:
  974 + since = default_since
  975 + since = since - timedelta(minutes=max(overlap_minutes, 0))
  976 + if since < default_since:
  977 + since = default_since
  978 +
  979 + deltas = self._build_incremental_deltas(
  980 + tenant_id=tenant_id,
  981 + index_languages=index_languages,
  982 + primary_language=primary_language,
  983 + since=since,
  984 + until=now,
  985 + min_query_len=min_query_len,
  986 + )
  987 +
  988 + actions = self._build_incremental_actions(target_index=target_index, deltas=deltas)
  989 + bulk_result = self.es_client.bulk_actions(actions)
  990 + self.es_client.refresh(target_index)
  991 +
  992 + now_iso = now.isoformat()
  993 + self._upsert_meta(
  994 + tenant_id,
  995 + {
  996 + "last_incremental_build_at": now_iso,
  997 + "last_incremental_watermark": now_iso,
  998 + "active_index": target_index,
  999 + "active_alias": get_suggestion_alias_name(tenant_id),
  1000 + },
  1001 + )
  1002 +
  1003 + return {
  1004 + "mode": "incremental",
  1005 + "tenant_id": str(tenant_id),
  1006 + "target_index": target_index,
  1007 + "query_window": {
  1008 + "since": since.isoformat(),
  1009 + "until": now_iso,
  1010 + "overlap_minutes": int(overlap_minutes),
  1011 + },
  1012 + "updated_terms": len(deltas),
  1013 + "bulk_result": bulk_result,
  1014 + }
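The watermark handling in `incremental_update_tenant_index` can be isolated into a small pure function for illustration (the name `resolve_incremental_since` is ours, not the module's): parse the stored ISO watermark, widen the window by an overlap to catch late-arriving rows, and clamp so the scan never reaches further back than the fallback window.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def resolve_incremental_since(
    watermark_raw: Optional[str],
    now: datetime,
    fallback_days: int = 7,
    overlap_minutes: int = 30,
) -> datetime:
    default_since = now - timedelta(days=fallback_days)
    since: Optional[datetime] = None
    if isinstance(watermark_raw, str) and watermark_raw.strip():
        try:
            # fromisoformat() rejects a trailing "Z" on older Pythons
            since = datetime.fromisoformat(watermark_raw.replace("Z", "+00:00"))
        except ValueError:
            since = None
    if since is None:
        since = default_since
    # Re-read a small overlap to tolerate rows that arrived late.
    since -= timedelta(minutes=max(overlap_minutes, 0))
    return max(since, default_since)
```

Note the overlap means rows inside it are counted twice by the additive delta script; the window is kept deliberately small so the double-count stays negligible relative to the 7-day and 30-day totals.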
... ...
tests/ci/test_service_api_contracts.py
... ... @@ -345,8 +345,15 @@ def test_indexer_build_docs_from_db_contract(indexer_client: TestClient):
345 345 def test_indexer_enrich_content_contract(indexer_client: TestClient, monkeypatch):
346 346 import indexer.product_enrich as process_products
347 347  
348   - def _fake_build_index_content_fields(items: List[Dict[str, str]], tenant_id: str | None = None):
  348 + def _fake_build_index_content_fields(
  349 + items: List[Dict[str, str]],
  350 + tenant_id: str | None = None,
  351 + enrichment_scopes: List[str] | None = None,
  352 + category_taxonomy_profile: str = "apparel",
  353 + ):
349 354 assert tenant_id == "162"
  355 + assert enrichment_scopes == ["generic", "category_taxonomy"]
  356 + assert category_taxonomy_profile == "apparel"
350 357 return [
351 358 {
352 359 "id": p["spu_id"],
... ... @@ -358,6 +365,9 @@ def test_indexer_enrich_content_contract(indexer_client: TestClient, monkeypatch
358 365 "enriched_attributes": [
359 366 {"name": "enriched_tags", "value": {"zh": ["tag1"], "en": ["tag1"]}},
360 367 ],
  368 + "enriched_taxonomy_attributes": [
  369 + {"name": "Product Type", "value": {"zh": ["T恤"], "en": ["t-shirt"]}},
  370 + ],
361 371 }
362 372 for p in items
363 373 ]
... ... @@ -368,6 +378,8 @@ def test_indexer_enrich_content_contract(indexer_client: TestClient, monkeypatch
368 378 "/indexer/enrich-content",
369 379 json={
370 380 "tenant_id": "162",
  381 + "enrichment_scopes": ["generic", "category_taxonomy"],
  382 + "category_taxonomy_profile": "apparel",
371 383 "items": [
372 384 {"spu_id": "1001", "title": "T-shirt"},
373 385 {"spu_id": "1002", "title": "Toy"},
... ... @@ -377,6 +389,8 @@ def test_indexer_enrich_content_contract(indexer_client: TestClient, monkeypatch
377 389 assert response.status_code == 200
378 390 data = response.json()
379 391 assert data["tenant_id"] == "162"
  392 + assert data["enrichment_scopes"] == ["generic", "category_taxonomy"]
  393 + assert data["category_taxonomy_profile"] == "apparel"
380 394 assert data["total"] == 2
381 395 assert len(data["results"]) == 2
382 396 assert data["results"][0]["spu_id"] == "1001"
... ... @@ -388,6 +402,102 @@ def test_indexer_enrich_content_contract(indexer_client: TestClient, monkeypatch
388 402 "name": "enriched_tags",
389 403 "value": {"zh": ["tag1"], "en": ["tag1"]},
390 404 }
  405 + assert data["results"][0]["enriched_taxonomy_attributes"][0] == {
  406 + "name": "Product Type",
  407 + "value": {"zh": ["T恤"], "en": ["t-shirt"]},
  408 + }
  409 +
  410 +
  411 +def test_indexer_enrich_content_contract_accepts_deprecated_analysis_kinds(indexer_client: TestClient, monkeypatch):
  412 + import indexer.product_enrich as process_products
  413 +
  414 + seen: Dict[str, Any] = {}
  415 +
  416 + def _fake_build_index_content_fields(
  417 + items: List[Dict[str, str]],
  418 + tenant_id: str | None = None,
  419 + enrichment_scopes: List[str] | None = None,
  420 + category_taxonomy_profile: str = "apparel",
  421 + ):
  422 + seen["tenant_id"] = tenant_id
  423 + seen["enrichment_scopes"] = enrichment_scopes
  424 + seen["category_taxonomy_profile"] = category_taxonomy_profile
  425 + return [
  426 + {
  427 + "id": items[0]["spu_id"],
  428 + "qanchors": {},
  429 + "enriched_tags": {},
  430 + "enriched_attributes": [],
  431 + "enriched_taxonomy_attributes": [],
  432 + }
  433 + ]
  434 +
  435 + monkeypatch.setattr(process_products, "build_index_content_fields", _fake_build_index_content_fields)
  436 +
  437 + response = indexer_client.post(
  438 + "/indexer/enrich-content",
  439 + json={
  440 + "tenant_id": "162",
  441 + "analysis_kinds": ["taxonomy"],
  442 + "items": [{"spu_id": "1001", "title": "T-shirt"}],
  443 + },
  444 + )
  445 +
  446 + assert response.status_code == 200
  447 + data = response.json()
  448 + assert seen == {
  449 + "tenant_id": "162",
  450 + "enrichment_scopes": ["category_taxonomy"],
  451 + "category_taxonomy_profile": "apparel",
  452 + }
  453 + assert data["enrichment_scopes"] == ["category_taxonomy"]
  454 + assert data["category_taxonomy_profile"] == "apparel"
  455 +
  456 +
  457 +def test_indexer_enrich_content_contract_supports_non_apparel_taxonomy_profiles(indexer_client: TestClient, monkeypatch):
  458 + import indexer.product_enrich as process_products
  459 +
  460 + def _fake_build_index_content_fields(
  461 + items: List[Dict[str, str]],
  462 + tenant_id: str | None = None,
  463 + enrichment_scopes: List[str] | None = None,
  464 + category_taxonomy_profile: str = "apparel",
  465 + ):
  466 + assert tenant_id == "162"
  467 + assert enrichment_scopes == ["category_taxonomy"]
  468 + assert category_taxonomy_profile == "toys"
  469 + return [
  470 + {
  471 + "id": items[0]["spu_id"],
  472 + "qanchors": {},
  473 + "enriched_tags": {},
  474 + "enriched_attributes": [],
  475 + "enriched_taxonomy_attributes": [
  476 + {"name": "Product Type", "value": {"en": ["doll set"]}},
  477 + {"name": "Age Group", "value": {"en": ["kids"]}},
  478 + ],
  479 + }
  480 + ]
  481 +
  482 + monkeypatch.setattr(process_products, "build_index_content_fields", _fake_build_index_content_fields)
  483 +
  484 + response = indexer_client.post(
  485 + "/indexer/enrich-content",
  486 + json={
  487 + "tenant_id": "162",
  488 + "enrichment_scopes": ["category_taxonomy"],
  489 + "category_taxonomy_profile": "toys",
  490 + "items": [{"spu_id": "1001", "title": "Toy"}],
  491 + },
  492 + )
  493 +
  494 + assert response.status_code == 200
  495 + data = response.json()
  496 + assert data["category_taxonomy_profile"] == "toys"
  497 + assert data["results"][0]["enriched_taxonomy_attributes"] == [
  498 + {"name": "Product Type", "value": {"en": ["doll set"]}},
  499 + {"name": "Age Group", "value": {"en": ["kids"]}},
  500 + ]
391 501  
392 502  
393 503 def test_indexer_documents_contract(indexer_client: TestClient):
... ...
tests/manual/README.md 0 → 100644
... ... @@ -0,0 +1,5 @@
  1 +# Manual Tests
  2 +
  3 +`tests/manual/` holds trial-run scripts that require manually started dependency services, manual observation of results, or a real external environment.
  4 +
  5 +These scripts are outside the automatic `pytest` regression scope and should not be conflated with the contract tests under `tests/ci`.
... ...
scripts/test_build_docs_api.py renamed to tests/manual/test_build_docs_api.py
... ... @@ -4,9 +4,9 @@
4 4  
5 5 Usage:
6 6 1. Start the Indexer service first: ./scripts/start_indexer.sh (or: uvicorn api.indexer_app:app --port 6004)
7   - 2. Run: python scripts/test_build_docs_api.py
  7 + 2. Run: python tests/manual/test_build_docs_api.py
8 8 
9   - The target address can also be specified: INDEXER_URL=http://localhost:6004 python scripts/test_build_docs_api.py
  9 + The target address can also be specified: INDEXER_URL=http://localhost:6004 python tests/manual/test_build_docs_api.py
10 10 """
11 11  
12 12 import json
... ... @@ -15,7 +15,7 @@ import sys
15 15 from datetime import datetime, timezone
16 16  
17 17 # Project root directory
18   -ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
  18 +ROOT = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
19 19 sys.path.insert(0, ROOT)
20 20  
21 21 # Call the real service via requests by default; fall back to TestClient if requests is not installed
... ... @@ -122,7 +122,7 @@ def main():
122 122 print("\n[错误] 无法连接 Indexer 服务:", e)
123 123 print("请先启动: ./scripts/start_indexer.sh 或 uvicorn api.indexer_app:app --port 6004")
124 124 if HAS_REQUESTS:
125   - print("或使用进程内测试: USE_TEST_CLIENT=1 python scripts/test_build_docs_api.py")
  125 + print("或使用进程内测试: USE_TEST_CLIENT=1 python tests/manual/test_build_docs_api.py")
126 126 sys.exit(1)
127 127 else:
128 128 if not use_http and not HAS_REQUESTS:
... ...
tests/test_embedding_pipeline.py
  1 +from dataclasses import asdict
1 2 from typing import Any, Dict, List, Optional
2 3  
3 4 import numpy as np
... ...
tests/test_es_query_builder.py
... ... @@ -208,3 +208,36 @@ def test_image_knn_clause_is_added_alongside_base_translation_and_text_knn():
208 208 assert image_knn["path"] == "image_embedding"
209 209 assert image_knn["score_mode"] == "max"
210 210 assert image_knn["query"]["knn"]["field"] == "image_embedding.vector"
  211 +
  212 +
  213 +def test_text_knn_plan_is_reused_for_ann_and_exact_rescore():
  214 + qb = _builder()
  215 + parsed_query = SimpleNamespace(query_tokens=["a", "b", "c", "d", "e"])
  216 +
  217 + ann_clause = qb.build_text_knn_clause(
  218 + np.array([0.1, 0.2, 0.3]),
  219 + parsed_query=parsed_query,
  220 + )
  221 + exact_clause = qb.build_exact_text_knn_rescore_clause(
  222 + np.array([0.1, 0.2, 0.3]),
  223 + parsed_query=parsed_query,
  224 + )
  225 +
  226 + assert ann_clause is not None
  227 + assert exact_clause is not None
  228 + assert ann_clause["knn"]["k"] == qb.knn_text_k_long
  229 + assert ann_clause["knn"]["num_candidates"] == qb.knn_text_num_candidates_long
  230 + assert ann_clause["knn"]["boost"] == qb.knn_text_boost * 1.4
  231 + assert exact_clause["script_score"]["script"]["params"]["boost"] == qb.knn_text_boost * 1.4
  232 +
  233 +
  234 +def test_image_knn_plan_is_reused_for_ann_and_exact_rescore():
  235 + qb = _builder()
  236 +
  237 + ann_clause = qb.build_image_knn_clause(np.array([0.4, 0.5, 0.6]))
  238 + exact_clause = qb.build_exact_image_knn_rescore_clause(np.array([0.4, 0.5, 0.6]))
  239 +
  240 + assert ann_clause is not None
  241 + assert exact_clause is not None
  242 + assert ann_clause["nested"]["query"]["knn"]["boost"] == qb.knn_image_boost
  243 + assert exact_clause["nested"]["query"]["script_score"]["script"]["params"]["boost"] == qb.knn_image_boost
... ...
tests/test_llm_enrichment_batch_fill.py
... ... @@ -10,8 +10,14 @@ from indexer.document_transformer import SPUDocumentTransformer
10 10 def test_fill_llm_attributes_batch_uses_product_enrich_helper(monkeypatch):
11 11 seen_calls: List[Dict[str, Any]] = []
12 12  
13   - def _fake_build_index_content_fields(items, tenant_id=None):
14   - seen_calls.append({"n": len(items), "tenant_id": tenant_id})
  13 + def _fake_build_index_content_fields(items, tenant_id=None, category_taxonomy_profile=None):
  14 + seen_calls.append(
  15 + {
  16 + "n": len(items),
  17 + "tenant_id": tenant_id,
  18 + "category_taxonomy_profile": category_taxonomy_profile,
  19 + }
  20 + )
15 21 return [
16 22 {
17 23 "id": item["id"],
... ... @@ -19,10 +25,13 @@ def test_fill_llm_attributes_batch_uses_product_enrich_helper(monkeypatch):
19 25 "zh": [f"zh-anchor-{item['id']}"],
20 26 "en": [f"en-anchor-{item['id']}"],
21 27 },
22   - "tags": {"zh": ["t1", "t2"], "en": ["t1", "t2"]},
  28 + "enriched_tags": {"zh": ["t1", "t2"], "en": ["t1", "t2"]},
23 29 "enriched_attributes": [
24 30 {"name": "tags", "value": {"zh": ["t1"], "en": ["t1"]}},
25 31 ],
  32 + "enriched_taxonomy_attributes": [
  33 + {"name": "Product Type", "value": {"zh": ["连衣裙"], "en": ["dress"]}},
  34 + ],
26 35 }
27 36 for item in items
28 37 ]
... ... @@ -50,10 +59,14 @@ def test_fill_llm_attributes_batch_uses_product_enrich_helper(monkeypatch):
50 59  
51 60 transformer.fill_llm_attributes_batch(docs, rows)
52 61  
53   - assert seen_calls == [{"n": 45, "tenant_id": "162"}]
  62 + assert seen_calls == [{"n": 45, "tenant_id": "162", "category_taxonomy_profile": "apparel"}]
54 63  
55 64 assert docs[0]["qanchors"]["zh"] == ["zh-anchor-0"]
56 65 assert docs[0]["qanchors"]["en"] == ["en-anchor-0"]
57   - assert docs[0]["tags"]["zh"] == ["t1", "t2"]
58   - assert docs[0]["tags"]["en"] == ["t1", "t2"]
  66 + assert docs[0]["enriched_tags"]["zh"] == ["t1", "t2"]
  67 + assert docs[0]["enriched_tags"]["en"] == ["t1", "t2"]
59 68 assert {"name": "tags", "value": {"zh": ["t1"], "en": ["t1"]}} in docs[0]["enriched_attributes"]
  69 + assert {
  70 + "name": "Product Type",
  71 + "value": {"zh": ["连衣裙"], "en": ["dress"]},
  72 + } in docs[0]["enriched_taxonomy_attributes"]
... ...
tests/test_process_products_batching.py
... ... @@ -13,7 +13,15 @@ def test_analyze_products_caps_batch_size_to_20(monkeypatch):
13 13 monkeypatch.setattr(process_products, "API_KEY", "fake-key")
14 14 seen_batch_sizes: List[int] = []
15 15  
16   - def _fake_process_batch(batch_data: List[Dict[str, str]], batch_num: int, target_lang: str = "zh"):
  16 + def _fake_process_batch(
  17 + batch_data: List[Dict[str, str]],
  18 + batch_num: int,
  19 + target_lang: str = "zh",
  20 + analysis_kind: str = "content",
  21 + category_taxonomy_profile=None,
  22 + ):
  23 + assert analysis_kind == "content"
  24 + assert category_taxonomy_profile is None
17 25 seen_batch_sizes.append(len(batch_data))
18 26 return [
19 27 {
... ... @@ -35,7 +43,7 @@ def test_analyze_products_caps_batch_size_to_20(monkeypatch):
35 43 ]
36 44  
37 45 monkeypatch.setattr(process_products, "process_batch", _fake_process_batch)
38   - monkeypatch.setattr(process_products, "_set_cached_anchor_result", lambda *args, **kwargs: None)
  46 + monkeypatch.setattr(process_products, "_set_cached_analysis_result", lambda *args, **kwargs: None)
39 47  
40 48 out = process_products.analyze_products(
41 49 products=_mk_products(45),
... ... @@ -53,7 +61,15 @@ def test_analyze_products_uses_min_batch_size_1(monkeypatch):
53 61 monkeypatch.setattr(process_products, "API_KEY", "fake-key")
54 62 seen_batch_sizes: List[int] = []
55 63  
56   - def _fake_process_batch(batch_data: List[Dict[str, str]], batch_num: int, target_lang: str = "zh"):
  64 + def _fake_process_batch(
  65 + batch_data: List[Dict[str, str]],
  66 + batch_num: int,
  67 + target_lang: str = "zh",
  68 + analysis_kind: str = "content",
  69 + category_taxonomy_profile=None,
  70 + ):
  71 + assert analysis_kind == "content"
  72 + assert category_taxonomy_profile is None
57 73 seen_batch_sizes.append(len(batch_data))
58 74 return [
59 75 {
... ... @@ -75,7 +91,7 @@ def test_analyze_products_uses_min_batch_size_1(monkeypatch):
75 91 ]
76 92  
77 93 monkeypatch.setattr(process_products, "process_batch", _fake_process_batch)
78   - monkeypatch.setattr(process_products, "_set_cached_anchor_result", lambda *args, **kwargs: None)
  94 + monkeypatch.setattr(process_products, "_set_cached_analysis_result", lambda *args, **kwargs: None)
79 95  
80 96 out = process_products.analyze_products(
81 97 products=_mk_products(3),
... ...
tests/test_product_enrich_partial_mode.py
... ... @@ -74,6 +74,28 @@ def test_create_prompt_splits_shared_context_and_localized_tail():
74 74 assert prefix_en.startswith("| No. | Product title | Category path |")
75 75  
76 76  
  77 +def test_create_prompt_supports_taxonomy_analysis_kind():
  78 + products = [{"id": "1", "title": "linen dress"}]
  79 +
  80 + shared_zh, user_zh, prefix_zh = product_enrich.create_prompt(
  81 + products,
  82 + target_lang="zh",
  83 + analysis_kind="taxonomy",
  84 + )
  85 + shared_fr, user_fr, prefix_fr = product_enrich.create_prompt(
  86 + products,
  87 + target_lang="fr",
  88 + analysis_kind="taxonomy",
  89 + )
  90 +
  91 + assert "apparel attribute taxonomy" in shared_zh
  92 + assert "1. linen dress" in shared_zh
  93 + assert "Language: Chinese" in user_zh
  94 + assert "Language: French" in user_fr
  95 + assert prefix_zh.startswith("| 序号 | 品类 | 目标性别 |")
  96 + assert prefix_fr.startswith("| No. | Product Type | Target Gender |")
  97 +
  98 +
77 99 def test_call_llm_logs_shared_context_once_and_verbose_contains_full_requests():
78 100 payloads = []
79 101 response_bodies = [
... ... @@ -228,6 +250,38 @@ def test_process_batch_reads_result_and_validates_expected_fields():
228 250 assert row["anchor_text"] == "法式收腰连衣裙"
229 251  
230 252  
  253 +def test_process_batch_reads_taxonomy_result_with_schema_specific_fields():
  254 + merged_markdown = """| 序号 | 品类 | 目标性别 | 年龄段 | 适用季节 | 版型 | 廓形 | 领型 | 袖长类型 | 袖型 | 肩带设计 | 腰型 | 裤型 | 裙型 | 长度类型 | 闭合方式 | 设计细节 | 面料 | 成分 | 面料特性 | 服装特征 | 功能 | 主颜色 | 色系 | 印花 / 图案 | 适用场景 | 风格 |
  255 +|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
  256 +| 1 | 连衣裙 | 女 | 成人 | 春季,夏季 | 修身 | A字 | V领 | 无袖 | | 细肩带 | 高腰 | | A字裙 | 中长款 | 拉链 | 褶皱 | 梭织 | 聚酯纤维,氨纶 | 轻薄,透气 | 有内衬 | 易打理 | 酒红色 | 红色 | 纯色 | 约会,度假 | 浪漫 |
  257 +"""
  258 +
  259 + with mock.patch.object(
  260 + product_enrich,
  261 + "call_llm",
  262 + return_value=(merged_markdown, json.dumps({"choices": [{"message": {"content": "stub"}}]})),
  263 + ):
  264 + results = product_enrich.process_batch(
  265 + [{"id": "sku-1", "title": "dress"}],
  266 + batch_num=1,
  267 + target_lang="zh",
  268 + analysis_kind="taxonomy",
  269 + )
  270 +
  271 + assert len(results) == 1
  272 + row = results[0]
  273 + assert row["id"] == "sku-1"
  274 + assert row["lang"] == "zh"
  275 + assert row["title_input"] == "dress"
  276 + assert row["product_type"] == "连衣裙"
  277 + assert row["target_gender"] == "女"
  278 + assert row["age_group"] == "成人"
  279 + assert row["sleeve_length_type"] == "无袖"
  280 + assert row["material_composition"] == "聚酯纤维,氨纶"
  281 + assert row["occasion_end_use"] == "约会,度假"
  282 + assert row["style_aesthetic"] == "浪漫"
  283 +
  284 +
231 285 def test_analyze_products_uses_product_level_cache_across_batch_requests():
232 286 cache_store = {}
233 287 process_calls = []
... ... @@ -241,13 +295,36 @@ def test_analyze_products_uses_product_level_cache_across_batch_requests():
241 295 product.get("image_url", ""),
242 296 )
243 297  
244   - def fake_get_cached_anchor_result(product, target_lang):
  298 + def fake_get_cached_analysis_result(
  299 + product,
  300 + target_lang,
  301 + analysis_kind="content",
  302 + category_taxonomy_profile=None,
  303 + ):
  304 + assert analysis_kind == "content"
  305 + assert category_taxonomy_profile is None
245 306 return cache_store.get(_cache_key(product, target_lang))
246 307  
247   - def fake_set_cached_anchor_result(product, target_lang, result):
  308 + def fake_set_cached_analysis_result(
  309 + product,
  310 + target_lang,
  311 + result,
  312 + analysis_kind="content",
  313 + category_taxonomy_profile=None,
  314 + ):
  315 + assert analysis_kind == "content"
  316 + assert category_taxonomy_profile is None
248 317 cache_store[_cache_key(product, target_lang)] = result
249 318  
250   - def fake_process_batch(batch_data, batch_num, target_lang="zh"):
  319 + def fake_process_batch(
  320 + batch_data,
  321 + batch_num,
  322 + target_lang="zh",
  323 + analysis_kind="content",
  324 + category_taxonomy_profile=None,
  325 + ):
  326 + assert analysis_kind == "content"
  327 + assert category_taxonomy_profile is None
251 328 process_calls.append(
252 329 {
253 330 "batch_num": batch_num,
... ... @@ -281,12 +358,12 @@ def test_analyze_products_uses_product_level_cache_across_batch_requests():
281 358  
282 359 with mock.patch.object(product_enrich, "API_KEY", "fake-key"), mock.patch.object(
283 360 product_enrich,
284   - "_get_cached_anchor_result",
285   - side_effect=fake_get_cached_anchor_result,
  361 + "_get_cached_analysis_result",
  362 + side_effect=fake_get_cached_analysis_result,
286 363 ), mock.patch.object(
287 364 product_enrich,
288   - "_set_cached_anchor_result",
289   - side_effect=fake_set_cached_anchor_result,
  365 + "_set_cached_analysis_result",
  366 + side_effect=fake_set_cached_analysis_result,
290 367 ), mock.patch.object(
291 368 product_enrich,
292 369 "process_batch",
... ... @@ -342,11 +419,12 @@ def test_analyze_products_reuses_cached_content_with_current_product_identity():
342 419  
343 420 with mock.patch.object(product_enrich, "API_KEY", "fake-key"), mock.patch.object(
344 421 product_enrich,
345   - "_get_cached_anchor_result",
346   - wraps=lambda product, target_lang: product_enrich._normalize_analysis_result(
  422 + "_get_cached_analysis_result",
  423 + wraps=lambda product, target_lang, analysis_kind="content", category_taxonomy_profile=None: product_enrich._normalize_analysis_result(
347 424 cached_result,
348 425 product=product,
349 426 target_lang=target_lang,
  427 + schema=product_enrich._get_analysis_schema("content"),
350 428 ),
351 429 ), mock.patch.object(
352 430 product_enrich,
... ... @@ -379,7 +457,49 @@ def test_analyze_products_reuses_cached_content_with_current_product_identity():
379 457  
380 458  
381 459 def test_build_index_content_fields_maps_internal_tags_to_enriched_tags_output():
382   - def fake_analyze_products(products, target_lang="zh", batch_size=None, tenant_id=None):
  460 + def fake_analyze_products(
  461 + products,
  462 + target_lang="zh",
  463 + batch_size=None,
  464 + tenant_id=None,
  465 + analysis_kind="content",
  466 + category_taxonomy_profile=None,
  467 + ):
  468 + if analysis_kind == "taxonomy":
  469 + assert category_taxonomy_profile == "apparel"
  470 + return [
  471 + {
  472 + "id": products[0]["id"],
  473 + "lang": target_lang,
  474 + "title_input": products[0]["title"],
  475 + "product_type": f"{target_lang}-dress",
  476 + "target_gender": f"{target_lang}-women",
  477 + "age_group": "",
  478 + "season": f"{target_lang}-summer",
  479 + "fit": "",
  480 + "silhouette": "",
  481 + "neckline": "",
  482 + "sleeve_length_type": "",
  483 + "sleeve_style": "",
  484 + "strap_type": "",
  485 + "rise_waistline": "",
  486 + "leg_shape": "",
  487 + "skirt_shape": "",
  488 + "length_type": "",
  489 + "closure_type": "",
  490 + "design_details": "",
  491 + "fabric": "",
  492 + "material_composition": "",
  493 + "fabric_properties": "",
  494 + "clothing_features": "",
  495 + "functional_benefits": "",
  496 + "color": "",
  497 + "color_family": "",
  498 + "print_pattern": "",
  499 + "occasion_end_use": "",
  500 + "style_aesthetic": "",
  501 + }
  502 + ]
383 503 return [
384 504 {
385 505 "id": products[0]["id"],
... ... @@ -423,8 +543,103 @@ def test_build_index_content_fields_maps_internal_tags_to_enriched_tags_output()
423 543 },
424 544 {"name": "target_audience", "value": {"zh": ["zh-audience"], "en": ["en-audience"]}},
425 545 ],
  546 + "enriched_taxonomy_attributes": [
  547 + {
  548 + "name": "Product Type",
  549 + "value": {"zh": ["zh-dress"], "en": ["en-dress"]},
  550 + },
  551 + {
  552 + "name": "Target Gender",
  553 + "value": {"zh": ["zh-women"], "en": ["en-women"]},
  554 + },
  555 + {
  556 + "name": "Season",
  557 + "value": {"zh": ["zh-summer"], "en": ["en-summer"]},
  558 + },
  559 + ],
426 560 }
427 561 ]
  562 +def test_build_index_content_fields_non_apparel_taxonomy_returns_en_only():
  563 + seen_calls = []
  564 +
  565 + def fake_analyze_products(
  566 + products,
  567 + target_lang="zh",
  568 + batch_size=None,
  569 + tenant_id=None,
  570 + analysis_kind="content",
  571 + category_taxonomy_profile=None,
  572 + ):
  573 + seen_calls.append((analysis_kind, target_lang, category_taxonomy_profile, tuple(p["id"] for p in products)))
  574 + if analysis_kind == "taxonomy":
  575 + assert category_taxonomy_profile == "toys"
  576 + assert target_lang == "en"
  577 + return [
  578 + {
  579 + "id": products[0]["id"],
  580 + "lang": "en",
  581 + "title_input": products[0]["title"],
  582 + "product_type": "doll set",
  583 + "age_group": "kids",
  584 + "character_theme": "",
  585 + "material": "",
  586 + "power_source": "",
  587 + "interactive_features": "",
  588 + "educational_play_value": "",
  589 + "piece_count_size": "",
  590 + "color": "",
  591 + "use_scenario": "",
  592 + }
  593 + ]
  594 +
  595 + return [
  596 + {
  597 + "id": product["id"],
  598 + "lang": target_lang,
  599 + "title_input": product["title"],
  600 + "title": product["title"],
  601 + "category_path": "",
  602 + "tags": f"{target_lang}-tag",
  603 + "target_audience": "",
  604 + "usage_scene": "",
  605 + "season": "",
  606 + "key_attributes": "",
  607 + "material": "",
  608 + "features": "",
  609 + "anchor_text": f"{target_lang}-anchor",
  610 + }
  611 + for product in products
  612 + ]
  613 +
  614 + with mock.patch.object(product_enrich, "analyze_products", side_effect=fake_analyze_products):
  615 + result = product_enrich.build_index_content_fields(
  616 + items=[{"spu_id": "2", "title": "toy"}],
  617 + tenant_id="170",
  618 + category_taxonomy_profile="toys",
  619 + )
  620 +
  621 + assert result == [
  622 + {
  623 + "id": "2",
  624 + "qanchors": {"zh": ["zh-anchor"], "en": ["en-anchor"]},
  625 + "enriched_tags": {"zh": ["zh-tag"], "en": ["en-tag"]},
  626 + "enriched_attributes": [
  627 + {
  628 + "name": "enriched_tags",
  629 + "value": {
  630 + "zh": ["zh-tag"],
  631 + "en": ["en-tag"],
  632 + },
  633 + }
  634 + ],
  635 + "enriched_taxonomy_attributes": [
  636 + {"name": "Product Type", "value": {"en": ["doll set"]}},
  637 + {"name": "Age Group", "value": {"en": ["kids"]}},
  638 + ],
  639 + }
  640 + ]
  641 + assert ("taxonomy", "zh", "toys", ("2",)) not in seen_calls
  642 + assert ("taxonomy", "en", "toys", ("2",)) in seen_calls
428 643  
429 644  
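The apparel and non-apparel tests above pin down how taxonomy rows become `enriched_taxonomy_attributes`: internal snake_case fields turn into Title Case attribute names, empty values are dropped, and only the analyzed languages appear (en-only for non-apparel). A minimal sketch of that mapping — the helper name `taxonomy_rows_to_attributes` is hypothetical, reconstructed from the assertions, not the project's actual implementation:

```python
def taxonomy_rows_to_attributes(rows_by_lang, languages):
    # rows_by_lang: {"en": {"product_type": "doll set", "age_group": "kids", ...}}
    # Bookkeeping fields are never surfaced as attributes.
    skip = {"id", "lang", "title_input"}
    field_order = [k for k in rows_by_lang[languages[0]] if k not in skip]
    attributes = []
    for field in field_order:
        # Keep only languages where the model produced a non-empty value.
        value = {
            lang: [rows_by_lang[lang][field]]
            for lang in languages
            if rows_by_lang.get(lang, {}).get(field)
        }
        if value:
            # "product_type" -> "Product Type", matching the asserted names.
            attributes.append({"name": field.replace("_", " ").title(), "value": value})
    return attributes

rows = {"en": {"id": "2", "lang": "en", "title_input": "toy",
               "product_type": "doll set", "age_group": "kids", "color": ""}}
print(taxonomy_rows_to_attributes(rows, ["en"]))
# → [{'name': 'Product Type', 'value': {'en': ['doll set']}},
#    {'name': 'Age Group', 'value': {'en': ['kids']}}]
```

Under this reading, an all-empty field such as `color` simply never appears in the output, which is why the expected result in the non-apparel test lists only Product Type and Age Group.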
430 645 def test_anchor_cache_key_depends_on_product_input_not_identifiers():
... ... @@ -461,6 +676,40 @@ def test_anchor_cache_key_depends_on_product_input_not_identifiers():
461 676 assert key_a != key_c
462 677  
463 678  
  679 +def test_analysis_cache_key_isolated_by_analysis_kind():
  680 + product = {
  681 + "id": "1",
  682 + "title": "dress",
  683 + "brief": "soft cotton",
  684 + "description": "summer dress",
  685 + }
  686 +
  687 + content_key = product_enrich._make_analysis_cache_key(product, "zh", "content")
  688 + taxonomy_key = product_enrich._make_analysis_cache_key(product, "zh", "taxonomy")
  689 +
  690 + assert content_key != taxonomy_key
  691 +
  692 +
  693 +def test_analysis_cache_key_changes_when_prompt_contract_changes():
  694 + product = {
  695 + "id": "1",
  696 + "title": "dress",
  697 + "brief": "soft cotton",
  698 + "description": "summer dress",
  699 + }
  700 +
  701 + original_key = product_enrich._make_analysis_cache_key(product, "zh", "taxonomy")
  702 +
  703 + with mock.patch.object(
  704 + product_enrich,
  705 + "USER_INSTRUCTION_TEMPLATE",
  706 + "Please return JSON only. Language: {language}",
  707 + ):
  708 + changed_key = product_enrich._make_analysis_cache_key(product, "zh", "taxonomy")
  709 +
  710 + assert original_key != changed_key
  711 +
  712 +
464 713 def test_build_prompt_input_text_appends_brief_and_description_for_short_title():
465 714 product = {
466 715 "title": "T恤",
... ...
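The cache-key tests above require three properties of `_make_analysis_cache_key`: keys are isolated by `analysis_kind`, they change when the prompt contract (`USER_INSTRUCTION_TEMPLATE`) changes, and they depend on product input rather than identifiers. A standalone sketch of a derivation that satisfies all three — the hashed field set and the template text are assumptions, not the module's real code:

```python
import hashlib
import json

# Hypothetical stand-in for the prompt contract the real module folds into the key.
USER_INSTRUCTION_TEMPLATE = "Return JSON only. Language: {language}"

def make_analysis_cache_key(product, target_lang, analysis_kind):
    # Hash only model-visible inputs: product text (not ids), target language,
    # the analysis kind, and the prompt template. Changing any of these — and
    # nothing else — changes the key.
    payload = json.dumps(
        {
            "title": product.get("title", ""),
            "brief": product.get("brief", ""),
            "description": product.get("description", ""),
            "lang": target_lang,
            "kind": analysis_kind,
            "prompt": USER_INSTRUCTION_TEMPLATE,
        },
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

product = {"id": "1", "title": "dress", "brief": "soft cotton", "description": "summer dress"}
# Kind-isolated, as test_analysis_cache_key_isolated_by_analysis_kind requires:
assert make_analysis_cache_key(product, "zh", "content") != make_analysis_cache_key(product, "zh", "taxonomy")
# Identifier-independent, as the anchor-cache-key test requires:
assert make_analysis_cache_key(dict(product, id="999"), "zh", "content") == make_analysis_cache_key(product, "zh", "content")
```

Mocking the template constant, as the prompt-contract test does, naturally rotates every key derived this way.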
tests/test_rerank_client.py
1 1 from math import isclose
2 2  
3   -from config.schema import RerankFusionConfig
4   -from search.rerank_client import fuse_scores_and_resort, run_lightweight_rerank
  3 +from config.schema import CoarseRankFusionConfig, RerankFusionConfig
  4 +from search.rerank_client import coarse_resort_hits, fuse_scores_and_resort, run_lightweight_rerank
5 5  
6 6  
7 7 def test_fuse_scores_and_resort_aggregates_text_components_and_keeps_rerank_primary():
... ... @@ -172,6 +172,57 @@ def test_fuse_scores_and_resort_uses_max_of_text_and_image_knn_scores():
172 172 assert isclose(debug[0]["image_knn_score"], 0.7, rel_tol=1e-9)
173 173  
174 174  
  175 +def test_fuse_scores_and_resort_prefers_exact_knn_scores_over_ann_scores():
  176 + hits = [
  177 + {
  178 + "_id": "exact-mm-hit",
  179 + "_score": 1.0,
  180 + "matched_queries": {
  181 + "base_query": 1.5,
  182 + "knn_query": 0.2,
  183 + "image_knn_query": 0.7,
  184 + "exact_text_knn_query": 0.9,
  185 + "exact_image_knn_query": 0.1,
  186 + },
  187 + }
  188 + ]
  189 +
  190 + debug = fuse_scores_and_resort(hits, [0.8], debug=True)
  191 +
  192 + assert isclose(hits[0]["_knn_score"], 0.9, rel_tol=1e-9)
  193 + assert isclose(debug[0]["text_knn_score"], 0.9, rel_tol=1e-9)
  194 + assert isclose(debug[0]["image_knn_score"], 0.1, rel_tol=1e-9)
  195 + assert isclose(debug[0]["exact_text_knn_score"], 0.9, rel_tol=1e-9)
  196 + assert isclose(debug[0]["exact_image_knn_score"], 0.1, rel_tol=1e-9)
  197 + assert isclose(debug[0]["approx_text_knn_score"], 0.2, rel_tol=1e-9)
  198 + assert isclose(debug[0]["approx_image_knn_score"], 0.7, rel_tol=1e-9)
  199 + assert debug[0]["text_knn_source"] == "exact_text_knn_query"
  200 + assert debug[0]["image_knn_source"] == "exact_image_knn_query"
  201 +
  202 +
  203 +def test_fuse_scores_and_resort_falls_back_to_ann_when_exact_knn_missing():
  204 + hits = [
  205 + {
  206 + "_id": "ann-only-hit",
  207 + "_score": 1.0,
  208 + "matched_queries": {
  209 + "base_query": 1.5,
  210 + "knn_query": 0.4,
  211 + "image_knn_query": 0.5,
  212 + },
  213 + }
  214 + ]
  215 +
  216 + debug = fuse_scores_and_resort(hits, [0.8], debug=True)
  217 +
  218 + assert isclose(debug[0]["text_knn_score"], 0.4, rel_tol=1e-9)
  219 + assert isclose(debug[0]["image_knn_score"], 0.5, rel_tol=1e-9)
  220 + assert isclose(debug[0]["approx_text_knn_score"], 0.4, rel_tol=1e-9)
  221 + assert isclose(debug[0]["approx_image_knn_score"], 0.5, rel_tol=1e-9)
  222 + assert debug[0]["text_knn_source"] == "knn_query"
  223 + assert debug[0]["image_knn_source"] == "image_knn_query"
  224 +
  225 +
175 226 def test_fuse_scores_and_resort_applies_knn_dismax_weights_and_tie_breaker():
176 227 hits = [
177 228 {
... ... @@ -206,6 +257,96 @@ def test_fuse_scores_and_resort_applies_knn_dismax_weights_and_tie_breaker():
206 257 assert isclose(debug[0]["knn_support_score"], 0.5, rel_tol=1e-9)
207 258  
208 259  
  260 +def test_fuse_scores_and_resort_can_add_weighted_text_and_image_knn_factors():
  261 + hits = [
  262 + {
  263 + "_id": "a",
  264 + "_score": 1.0,
  265 + "matched_queries": {
  266 + "base_query": 2.0,
  267 + "knn_query": 0.4,
  268 + "image_knn_query": 0.5,
  269 + },
  270 + }
  271 + ]
  272 + fusion = RerankFusionConfig(
  273 + rerank_bias=0.0,
  274 + rerank_exponent=1.0,
  275 + text_bias=0.0,
  276 + text_exponent=1.0,
  277 + knn_text_weight=2.0,
  278 + knn_image_weight=1.0,
  279 + knn_tie_breaker=0.25,
  280 + knn_bias=0.1,
  281 + knn_exponent=1.0,
  282 + knn_text_exponent=2.0,
  283 + knn_image_exponent=3.0,
  284 + )
  285 +
  286 + debug = fuse_scores_and_resort(hits, [0.8], fusion=fusion, debug=True)
  287 +
  288 + weighted_text_knn = 0.8
  289 + weighted_image_knn = 0.5
  290 + expected_knn = weighted_text_knn + 0.25 * weighted_image_knn
  291 + expected_fused = (
  292 + 0.8
  293 + * 2.0
  294 + * (expected_knn + 0.1)
  295 + * ((weighted_text_knn + 0.1) ** 2.0)
  296 + * ((weighted_image_knn + 0.1) ** 3.0)
  297 + )
  298 +
  299 + assert isclose(hits[0]["_fused_score"], expected_fused, rel_tol=1e-9)
  300 + assert isclose(debug[0]["text_knn_factor"], (weighted_text_knn + 0.1) ** 2.0, rel_tol=1e-9)
  301 + assert isclose(debug[0]["image_knn_factor"], (weighted_image_knn + 0.1) ** 3.0, rel_tol=1e-9)
  302 + assert "weighted_text_knn_score=" in debug[0]["fusion_summary"]
  303 + assert "weighted_image_knn_score=" in debug[0]["fusion_summary"]
  304 +
  305 +
  306 +def test_coarse_resort_hits_can_add_weighted_text_and_image_knn_factors():
  307 + hits = [
  308 + {
  309 + "_id": "coarse-a",
  310 + "_score": 1.0,
  311 + "matched_queries": {
  312 + "base_query": 2.0,
  313 + "knn_query": 0.4,
  314 + "image_knn_query": 0.5,
  315 + },
  316 + }
  317 + ]
  318 + fusion = CoarseRankFusionConfig(
  319 + es_bias=0.0,
  320 + es_exponent=1.0,
  321 + text_bias=0.0,
  322 + text_exponent=1.0,
  323 + knn_text_weight=2.0,
  324 + knn_image_weight=1.0,
  325 + knn_tie_breaker=0.25,
  326 + knn_bias=0.1,
  327 + knn_exponent=1.0,
  328 + knn_text_exponent=2.0,
  329 + knn_image_exponent=3.0,
  330 + )
  331 +
  332 + debug = coarse_resort_hits(hits, fusion=fusion, debug=True)
  333 +
  334 + weighted_text_knn = 0.8
  335 + weighted_image_knn = 0.5
  336 + expected_knn = weighted_text_knn + 0.25 * weighted_image_knn
  337 + expected_coarse = (
  338 + 1.0
  339 + * 2.0
  340 + * (expected_knn + 0.1)
  341 + * ((weighted_text_knn + 0.1) ** 2.0)
  342 + * ((weighted_image_knn + 0.1) ** 3.0)
  343 + )
  344 +
  345 + assert isclose(hits[0]["_coarse_score"], expected_coarse, rel_tol=1e-9)
  346 + assert isclose(debug[0]["coarse_text_knn_factor"], (weighted_text_knn + 0.1) ** 2.0, rel_tol=1e-9)
  347 + assert isclose(debug[0]["coarse_image_knn_factor"], (weighted_image_knn + 0.1) ** 3.0, rel_tol=1e-9)
  348 +
  349 +
209 350 def test_run_lightweight_rerank_sorts_by_fused_stage_score(monkeypatch):
210 351 hits = [
211 352 {
... ...
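The two weighted-factor tests in this file encode the same multiplicative fusion arithmetic. A standalone sketch of that formula, reconstructed from the `expected_fused` / `expected_coarse` expressions (the function name and parameter defaults mirror the test fixtures, not the module's API):

```python
def fuse(primary_score, base_score, text_knn, image_knn, *,
         knn_text_weight=2.0, knn_image_weight=1.0, knn_tie_breaker=0.25,
         knn_bias=0.1, knn_exponent=1.0,
         knn_text_exponent=2.0, knn_image_exponent=3.0):
    wt = knn_text_weight * text_knn
    wi = knn_image_weight * image_knn
    # Dis-max combination: the stronger channel leads and the weaker one
    # contributes only a tie-breaker fraction of its score.
    knn = max(wt, wi) + knn_tie_breaker * min(wt, wi)
    # Every factor is biased before exponentiation so a zero channel does not
    # annihilate the product.
    return (
        primary_score
        * base_score
        * ((knn + knn_bias) ** knn_exponent)
        * ((wt + knn_bias) ** knn_text_exponent)
        * ((wi + knn_bias) ** knn_image_exponent)
    )

# With the fixture values (rerank 0.8, base_query 2.0, knn_query 0.4,
# image_knn_query 0.5) this reproduces the test's expected_fused product:
# 0.8 * 2.0 * (0.925 + 0.1) * (0.9 ** 2) * (0.6 ** 3).
print(fuse(0.8, 2.0, 0.4, 0.5))
```

The coarse-rank variant in `coarse_resort_hits` uses the same shape with the ES score in place of the rerank score, which is why both tests share the 0.8 / 0.5 weighted intermediates.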
tests/test_search_rerank_window.py
1 1 from __future__ import annotations
2 2  
3   -from dataclasses import dataclass
  3 +from dataclasses import dataclass, field
4 4 from pathlib import Path
5 5 from types import SimpleNamespace
6 6 from typing import Any, Dict, List
... ... @@ -30,7 +30,10 @@ class _FakeParsedQuery:
30 30 rewritten_query: str
31 31 detected_language: str = "en"
32 32 translations: Dict[str, str] = None
  33 + keywords_queries: Dict[str, str] = field(default_factory=dict)
33 34 query_vector: Any = None
  35 + image_query_vector: Any = None
  36 + query_tokens: List[str] = field(default_factory=list)
34 37 style_intent_profile: Any = None
35 38  
36 39 def text_for_rerank(self) -> str:
... ... @@ -89,6 +92,15 @@ class _FakeQueryParser:
89 92  
90 93  
91 94 class _FakeQueryBuilder:
  95 + knn_text_k = 120
  96 + knn_text_k_long = 160
  97 + knn_text_num_candidates = 400
  98 + knn_text_num_candidates_long = 500
  99 + knn_text_boost = 20.0
  100 + knn_image_k = 120
  101 + knn_image_num_candidates = 400
  102 + knn_image_boost = 20.0
  103 +
92 104 def build_query(self, **kwargs):
93 105 return {
94 106 "query": {"match_all": {}},
... ... @@ -185,13 +197,24 @@ class _FakeESClient:
185 197 }
186 198  
187 199  
188   -def _build_search_config(*, rerank_enabled: bool = True, rerank_window: int = 384):
  200 +def _build_search_config(
  201 + *,
  202 + rerank_enabled: bool = True,
  203 + rerank_window: int = 384,
  204 + exact_knn_rescore_enabled: bool = False,
  205 + exact_knn_rescore_window: int = 0,
  206 +):
189 207 return SearchConfig(
190 208 field_boosts={"title.en": 3.0},
191 209 indexes=[IndexConfig(name="default", label="default", fields=["title.en"])],
192 210 query_config=QueryConfig(enable_text_embedding=False, enable_query_rewrite=False),
193 211 function_score=FunctionScoreConfig(),
194   - rerank=RerankConfig(enabled=rerank_enabled, rerank_window=rerank_window),
  212 + rerank=RerankConfig(
  213 + enabled=rerank_enabled,
  214 + rerank_window=rerank_window,
  215 + exact_knn_rescore_enabled=exact_knn_rescore_enabled,
  216 + exact_knn_rescore_window=exact_knn_rescore_window,
  217 + ),
195 218 spu_config=SPUConfig(enabled=False),
196 219 es_index_name="test_products",
197 220 es_settings={},
... ... @@ -289,7 +312,11 @@ def test_config_loader_rerank_enabled_defaults_true(tmp_path: Path):
289 312 },
290 313 "spu_config": {"enabled": False},
291 314 "function_score": {"score_mode": "sum", "boost_mode": "multiply", "functions": []},
292   - "rerank": {"rerank_window": 384},
  315 + "rerank": {
  316 + "rerank_window": 384,
  317 + "exact_knn_rescore_enabled": True,
  318 + "exact_knn_rescore_window": 160,
  319 + },
293 320 }
294 321 config_path = tmp_path / "config.yaml"
295 322 config_path.write_text(yaml.safe_dump(config_data), encoding="utf-8")
... ... @@ -298,6 +325,8 @@ def test_config_loader_rerank_enabled_defaults_true(tmp_path: Path):
298 325 loaded = loader.load_config(validate=False)
299 326  
300 327 assert loaded.rerank.enabled is True
  328 + assert loaded.rerank.exact_knn_rescore_enabled is True
  329 + assert loaded.rerank.exact_knn_rescore_window == 160
301 330  
302 331  
303 332 def test_config_loader_parses_named_rerank_instances(tmp_path: Path):
... ... @@ -583,7 +612,7 @@ def test_searcher_rerank_prefetch_source_includes_sku_fields_when_style_intent_a
583 612 }
584 613  
585 614  
586   -def test_searcher_skips_rerank_when_request_explicitly_false(monkeypatch):
  615 +def test_searcher_keeps_previous_stage_order_when_request_explicitly_disables_rerank(monkeypatch):
587 616 es_client = _FakeESClient()
588 617 searcher = _build_searcher(_build_search_config(rerank_enabled=True), es_client)
589 618 context = create_request_context(reqid="t2", uid="u2")
... ... @@ -593,28 +622,95 @@ def test_searcher_skips_rerank_when_request_explicitly_false(monkeypatch):
593 622 lambda: SimpleNamespace(get_tenant_config=lambda tenant_id: {"index_languages": ["en"]}),
594 623 )
595 624  
596   - called: Dict[str, int] = {"count": 0}
  625 + called: Dict[str, int] = {"count": 0, "fine": 0}
  626 +
  627 + def _fake_run_lightweight_rerank(**kwargs):
  628 + called["fine"] += 1
  629 + hits = kwargs["es_hits"]
  630 + for idx, hit in enumerate(hits):
  631 + hit["_fine_score"] = float(idx + 1)
  632 + hits.reverse()
  633 + return [hit["_fine_score"] for hit in hits], {"stage": "fine"}, []
597 634  
598 635 def _fake_run_rerank(**kwargs):
599 636 called["count"] += 1
600 637 return kwargs["es_response"], None, []
601 638  
  639 + monkeypatch.setattr("search.rerank_client.run_lightweight_rerank", _fake_run_lightweight_rerank)
602 640 monkeypatch.setattr("search.rerank_client.run_rerank", _fake_run_rerank)
603 641  
604   - searcher.search(
  642 + result = searcher.search(
605 643 query="toy",
606 644 tenant_id="162",
607 645 from_=20,
608 646 size=10,
609 647 context=context,
610 648 enable_rerank=False,
  649 + debug=True,
611 650 )
612 651  
613 652 assert called["count"] == 0
614   - assert es_client.calls[0]["from_"] == 20
615   - assert es_client.calls[0]["size"] == 10
616   - assert es_client.calls[0]["include_named_queries_score"] is False
617   - assert len(es_client.calls) == 1
  653 + assert called["fine"] == 1
  654 + assert es_client.calls[0]["from_"] == 0
  655 + assert es_client.calls[0]["size"] == searcher.config.coarse_rank.input_window
  656 + assert es_client.calls[0]["include_named_queries_score"] is True
  657 + assert len(es_client.calls) == 3
  658 + assert es_client.calls[2]["body"]["query"]["ids"]["values"] == [str(i) for i in range(363, 353, -1)]
  659 + assert len(result.results) == 10
  660 + assert [item.spu_id for item in result.results[:3]] == ["363", "362", "361"]
  661 + assert result.debug_info["rerank"]["enabled"] is False
  662 + assert result.debug_info["rerank"]["applied"] is False
  663 + assert result.debug_info["rerank"]["skipped_reason"] == "disabled"
  664 + assert result.debug_info["per_result"][0]["ranking_funnel"]["rerank"]["rank"] == 21
  665 +
  666 +
  667 +def test_searcher_keeps_previous_stage_order_when_config_disables_rerank(monkeypatch):
  668 + es_client = _FakeESClient()
  669 + searcher = _build_searcher(_build_search_config(rerank_enabled=False), es_client)
  670 + context = create_request_context(reqid="t2b", uid="u2b")
  671 +
  672 + monkeypatch.setattr(
  673 + "search.searcher.get_tenant_config_loader",
  674 + lambda: SimpleNamespace(get_tenant_config=lambda tenant_id: {"index_languages": ["en"]}),
  675 + )
  676 +
  677 + called: Dict[str, int] = {"count": 0, "fine": 0}
  678 +
  679 + def _fake_run_lightweight_rerank(**kwargs):
  680 + called["fine"] += 1
  681 + hits = kwargs["es_hits"]
  682 + hits.reverse()
  683 + for idx, hit in enumerate(hits):
  684 + hit["_fine_score"] = float(len(hits) - idx)
  685 + return [hit["_fine_score"] for hit in hits], {"stage": "fine"}, []
  686 +
  687 + def _fake_run_rerank(**kwargs):
  688 + called["count"] += 1
  689 + return kwargs["es_response"], None, []
  690 +
  691 + monkeypatch.setattr("search.rerank_client.run_lightweight_rerank", _fake_run_lightweight_rerank)
  692 + monkeypatch.setattr("search.rerank_client.run_rerank", _fake_run_rerank)
  693 +
  694 + result = searcher.search(
  695 + query="toy",
  696 + tenant_id="162",
  697 + from_=0,
  698 + size=5,
  699 + context=context,
  700 + enable_rerank=None,
  701 + debug=True,
  702 + )
  703 +
  704 + assert called["count"] == 0
  705 + assert called["fine"] == 1
  706 + assert es_client.calls[0]["from_"] == 0
  707 + assert es_client.calls[0]["size"] == searcher.config.coarse_rank.input_window
  708 + assert es_client.calls[0]["include_named_queries_score"] is True
  709 + assert len(result.results) == 5
  710 + assert [item.spu_id for item in result.results] == ["383", "382", "381", "380", "379"]
  711 + assert result.debug_info["rerank"]["enabled"] is False
  712 + assert result.debug_info["rerank"]["applied"] is False
  713 + assert result.debug_info["rerank"]["skipped_reason"] == "disabled"
618 714  
619 715  
620 716 def test_searcher_skips_rerank_when_page_exceeds_window(monkeypatch):
... ... @@ -919,7 +1015,8 @@ def test_searcher_promotes_sku_by_embedding_when_query_has_no_direct_option_matc
919 1015  
920 1016 def test_searcher_debug_info_uses_initial_es_max_score_for_normalization(monkeypatch):
921 1017 es_client = _FakeESClient(total_hits=3)
922   - searcher = _build_searcher(_build_search_config(rerank_enabled=False), es_client)
  1018 + cfg = _build_search_config(rerank_enabled=False)
  1019 + searcher = _build_searcher(cfg, es_client)
923 1020 context = create_request_context(reqid="dbg", uid="u-dbg")
924 1021  
925 1022 monkeypatch.setattr(
... ... @@ -939,7 +1036,8 @@ def test_searcher_debug_info_uses_initial_es_max_score_for_normalization(monkeyp
939 1036  
940 1037 assert result.debug_info["query_analysis"]["index_languages"] == ["en", "zh"]
941 1038 assert result.debug_info["query_analysis"]["query_tokens"] == []
942   - assert result.debug_info["es_query_context"]["es_fetch_size"] == 2
  1039 + expected_es_fetch = max(cfg.rerank.rerank_window, cfg.coarse_rank.input_window)
  1040 + assert result.debug_info["es_query_context"]["es_fetch_size"] == expected_es_fetch
943 1041 assert result.debug_info["es_response"]["es_score_normalization_factor"] == 3.0
944 1042 assert result.debug_info["per_result"][0]["initial_rank"] == 1
945 1043 assert result.debug_info["per_result"][0]["final_rank"] == 1
... ... @@ -947,6 +1045,166 @@ def test_searcher_debug_info_uses_initial_es_max_score_for_normalization(monkeyp
947 1045 assert result.debug_info["per_result"][1]["es_score_normalized"] == 2.0 / 3.0
948 1046  
949 1047  
  1048 +def test_searcher_attaches_exact_knn_rescore_for_rank_window(monkeypatch):
  1049 + class _VectorQueryParser:
  1050 + def parse(self, query: str, tenant_id: str, generate_vector: bool, context: Any, target_languages: Any = None):
  1051 + return _FakeParsedQuery(
  1052 + original_query=query,
  1053 + query_normalized=query,
  1054 + rewritten_query=query,
  1055 + translations={},
  1056 + query_vector=np.array([0.1, 0.2, 0.3], dtype=np.float32),
  1057 + image_query_vector=np.array([0.4, 0.5, 0.6], dtype=np.float32),
  1058 + query_tokens=["dress", "formal", "spring", "summer", "floral"],
  1059 + )
  1060 +
  1061 + es_client = _FakeESClient(total_hits=5)
  1062 + base = _build_search_config(
  1063 + rerank_enabled=True,
  1064 + rerank_window=5,
  1065 + exact_knn_rescore_enabled=True,
  1066 + exact_knn_rescore_window=3,
  1067 + )
  1068 + config = SearchConfig(
  1069 + field_boosts=base.field_boosts,
  1070 + indexes=base.indexes,
  1071 + query_config=QueryConfig(
  1072 + enable_text_embedding=True,
  1073 + enable_query_rewrite=False,
  1074 + text_embedding_field="title_embedding",
  1075 + image_embedding_field="image_embedding.vector",
  1076 + ),
  1077 + function_score=base.function_score,
  1078 + coarse_rank=base.coarse_rank,
  1079 + fine_rank=FineRankConfig(enabled=False, input_window=5, output_window=5),
  1080 + rerank=base.rerank,
  1081 + spu_config=base.spu_config,
  1082 + es_index_name=base.es_index_name,
  1083 + es_settings=base.es_settings,
  1084 + )
  1085 + searcher = Searcher(
  1086 + es_client=es_client,
  1087 + config=config,
  1088 + query_parser=_VectorQueryParser(),
  1089 + image_encoder=SimpleNamespace(),
  1090 + )
  1091 + context = create_request_context(reqid="exact-rescore", uid="u-exact")
  1092 +
  1093 + monkeypatch.setattr(
  1094 + "search.searcher.get_tenant_config_loader",
  1095 + lambda: SimpleNamespace(get_tenant_config=lambda tenant_id: {"index_languages": ["en"]}),
  1096 + )
  1097 +
  1098 + searcher.search(
  1099 + query="dress",
  1100 + tenant_id="162",
  1101 + from_=0,
  1102 + size=2,
  1103 + context=context,
  1104 + enable_rerank=False,
  1105 + debug=True,
  1106 + )
  1107 +
  1108 + body = es_client.calls[0]["body"]
  1109 + assert body["rescore"]["window_size"] == 3
  1110 + assert body["rescore"]["query"]["score_mode"] == "total"
  1111 + assert body["rescore"]["query"]["rescore_query_weight"] == 0.0
  1112 + should = body["rescore"]["query"]["rescore_query"]["bool"]["should"]
  1113 + names = []
  1114 + for clause in should:
  1115 + if "script_score" in clause:
  1116 + names.append(clause["script_score"]["_name"])
  1117 + elif "nested" in clause:
  1118 + names.append(clause["nested"]["_name"])
  1119 + assert names == ["exact_text_knn_query", "exact_image_knn_query"]
  1120 + recall_query = body["query"]
  1121 + if "bool" in recall_query and recall_query["bool"].get("must"):
  1122 + recall_query = recall_query["bool"]["must"][0]
  1123 + if "function_score" in recall_query:
  1124 + recall_query = recall_query["function_score"]["query"]
  1125 + recall_should = recall_query["bool"]["should"]
  1126 + text_knn_clause = next(
  1127 + clause["knn"]
  1128 + for clause in recall_should
  1129 + if clause.get("knn", {}).get("_name") == "knn_query"
  1130 + )
  1131 + image_knn_clause = next(
  1132 + clause["nested"]["query"]["knn"]
  1133 + for clause in recall_should
  1134 + if clause.get("nested", {}).get("_name") == "image_knn_query"
  1135 + )
  1136 + exact_text_clause = next(
  1137 + clause["script_score"]
  1138 + for clause in should
  1139 + if clause.get("script_score", {}).get("_name") == "exact_text_knn_query"
  1140 + )
  1141 + exact_image_clause = next(
  1142 + clause["nested"]["query"]["script_score"]
  1143 + for clause in should
  1144 + if clause.get("nested", {}).get("_name") == "exact_image_knn_query"
  1145 + )
  1146 + assert text_knn_clause["boost"] == 28.0
  1147 + assert exact_text_clause["script"]["params"]["boost"] == text_knn_clause["boost"]
  1148 + assert image_knn_clause["boost"] == 20.0
  1149 + assert exact_image_clause["script"]["params"]["boost"] == image_knn_clause["boost"]
  1150 +
  1151 +
  1152 +def test_searcher_skips_exact_knn_rescore_outside_rank_window(monkeypatch):
  1153 + class _VectorQueryParser:
  1154 + def parse(self, query: str, tenant_id: str, generate_vector: bool, context: Any, target_languages: Any = None):
  1155 + return _FakeParsedQuery(
  1156 + original_query=query,
  1157 + query_normalized=query,
  1158 + rewritten_query=query,
  1159 + translations={},
  1160 + query_vector=np.array([0.1, 0.2, 0.3], dtype=np.float32),
  1161 + )
  1162 +
  1163 + es_client = _FakeESClient(total_hits=20)
  1164 + base = _build_search_config(
  1165 + rerank_enabled=True,
  1166 + rerank_window=5,
  1167 + exact_knn_rescore_enabled=True,
  1168 + exact_knn_rescore_window=4,
  1169 + )
  1170 + config = SearchConfig(
  1171 + field_boosts=base.field_boosts,
  1172 + indexes=base.indexes,
  1173 + query_config=QueryConfig(
  1174 + enable_text_embedding=True,
  1175 + enable_query_rewrite=False,
  1176 + text_embedding_field="title_embedding",
  1177 + ),
  1178 + function_score=base.function_score,
  1179 + coarse_rank=base.coarse_rank,
  1180 + fine_rank=FineRankConfig(enabled=False, input_window=5, output_window=5),
  1181 + rerank=base.rerank,
  1182 + spu_config=base.spu_config,
  1183 + es_index_name=base.es_index_name,
  1184 + es_settings=base.es_settings,
  1185 + )
  1186 + searcher = _build_searcher(config, es_client)
  1187 + searcher.query_parser = _VectorQueryParser()
  1188 + context = create_request_context(reqid="exact-rescore-off", uid="u-exact-off")
  1189 +
  1190 + monkeypatch.setattr(
  1191 + "search.searcher.get_tenant_config_loader",
  1192 + lambda: SimpleNamespace(get_tenant_config=lambda tenant_id: {"index_languages": ["en"]}),
  1193 + )
  1194 +
  1195 + searcher.search(
  1196 + query="dress",
  1197 + tenant_id="162",
  1198 + from_=5,
  1199 + size=2,
  1200 + context=context,
  1201 + enable_rerank=False,
  1202 + )
  1203 +
  1204 + body = es_client.calls[0]["body"]
  1205 + assert "rescore" not in body
  1206 +
  1207 +
950 1208 def test_searcher_rerank_rank_change_falls_back_to_coarse_rank_when_fine_disabled(monkeypatch):
951 1209 es_client = _FakeESClient(total_hits=5)
952 1210 config = _build_search_config(rerank_enabled=True, rerank_window=5)
... ... @@ -970,6 +1228,12 @@ def test_searcher_rerank_rank_change_falls_back_to_coarse_rank_when_fine_disable
970 1228 lambda: SimpleNamespace(get_tenant_config=lambda tenant_id: {"index_languages": ["en"]}),
971 1229 )
972 1230  
  1231 + fine_called: Dict[str, int] = {"count": 0}
  1232 +
  1233 + def _fake_run_lightweight_rerank(**kwargs):
  1234 + fine_called["count"] += 1
  1235 + return [], {"stage": "fine"}, []
  1236 +
973 1237 def _fake_run_rerank(**kwargs):
974 1238 hits = kwargs["es_response"]["hits"]["hits"]
975 1239 hits.reverse()
... ... @@ -994,6 +1258,7 @@ def test_searcher_rerank_rank_change_falls_back_to_coarse_rank_when_fine_disable
994 1258 )
995 1259 return kwargs["es_response"], {"model": "final-reranker"}, fused_debug
996 1260  
  1261 + monkeypatch.setattr("search.rerank_client.run_lightweight_rerank", _fake_run_lightweight_rerank)
997 1262 monkeypatch.setattr("search.rerank_client.run_rerank", _fake_run_rerank)
998 1263  
999 1264 result = searcher.search(
... ... @@ -1008,7 +1273,12 @@ def test_searcher_rerank_rank_change_falls_back_to_coarse_rank_when_fine_disable
1008 1273  
1009 1274 per_result = {row["spu_id"]: row for row in result.debug_info["per_result"]}
1010 1275 moved = per_result["4"]["ranking_funnel"]
1011   - assert moved["fine_rank"]["rank"] is None
  1276 + assert fine_called["count"] == 0
  1277 + assert result.debug_info["fine_rank"]["enabled"] is False
  1278 + assert result.debug_info["fine_rank"]["applied"] is False
  1279 + assert result.debug_info["fine_rank"]["skipped_reason"] == "disabled"
  1280 + assert moved["fine_rank"]["rank"] == 5
  1281 + assert moved["fine_rank"]["rank_change"] == 0
1012 1282 assert moved["rerank"]["rank"] == 1
1013 1283 assert moved["rerank"]["rank_change"] == 4
1014 1284 assert moved["final_page"]["rank_change"] == 0
... ...
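The exact-kNN rescore tests above assert on a specific body shape: a `rescore` block with `score_mode: total`, `rescore_query_weight: 0.0`, and named `script_score` / `nested` should-clauses whose `params["boost"]` mirrors the ANN clauses. A sketch of a builder producing that shape — the script sources, `exists` guards, and nested path are assumptions; only the `_name` values, weights, and boost params are pinned by the tests:

```python
def build_exact_knn_rescore(window_size, text_boost, image_boost, text_vec, image_vec):
    return {
        "window_size": window_size,
        "query": {
            "score_mode": "total",
            # Weight 0.0: the exact clauses never change the recall-stage
            # score; they only surface per-clause scores via matched_queries.
            "rescore_query_weight": 0.0,
            "rescore_query": {
                "bool": {
                    "should": [
                        {
                            "script_score": {
                                "_name": "exact_text_knn_query",
                                "query": {"exists": {"field": "title_embedding"}},
                                "script": {
                                    # Assumed script body; tests only pin params["boost"].
                                    "source": "cosineSimilarity(params.qv, 'title_embedding') * params.boost",
                                    "params": {"qv": text_vec, "boost": text_boost},
                                },
                            }
                        },
                        {
                            "nested": {
                                "_name": "exact_image_knn_query",
                                "path": "image_embedding",
                                "query": {
                                    "script_score": {
                                        "query": {"exists": {"field": "image_embedding.vector"}},
                                        "script": {
                                            "source": "cosineSimilarity(params.qv, 'image_embedding.vector') * params.boost",
                                            "params": {"qv": image_vec, "boost": image_boost},
                                        },
                                    }
                                },
                            }
                        },
                    ]
                }
            },
        },
    }
```

Walking `should` the way `test_searcher_attaches_exact_knn_rescore_for_rank_window` does (taking `script_score["_name"]` or `nested["_name"]` per clause) yields `["exact_text_knn_query", "exact_image_knn_query"]` against this body.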
tests/test_translation_converter_resolution.py 0 → 100644
... ... @@ -0,0 +1,85 @@
  1 +from __future__ import annotations
  2 +
  3 +import sys
  4 +import types
  5 +
  6 +import pytest
  7 +
  8 +import translation.ct2_conversion as ct2_conversion
  9 +
  10 +
  11 +class _FakeTransformersConverter:
  12 + def __init__(self, model_name_or_path):
  13 + self.model_name_or_path = model_name_or_path
  14 + self.load_calls = []
  15 +
  16 + def load_model(self, model_class, resolved_model_name_or_path, **kwargs):
  17 + self.load_calls.append(
  18 + {
  19 + "model_class": model_class,
  20 + "resolved_model_name_or_path": resolved_model_name_or_path,
  21 + "kwargs": dict(kwargs),
  22 + }
  23 + )
  24 + if "dtype" in kwargs or "torch_dtype" in kwargs:
  25 + raise TypeError("M2M100ForConditionalGeneration.__init__() got an unexpected keyword argument 'dtype'")
  26 + return {"loaded": True, "path": resolved_model_name_or_path}
  27 +
  28 + def convert(self, output_dir, quantization=None, force=False):
  29 + loaded = self.load_model("FakeModel", self.model_name_or_path, dtype="float32")
  30 + return {
  31 + "loaded": loaded,
  32 + "output_dir": output_dir,
  33 + "quantization": quantization,
  34 + "force": force,
  35 + "load_calls": list(self.load_calls),
  36 + }
  37 +
  38 +
  39 +def _install_fake_ctranslate2(monkeypatch, base_converter):
  40 + converters_module = types.ModuleType("ctranslate2.converters")
  41 + converters_module.TransformersConverter = base_converter
  42 + ctranslate2_module = types.ModuleType("ctranslate2")
  43 + ctranslate2_module.converters = converters_module
  44 +
  45 + monkeypatch.setitem(sys.modules, "ctranslate2", ctranslate2_module)
  46 + monkeypatch.setitem(sys.modules, "ctranslate2.converters", converters_module)
  47 +
  48 +
  49 +def test_convert_transformers_model_retries_without_torch_dtype(monkeypatch):
  50 + _install_fake_ctranslate2(monkeypatch, _FakeTransformersConverter)
  51 + fake_transformers = types.ModuleType("transformers")
  52 + fake_transformers.AutoConfig = types.SimpleNamespace(
  53 + from_pretrained=lambda path: types.SimpleNamespace(torch_dtype="float32", path=path)
  54 + )
  55 + monkeypatch.setitem(sys.modules, "transformers", fake_transformers)
  56 +
  57 + result = ct2_conversion.convert_transformers_model("fake-model", "/tmp/out", "float16")
  58 +
  59 + assert result["loaded"] == {"loaded": True, "path": "fake-model"}
  60 + assert result["output_dir"] == "/tmp/out"
  61 + assert result["quantization"] == "float16"
  62 + assert result["force"] is False
  63 + assert len(result["load_calls"]) == 2
  64 + assert result["load_calls"][0] == {
  65 + "model_class": "FakeModel",
  66 + "resolved_model_name_or_path": "fake-model",
  67 + "kwargs": {"dtype": "float32"},
  68 + }
  69 + assert result["load_calls"][1]["model_class"] == "FakeModel"
  70 + assert result["load_calls"][1]["resolved_model_name_or_path"] == "fake-model"
  71 + assert getattr(result["load_calls"][1]["kwargs"]["config"], "torch_dtype", "missing") is None
  72 +
  73 +
  74 +def test_convert_transformers_model_preserves_unrelated_type_errors(monkeypatch):
  75 + class _AlwaysFailingConverter(_FakeTransformersConverter):
  76 + def load_model(self, model_class, resolved_model_name_or_path, **kwargs):
  77 + raise TypeError("different constructor error")
  78 +
  79 + _install_fake_ctranslate2(monkeypatch, _AlwaysFailingConverter)
  80 + fake_transformers = types.ModuleType("transformers")
  81 + fake_transformers.AutoConfig = types.SimpleNamespace(from_pretrained=lambda path: types.SimpleNamespace(path=path))
  82 + monkeypatch.setitem(sys.modules, "transformers", fake_transformers)
  83 +
  84 + with pytest.raises(TypeError, match="different constructor error"):
  85 + ct2_conversion.convert_transformers_model("fake-model", "/tmp/out", "float16")
... ...
tests/test_translation_local_backends.py
... ... @@ -201,6 +201,51 @@ def test_nllb_ctranslate2_accepts_finnish_short_code(monkeypatch):
201 201 assert backend.translator.last_translate_batch_kwargs["target_prefix"] == [["zho_Hans"]]
202 202  
203 203  
  204 +def test_nllb_ctranslate2_falls_back_to_model_id_when_local_dir_is_wrong_type(tmp_path, monkeypatch):
  205 + wrong_dir = tmp_path / "wrong-nllb"
  206 + wrong_dir.mkdir()
  207 + (wrong_dir / "config.json").write_text('{"model_type":"led"}', encoding="utf-8")
  208 +
  209 + monkeypatch.setattr(NLLBCTranslate2TranslationBackend, "_load_runtime", _stub_load_ct2_runtime)
  210 +
  211 + backend = NLLBCTranslate2TranslationBackend(
  212 + name="nllb-200-distilled-600m",
  213 + model_id="facebook/nllb-200-distilled-600M",
  214 + model_dir=str(wrong_dir),
  215 + device="cpu",
  216 + torch_dtype="float32",
  217 + batch_size=1,
  218 + max_input_length=16,
  219 + max_new_tokens=16,
  220 + num_beams=1,
  221 + )
  222 +
  223 + assert backend._model_source() == "facebook/nllb-200-distilled-600M"
  224 + assert backend._tokenizer_source() == "facebook/nllb-200-distilled-600M"
  225 +
  226 +
  227 +def test_nllb_ctranslate2_falls_back_to_model_id_when_local_dir_is_incomplete(tmp_path, monkeypatch):
  228 + incomplete_dir = tmp_path / "incomplete-nllb"
  229 + incomplete_dir.mkdir()
  230 + (incomplete_dir / "ctranslate2-float16").mkdir()
  231 +
  232 + monkeypatch.setattr(NLLBCTranslate2TranslationBackend, "_load_runtime", _stub_load_ct2_runtime)
  233 +
  234 + backend = NLLBCTranslate2TranslationBackend(
  235 + name="nllb-200-distilled-600m",
  236 + model_id="facebook/nllb-200-distilled-600M",
  237 + model_dir=str(incomplete_dir),
  238 + device="cpu",
  239 + torch_dtype="float32",
  240 + batch_size=1,
  241 + max_input_length=16,
  242 + max_new_tokens=16,
  243 + num_beams=1,
  244 + )
  245 +
  246 + assert backend._model_source() == "facebook/nllb-200-distilled-600M"
  247 +
  248 +
204 249 def test_nllb_resolves_flores_short_tags_and_iso_no():
205 250 cat = build_nllb_language_catalog(None)
206 251 assert resolve_nllb_language_code("ca", cat) == "cat_Latn"
... ...
tests/test_translator_failure_semantics.py
... ... @@ -197,6 +197,73 @@ def test_translation_route_log_focuses_on_routing_decision(monkeypatch, caplog):
197 197 ]
198 198  
199 199  
  200 +def test_service_skips_failed_backend_but_keeps_healthy_capabilities(monkeypatch):
  201 + monkeypatch.setattr(TranslationCache, "_init_redis_client", staticmethod(lambda: None))
  202 +
  203 + def _fake_create_backend(self, *, name, backend_type, cfg):
  204 + del self, backend_type, cfg
  205 + if name == "broken-nllb":
  206 + raise RuntimeError("broken model dir")
  207 +
  208 + class _Backend:
  209 + model = name
  210 +
  211 + @property
  212 + def supports_batch(self):
  213 + return True
  214 +
  215 + def translate(self, text, target_lang, source_lang=None, scene=None):
  216 + del target_lang, source_lang, scene
  217 + return text
  218 +
  219 + return _Backend()
  220 +
  221 + monkeypatch.setattr(TranslationService, "_create_backend", _fake_create_backend)
  222 + service = TranslationService(
  223 + {
  224 + "service_url": "http://127.0.0.1:6006",
  225 + "timeout_sec": 10.0,
  226 + "default_model": "llm",
  227 + "default_scene": "general",
  228 + "capabilities": {
  229 + "llm": {
  230 + "enabled": True,
  231 + "backend": "llm",
  232 + "model": "dummy-llm",
  233 + "base_url": "https://example.com",
  234 + "timeout_sec": 10.0,
  235 + "use_cache": True,
  236 + },
  237 + "broken-nllb": {
  238 + "enabled": True,
  239 + "backend": "local_nllb",
  240 + "model_id": "dummy",
  241 + "model_dir": "dummy",
  242 + "device": "cpu",
  243 + "torch_dtype": "float32",
  244 + "batch_size": 8,
  245 + "max_input_length": 16,
  246 + "max_new_tokens": 16,
  247 + "num_beams": 1,
  248 + "use_cache": True,
  249 + },
  250 + },
  251 + "cache": {
  252 + "ttl_seconds": 60,
  253 + "sliding_expiration": True,
  254 + },
  255 + }
  256 + )
  257 +
  258 + assert service.available_models == ["llm", "broken-nllb"]
  259 + assert service.loaded_models == ["llm"]
  260 + assert service.failed_models == ["broken-nllb"]
  261 + assert service.backend_errors["broken-nllb"] == "broken model dir"
  262 +
  263 + with pytest.raises(RuntimeError, match="failed to initialize"):
  264 + service.get_backend("broken-nllb")
  265 +
  266 +
200 267 def test_translation_cache_probe_models_order():
201 268 cfg = {"cache": {"model_quality_tiers": {"low": 10, "high": 50, "mid": 30}}}
202 269 assert translation_cache_probe_models(cfg, "low") == ["high", "mid", "low"]
... ...
1   -Subproject commit 03410570d4398084f5ca5c88ad968248e0f3fc5d
  1 +Subproject commit 4450c293368655449f14b5fc89e1d06e28d7f307
... ...
translation/README.md
... ... @@ -11,9 +11,9 @@
11 11 Related scripts and reports:
12 12 - Startup script: [`scripts/start_translator.sh`](/data/saas-search/scripts/start_translator.sh)
13 13 - Virtualenv setup: [`scripts/setup_translator_venv.sh`](/data/saas-search/scripts/setup_translator_venv.sh)
14   -- Model download: [`scripts/download_translation_models.py`](/data/saas-search/scripts/download_translation_models.py)
15   -- Local model benchmark: [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py)
16   -- Focused benchmark script: [`scripts/benchmark_translation_local_models_focus.py`](/data/saas-search/scripts/benchmark_translation_local_models_focus.py)
  14 +- Model download: [`scripts/translation/download_translation_models.py`](/data/saas-search/scripts/translation/download_translation_models.py)
  15 +- Local model benchmark: [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py)
  16 +- Focused benchmark script: [`benchmarks/translation/benchmark_translation_local_models_focus.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models_focus.py)
17 17 - Baseline performance report: [`perf_reports/20260318/translation_local_models/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models/README.md)
18 18 - CT2 extended report: [`perf_reports/20260318/translation_local_models_ct2/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2/README.md)
19 19 - CT2 focused tuning report: [`perf_reports/20260318/translation_local_models_ct2_focus/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2_focus/README.md)
... ... @@ -493,7 +493,7 @@ cd /data/saas-search
493 493 Download all local models:
494 494  
495 495 ```bash
496   -./.venv-translator/bin/python scripts/download_translation_models.py --all-local
  496 +./.venv-translator/bin/python scripts/translation/download_translation_models.py --all-local
497 497 ```
498 498  
499 499 After the download completes, the default directories should exist:
... ... @@ -550,8 +550,8 @@ curl -X POST http://127.0.0.1:6006/translate \
550 550 - After switching to CTranslate2, rerun a round of benchmarks, paying particular attention to single-request latency and concurrent tail latency for `nllb-200-distilled-600m`, and batch throughput for `opus-mt-*`.
551 551  
552 552 Performance scripts:
553   -- [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py)
554   -- [`scripts/benchmark_translation_local_models_focus.py`](/data/saas-search/scripts/benchmark_translation_local_models_focus.py)
  553 +- [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py)
  554 +- [`benchmarks/translation/benchmark_translation_local_models_focus.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models_focus.py)
555 555  
556 556 Datasets:
557 557 - [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv)
... ... @@ -601,14 +601,14 @@ curl -X POST http://127.0.0.1:6006/translate \
601 601  
602 602 ```bash
603 603 cd /data/saas-search
604   -./.venv-translator/bin/python scripts/benchmark_translation_local_models.py
  604 +./.venv-translator/bin/python benchmarks/translation/benchmark_translation_local_models.py
605 605 ```
606 606  
607 607 Command to reproduce this round's extended benchmark:
608 608  
609 609 ```bash
610 610 cd /data/saas-search
611   -./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
  611 +./.venv-translator/bin/python benchmarks/translation/benchmark_translation_local_models.py \
612 612 --suite extended \
613 613 --disable-cache \
614 614 --serial-items-per-case 256 \
... ... @@ -620,7 +620,7 @@ cd /data/saas-search
620 620 Single-model extended benchmark example:
621 621  
622 622 ```bash
623   -./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
  623 +./.venv-translator/bin/python benchmarks/translation/benchmark_translation_local_models.py \
624 624 --single \
625 625 --suite extended \
626 626 --model opus-mt-zh-en \
... ... @@ -639,7 +639,7 @@ cd /data/saas-search
639 639 Single-request latency reproduction:
640 640  
641 641 ```bash
642   -./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
  642 +./.venv-translator/bin/python benchmarks/translation/benchmark_translation_local_models.py \
643 643 --single \
644 644 --suite extended \
645 645 --model nllb-200-distilled-600m \
... ...
translation/backends/local_ctranslate2.py
... ... @@ -4,9 +4,7 @@ from __future__ import annotations
4 4  
5 5 import logging
6 6 import os
7   -import shutil
8   -import subprocess
9   -import sys
  7 +import json
10 8 import threading
11 9 from pathlib import Path
12 10 from typing import Dict, List, Optional, Sequence, Union
... ... @@ -24,6 +22,7 @@ from translation.text_splitter import (
24 22 join_translated_segments,
25 23 split_text_for_translation,
26 24 )
  25 +from translation.ct2_conversion import convert_transformers_model
27 26  
28 27 logger = logging.getLogger(__name__)
29 28  
... ... @@ -76,17 +75,18 @@ def _derive_ct2_model_dir(model_dir: str, compute_type: str) -> str:
76 75 return str(Path(model_dir).expanduser() / f"ctranslate2-{normalized}")
77 76  
78 77  
79   -def _resolve_converter_binary() -> str:
80   - candidate = shutil.which("ct2-transformers-converter")
81   - if candidate:
82   - return candidate
83   - venv_candidate = Path(sys.executable).absolute().parent / "ct2-transformers-converter"
84   - if venv_candidate.exists():
85   - return str(venv_candidate)
86   - raise RuntimeError(
87   - "ct2-transformers-converter was not found. "
88   - "Ensure ctranslate2 is installed in the active translator environment."
89   - )
  78 +def _detect_local_model_type(model_dir: str) -> Optional[str]:
  79 + config_path = Path(model_dir).expanduser() / "config.json"
  80 + if not config_path.exists():
  81 + return None
  82 + try:
  83 + with open(config_path, "r", encoding="utf-8") as handle:
  84 + payload = json.load(handle) or {}
  85 + except Exception as exc:
  86 + logger.warning("Failed to inspect local translation config %s: %s", config_path, exc)
  87 + return None
  88 + model_type = str(payload.get("model_type") or "").strip().lower()
  89 + return model_type or None
90 90  
91 91  
92 92 class LocalCTranslate2TranslationBackend:
... ... @@ -144,6 +144,7 @@ class LocalCTranslate2TranslationBackend:
144 144 self.ct2_decoding_length_extra = int(ct2_decoding_length_extra)
145 145 self.ct2_decoding_length_min = max(1, int(ct2_decoding_length_min))
146 146 self._tokenizer_lock = threading.Lock()
  147 + self._local_model_source = self._resolve_local_model_source()
147 148 self._load_runtime()
148 149  
149 150 @property
... ... @@ -151,10 +152,44 @@ class LocalCTranslate2TranslationBackend:
151 152 return True
152 153  
153 154 def _tokenizer_source(self) -> str:
154   - return self.model_dir if os.path.exists(self.model_dir) else self.model_id
  155 + return self._local_model_source or self.model_id
155 156  
156 157 def _model_source(self) -> str:
157   - return self.model_dir if os.path.exists(self.model_dir) else self.model_id
  158 + return self._local_model_source or self.model_id
  159 +
  160 + def _expected_local_model_types(self) -> Optional[set[str]]:
  161 + return None
  162 +
  163 + def _resolve_local_model_source(self) -> Optional[str]:
  164 + model_path = Path(self.model_dir).expanduser()
  165 + if not model_path.exists():
  166 + return None
  167 + if not (model_path / "config.json").exists():
  168 + logger.warning(
  169 + "Local translation model_dir is incomplete | model=%s model_dir=%s missing=config.json fallback=model_id",
  170 + self.model,
  171 + model_path,
  172 + )
  173 + return None
  174 +
  175 + expected_types = self._expected_local_model_types()
  176 + if not expected_types:
  177 + return str(model_path)
  178 +
  179 + detected_type = _detect_local_model_type(str(model_path))
  180 + if detected_type is None:
  181 + return str(model_path)
  182 + if detected_type in expected_types:
  183 + return str(model_path)
  184 +
  185 + logger.warning(
  186 + "Local translation model_dir has unexpected model_type | model=%s model_dir=%s detected=%s expected=%s fallback=model_id",
  187 + self.model,
  188 + model_path,
  189 + detected_type,
  190 + sorted(expected_types),
  191 + )
  192 + return None
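Combining the completeness and type checks, the fallback decision reduces to this sketch (`resolve_local_source` is a hypothetical free function; the real method consults `_expected_local_model_types` on the backend subclass and logs a warning before falling back):

```python
import json
import tempfile
from pathlib import Path


def resolve_local_source(model_dir, model_id, expected_types=None):
    """Prefer model_dir only when it looks complete and type-compatible."""
    path = Path(model_dir).expanduser()
    if not path.exists() or not (path / "config.json").exists():
        return model_id  # missing or incomplete dir: fall back to the hub id
    if expected_types:
        cfg = json.loads((path / "config.json").read_text(encoding="utf-8"))
        detected = str(cfg.get("model_type") or "").strip().lower()
        if detected and detected not in expected_types:
            return model_id  # wrong architecture on disk: fall back
    return str(path)


with tempfile.TemporaryDirectory() as tmp:
    # A LED checkpoint sitting where an NLLB (m2m_100) model is expected.
    (Path(tmp) / "config.json").write_text('{"model_type": "led"}', encoding="utf-8")
    src = resolve_local_source(tmp, "facebook/nllb-200-distilled-600M", {"m2m_100", "nllb_moe"})
assert src == "facebook/nllb-200-distilled-600M"
```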
158 193  
159 194 def _tokenizer_kwargs(self) -> Dict[str, object]:
160 195 return {}
... ... @@ -204,7 +239,6 @@ class LocalCTranslate2TranslationBackend:
204 239 )
205 240  
206 241 ct2_path.parent.mkdir(parents=True, exist_ok=True)
207   - converter = _resolve_converter_binary()
208 242 logger.info(
209 243 "Converting translation model to CTranslate2 | name=%s source=%s output=%s quantization=%s",
210 244 self.model,
... ... @@ -213,25 +247,14 @@ class LocalCTranslate2TranslationBackend:
213 247 self.ct2_conversion_quantization,
214 248 )
215 249 try:
216   - subprocess.run(
217   - [
218   - converter,
219   - "--model",
220   - model_source,
221   - "--output_dir",
222   - str(ct2_path),
223   - "--quantization",
224   - self.ct2_conversion_quantization,
225   - ],
226   - check=True,
227   - stdout=subprocess.PIPE,
228   - stderr=subprocess.PIPE,
229   - text=True,
  250 + convert_transformers_model(
  251 + model_source,
  252 + str(ct2_path),
  253 + self.ct2_conversion_quantization,
230 254 )
231   - except subprocess.CalledProcessError as exc:
232   - stderr = exc.stderr.strip()
  255 + except Exception as exc:
233 256 raise RuntimeError(
234   - f"Failed to convert model '{self.model}' to CTranslate2: {stderr or exc}"
  257 + f"Failed to convert model '{self.model}' to CTranslate2: {exc}"
235 258 ) from exc
236 259  
237 260 def _normalize_texts(self, text: Union[str, Sequence[str]]) -> List[str]:
... ... @@ -557,6 +580,9 @@ class MarianCTranslate2TranslationBackend(LocalCTranslate2TranslationBackend):
557 580 f"Model '{self.model}' only supports target languages: {sorted(self.target_langs)}"
558 581 )
559 582  
  583 + def _expected_local_model_types(self) -> Optional[set[str]]:
  584 + return {"marian"}
  585 +
560 586  
561 587 class NLLBCTranslate2TranslationBackend(LocalCTranslate2TranslationBackend):
562 588 """Local backend for NLLB models on CTranslate2."""
... ... @@ -619,6 +645,9 @@ class NLLBCTranslate2TranslationBackend(LocalCTranslate2TranslationBackend):
619 645 if resolve_nllb_language_code(target_lang, self.language_codes) is None:
620 646 raise ValueError(f"Unsupported NLLB target language: {target_lang}")
621 647  
  648 + def _expected_local_model_types(self) -> Optional[set[str]]:
  649 + return {"m2m_100", "nllb_moe"}
  650 +
622 651 def _get_tokenizer_for_source(self, source_lang: str):
623 652 src_code = resolve_nllb_language_code(source_lang, self.language_codes)
624 653 if src_code is None:
... ...
translation/ct2_conversion.py 0 → 100644
... ... @@ -0,0 +1,52 @@
  1 +"""Helpers for converting Hugging Face translation models to CTranslate2."""
  2 +
  3 +from __future__ import annotations
  4 +
  5 +import copy
  6 +import logging
  7 +
  8 +logger = logging.getLogger(__name__)
  9 +
  10 +
  11 +def convert_transformers_model(
  12 + model_name_or_path: str,
  13 + output_dir: str,
  14 + quantization: str,
  15 + *,
  16 + force: bool = False,
  17 +) -> str:
  18 + from ctranslate2.converters import TransformersConverter
  19 + from transformers import AutoConfig
  20 +
  21 + class _CompatibleTransformersConverter(TransformersConverter):
  22 + def load_model(self, model_class, resolved_model_name_or_path, **kwargs):
  23 + try:
  24 + return super().load_model(model_class, resolved_model_name_or_path, **kwargs)
  25 + except TypeError as exc:
  26 + if "unexpected keyword argument 'dtype'" not in str(exc):
  27 + raise
  28 + if kwargs.get("dtype") is None and kwargs.get("torch_dtype") is None:
  29 + raise
  30 +
  31 + logger.warning(
  32 + "Retrying CTranslate2 model load without dtype hints | model=%s class=%s",
  33 + resolved_model_name_or_path,
  34 + getattr(model_class, "__name__", model_class),
  35 + )
  36 + retry_kwargs = dict(kwargs)
  37 + retry_kwargs.pop("dtype", None)
  38 + retry_kwargs.pop("torch_dtype", None)
  39 + config = retry_kwargs.get("config")
  40 + if config is None:
  41 + config = AutoConfig.from_pretrained(resolved_model_name_or_path)
  42 + else:
  43 + config = copy.deepcopy(config)
  44 + if hasattr(config, "dtype"):
  45 + config.dtype = None
  46 + if hasattr(config, "torch_dtype"):
  47 + config.torch_dtype = None
  48 + retry_kwargs["config"] = config
  49 + return super().load_model(model_class, resolved_model_name_or_path, **retry_kwargs)
  50 +
  51 + converter = _CompatibleTransformersConverter(model_name_or_path)
  52 + return converter.convert(output_dir=output_dir, quantization=quantization, force=force)
... ...
translation/service.py
... ... @@ -31,7 +31,12 @@ class TranslationService:
31 31 if not self._enabled_capabilities:
32 32 raise ValueError("No enabled translation backends found in services.translation.capabilities")
33 33 self._translation_cache = TranslationCache(self.config["cache"])
34   - self._backends = self._initialize_backends()
  34 + self._backends: Dict[str, TranslationBackendProtocol] = {}
  35 + self._backend_errors: Dict[str, str] = {}
  36 + self._initialize_backends()
  37 + if not self._backends:
  38 + details = ", ".join(f"{name}: {err}" for name, err in sorted(self._backend_errors.items())) or "unknown error"
  39 + raise RuntimeError(f"No translation backends could be initialized: {details}")
35 40  
36 41 def _collect_enabled_capabilities(self) -> Dict[str, Dict[str, object]]:
37 42 enabled: Dict[str, Dict[str, object]] = {}
... ... @@ -62,24 +67,47 @@ class TranslationService:
62 67 raise ValueError(f"Unsupported translation backend '{backend_type}' for capability '{name}'")
63 68 return factory(name=name, cfg=cfg)
64 69  
65   - def _initialize_backends(self) -> Dict[str, TranslationBackendProtocol]:
66   - backends: Dict[str, TranslationBackendProtocol] = {}
67   - for name, capability_cfg in self._enabled_capabilities.items():
68   - backend_type = str(capability_cfg["backend"])
69   - logger.info("Initializing translation backend | model=%s backend=%s", name, backend_type)
70   - backends[name] = self._create_backend(
  70 + def _load_backend(self, name: str) -> Optional[TranslationBackendProtocol]:
  71 + capability_cfg = self._enabled_capabilities.get(name)
  72 + if capability_cfg is None:
  73 + return None
  74 + if name in self._backends:
  75 + return self._backends[name]
  76 +
  77 + backend_type = str(capability_cfg["backend"])
  78 + logger.info("Initializing translation backend | model=%s backend=%s", name, backend_type)
  79 + try:
  80 + backend = self._create_backend(
71 81 name=name,
72 82 backend_type=backend_type,
73 83 cfg=capability_cfg,
74 84 )
75   - logger.info(
76   - "Translation backend initialized | model=%s backend=%s use_cache=%s backend_model=%s",
  85 + except Exception as exc:
  86 + error_text = str(exc).strip() or exc.__class__.__name__
  87 + self._backend_errors[name] = error_text
  88 + logger.error(
  89 + "Translation backend initialization failed | model=%s backend=%s error=%s",
77 90 name,
78 91 backend_type,
79   - bool(capability_cfg.get("use_cache")),
80   - getattr(backends[name], "model", name),
  92 + error_text,
  93 + exc_info=True,
81 94 )
82   - return backends
  95 + return None
  96 +
  97 + self._backends[name] = backend
  98 + self._backend_errors.pop(name, None)
  99 + logger.info(
  100 + "Translation backend initialized | model=%s backend=%s use_cache=%s backend_model=%s",
  101 + name,
  102 + backend_type,
  103 + bool(capability_cfg.get("use_cache")),
  104 + getattr(backend, "model", name),
  105 + )
  106 + return backend
  107 +
  108 + def _initialize_backends(self) -> None:
  109 + for name, capability_cfg in self._enabled_capabilities.items():
  110 + self._load_backend(name)
83 111  
84 112 def _create_qwen_mt_backend(self, *, name: str, cfg: Dict[str, object]) -> TranslationBackendProtocol:
85 113 from translation.backends.qwen_mt import QwenMTTranslationBackend
... ... @@ -178,13 +206,27 @@ class TranslationService:
178 206 def loaded_models(self) -> List[str]:
179 207 return list(self._backends.keys())
180 208  
  209 + @property
  210 + def failed_models(self) -> List[str]:
  211 + return list(self._backend_errors.keys())
  212 +
  213 + @property
  214 + def backend_errors(self) -> Dict[str, str]:
  215 + return dict(self._backend_errors)
  216 +
181 217 def get_backend(self, model: Optional[str] = None) -> TranslationBackendProtocol:
182 218 normalized = normalize_translation_model(self.config, model)
183   - backend = self._backends.get(normalized)
  219 + backend = self._backends.get(normalized) or self._load_backend(normalized)
184 220 if backend is None:
185   - raise ValueError(
186   - f"Translation model '{normalized}' is not enabled. "
187   - f"Available models: {', '.join(self.available_models) or 'none'}"
  221 + if normalized not in self._enabled_capabilities:
  222 + raise ValueError(
  223 + f"Translation model '{normalized}' is not enabled. "
  224 + f"Available models: {', '.join(self.available_models) or 'none'}"
  225 + )
  226 + error_text = self._backend_errors.get(normalized) or "unknown initialization error"
  227 + raise RuntimeError(
  228 + f"Translation model '{normalized}' failed to initialize: {error_text}. "
  229 + f"Loaded models: {', '.join(self.loaded_models) or 'none'}"
188 230 )
189 231 return backend
190 232  
... ...
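The lazy-load and error-capture flow in `_load_backend` / `get_backend` reduces to this sketch (`MiniService` and its factory callables are hypothetical stand-ins, not the real `TranslationService` API):

```python
class MiniService:
    """Sketch: keep healthy backends, record failures, re-raise on access."""

    def __init__(self, factories):
        self._factories = factories
        self._backends, self._errors = {}, {}
        for name in factories:
            self._load(name)
        if not self._backends:
            raise RuntimeError("No translation backends could be initialized")

    def _load(self, name):
        if name in self._backends:
            return self._backends[name]
        try:
            backend = self._factories[name]()
        except Exception as exc:
            # Remember why it failed instead of taking the whole service down.
            self._errors[name] = str(exc) or exc.__class__.__name__
            return None
        self._backends[name] = backend
        self._errors.pop(name, None)
        return backend

    def get_backend(self, name):
        # The retry via _load gives a failed backend another chance on access.
        backend = self._backends.get(name) or self._load(name)
        if backend is None:
            raise RuntimeError(f"'{name}' failed to initialize: {self._errors[name]}")
        return backend


def _broken():
    raise RuntimeError("broken model dir")


svc = MiniService({"llm": lambda: "llm-backend", "broken-nllb": _broken})
assert sorted(svc._backends) == ["llm"]
assert svc._errors["broken-nllb"] == "broken model dir"
```

One failing capability no longer aborts construction; the service only raises when every backend fails, or when a caller asks for the broken one by name.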