Compare View

Commits (19)
  • Previously, both `b` and `k1` were set to `0.0`. The original intention
    was to avoid two common issues in e-commerce search relevance:
    
    1. Over-penalizing longer product titles
       In product search, a shorter title should not automatically rank
    higher just because BM25 favors shorter fields. For example, for a query
    like “遥控车”, a product whose title is simply “遥控车” is not
    necessarily a better candidate than a product with a slightly longer but
    more descriptive title. In practice, extremely short titles may even
    indicate lower-quality catalog data.
    
    2. Over-rewarding repeated occurrences of the same term
       For longer queries such as “遥控喷雾翻滚多功能车玩具车”, the default
    BM25 behavior may give too much weight to a term that appears multiple
    times (for example “遥控”), even when other important query terms such
    as “喷雾” or “翻滚” are missing. This can cause products with repeated
    partial matches to outrank products that actually cover more of the user
    intent.
    
    Setting both parameters to zero was an intentional way to suppress
    length normalization and term-frequency amplification. However, after
    introducing a `combined_fields` query, this configuration becomes too
    aggressive. Since `combined_fields` scores multiple fields as a unified
    relevance signal, completely disabling both effects may also remove
    useful ranking information, especially when we still want documents
    matching more query terms across fields to be distinguishable from
    weaker matches.
    
    This update therefore relaxes the previous setting and reintroduces a
    controlled amount of BM25 normalization/scoring behavior. The goal is to
    keep the original intent — avoiding short-title bias and excessive
    repeated-term gain — while allowing the combined query to better
    preserve meaningful relevance differences across candidates.
    
    Expected effect:
    - reduce the bias toward unnaturally short product titles
    - limit score inflation caused by repeated occurrences of the same term
    - improve ranking stability for `combined_fields` queries
    - better reward candidates that cover more of the overall query intent,
      instead of those that only repeat a subset of terms
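The relaxed setting can be registered as a custom similarity in the index settings. A minimal sketch, assuming a hypothetical similarity name `title_bm25_soft` and purely illustrative `k1`/`b` values (the commit does not state the final numbers):

```python
# Sketch: a softened BM25 similarity that damps, rather than disables,
# length normalization (b) and term-frequency saturation (k1).
# The concrete k1/b values here are illustrative assumptions.
def build_title_similarity_settings(k1: float = 0.3, b: float = 0.25) -> dict:
    """Return ES index settings registering a softened BM25 similarity."""
    return {
        "settings": {
            "index": {
                "similarity": {
                    "title_bm25_soft": {
                        "type": "BM25",
                        "k1": k1,  # small k1: repeated terms saturate quickly
                        "b": b,    # small b: mild penalty for long titles
                    }
                }
            }
        }
    }

settings = build_title_similarity_settings()
```

Fields scored by `combined_fields` would then reference this similarity in the mapping, restoring a controlled amount of both effects instead of zeroing them out.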
    tangwang
     
  • tangwang
     
  • tangwang
     
  • tangwang
     
  • tangwang
     
  • tangwang
     
Field generation

- Added taxonomy attribute enrichment, following the same field structure and processing logic as enriched_attributes; only the prompt and the parsed dimensions differ
- Introduced the AnalysisSchema abstraction so content enrichment (content) and taxonomy enrichment (taxonomy) share batching, caching, prompt building, Markdown parsing, and normalization
- Refactored the existing enrichment pipeline in product_enrich.py, extracting common logic into functions such as
  _process_batch_for_schema and _parse_markdown_to_attributes, eliminating code duplication
- Added the taxonomy prompt template (TAXONOMY_ANALYSIS_PROMPT) and Markdown header definitions (TAXONOMY_HEADERS) to product_enrich_prompts.py
- Fixed the Markdown parser's behavior on empty cells: the previous implementation skipped empty cells and misaligned columns; empty values are now preserved so sparse taxonomy attribute columns align correctly
- Updated build_index_content_fields in document_transformer.py to write
  enriched_taxonomy_attributes (zh/en) into the final index document
- Adjusted the related unit tests (test_product_enrich_partial_mode.py and
  others) to cover the new field paths; all pass (14 passed)

Technical details:
- AnalysisSchema carries metadata such as schema_name, prompt_template, headers, and field_name_prefix
- Cache keys distinguish content from taxonomy: `enrich:{schema_name}:{product_id}`, avoiding cache pollution
- Taxonomy parsing uses the same nested structure as enriched_attributes:
  `{"attribute_key": "value"}`, with multi-row table support
- Batch size and retry logic are unchanged from content enrichment
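The schema-scoped cache key above can be sketched as follows; the `AnalysisSchema` fields mirror the metadata listed, but the class body here is a simplified stand-in, not the production abstract class:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AnalysisSchema:
    # Simplified stand-in for the metadata listed above.
    schema_name: str
    prompt_template: str
    headers: List[str]
    field_name_prefix: str

def enrich_cache_key(schema: AnalysisSchema, product_id: str) -> str:
    # content and taxonomy keys can never collide: schema_name differs.
    return f"enrich:{schema.schema_name}:{product_id}"

content_schema = AnalysisSchema("content", "...", ["attribute", "value"], "enriched")
taxonomy_schema = AnalysisSchema("taxonomy", "...", ["attribute", "value"], "enriched_taxonomy")
```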
    tangwang
     
- The `/indexer/enrich-content` route now returns `enriched_taxonomy_attributes` together with `enriched_attributes`
- New optional request parameter `analysis_kinds` (default `["content",
  "taxonomy"]`) lets callers choose which analysis types to run, leaving room for future extension and cost control
- Refactored the caching strategy: the `content` and `taxonomy` analyses now have fully isolated caches, and the cache
  key includes the prompt template, headers, and output field definitions (the schema
  fingerprint), so changes to prompts or parsing rules invalidate entries automatically
- The cache key depends only on the fields that actually enter the LLM
  input (`title`, `brief`, `description`); `image_url`, `tenant_id`, and `spu_id`
  no longer pollute the cache key, improving hit rates
- Updated the API docs
  (`docs/搜索API对接指南-05-索引接口(Indexer).md`) to describe the new parameter and returned fields

Technical details:
- Route layer: the enrich-content endpoint in `api/routes/indexer.py` explicitly adds the
  `enriched_taxonomy_attributes` field returned by `product_enrich.enrich_products_batch`
  to the HTTP response body
- The `analysis_kinds` parameter is passed through to the underlying
  `enrich_products_batch`, allowing one analysis type to be skipped on demand (e.g. fewer
  LLM calls when only taxonomy is needed)
- Cache fingerprint computation lives in `_get_cache_key` in `product_enrich.py` and is generated
  independently per `AnalysisSchema`; the version is included implicitly via `schema.version` or a
  hash of the prompt content
- Test coverage: added `analysis_kinds` combination scenarios and cache-isolation tests
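The fingerprinted cache key described above can be sketched like this; the helper name, hash truncation lengths, and payload layout are illustrative assumptions, not the production `_get_cache_key`:

```python
import hashlib
import json

# Only these fields reach the LLM, so only they participate in the key;
# image_url / tenant_id / spu_id are deliberately excluded.
LLM_INPUT_FIELDS = ("title", "brief", "description")

def build_cache_key(schema_name: str, prompt_template: str, headers, item: dict) -> str:
    # Schema fingerprint covers the prompt and headers, so editing either
    # invalidates old cache entries automatically.
    schema_fp = hashlib.sha256(
        json.dumps({"prompt": prompt_template, "headers": list(headers)},
                   sort_keys=True).encode("utf-8")
    ).hexdigest()[:12]
    payload = {k: item.get(k, "") for k in LLM_INPUT_FIELDS}
    input_fp = hashlib.sha256(
        json.dumps(payload, sort_keys=True, ensure_ascii=False).encode("utf-8")
    ).hexdigest()[:16]
    return f"enrich:{schema_name}:{schema_fp}:{input_fp}"
```

Because `tenant_id` and `image_url` are outside the payload, two tenants enriching the same title share a cache entry, while a prompt change produces a different key.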
    tangwang
     
category_taxonomy_profile

- The old analysis_kinds conflated "enrichment type" (content/taxonomy) with
  "category-specific configuration", which made it hard to extend taxonomy analysis
  to other categories (3C, home, etc.)
- New enrichment_scopes parameter: supports generic (general enrichment, producing
  qanchors/enriched_tags/enriched_attributes) and
  category_taxonomy (category enrichment, producing enriched_taxonomy_attributes)
- New category_taxonomy_profile parameter: selects which profile category enrichment
  uses (currently apparel is built in); each profile has its own
  prompt, output column definitions, parsing rules, and cache version
- analysis_kinds is kept as a compatibility alias to avoid breaking existing callers
- Refactored the internal taxonomy analysis into a profile registry pattern: a new
  _get_taxonomy_schema(profile_name) function returns the matching
  AnalysisSchema per profile
- Cache keys are now isolated by "analysis type + profile + schema fingerprint +
  input field hash", so different categories and prompt versions invalidate automatically
- Updated the API docs and microservice interface docs with the new parameter semantics and usage examples

Technical details:
- Entry point: the enrich-content endpoint in api/routes/indexer.py parses the new
  parameters and passes them down
- Core logic: enrich_products_batch in indexer/product_enrich.py gains a
  profile parameter; _process_batch_for_schema fetches the schema dynamically based on
  scope and profile
- Compatibility layer: if a request also provides analysis_kinds, it is mapped to
  enrichment_scopes (content→generic, taxonomy→category_taxonomy), with category_taxonomy_profile
  defaulting to "apparel"
- Test coverage: added enrichment_scopes combinations, profile switching, and compatibility-mode tests
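The compatibility mapping (analysis_kinds → enrichment_scopes) can be sketched as a small resolver; `resolve_enrichment_scopes` here is a hypothetical free-function version of the rule, not the production method:

```python
# Sketch of the compatibility resolution: explicit enrichment_scopes wins,
# the deprecated analysis_kinds alias is mapped, and the default runs both.
_KIND_TO_SCOPE = {"content": "generic", "taxonomy": "category_taxonomy"}

def resolve_enrichment_scopes(enrichment_scopes=None, analysis_kinds=None):
    if enrichment_scopes:
        return list(enrichment_scopes)
    if analysis_kinds:
        return [_KIND_TO_SCOPE[kind] for kind in analysis_kinds]
    return ["generic", "category_taxonomy"]
```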
    tangwang
     
This iteration is a large-scale refactor of the retrieval system's content enrichment module, extending the previously hard-coded apparel-only category support to all categories defined in
taxonomy.md, while restructuring the code to lower the cost of adding new categories. The core design uses a profile
registry pattern, batches by category profile, and explicitly distinguishes bilingual (zh+en) from English-only (en) output strategies.

Changes:

1. Expanded category support
   - Newly supported categories: 3c, bags, pet_supplies, electronics, outdoor, home_appliances, home_living, wigs, beauty, accessories, toys, shoes, sports, others
   - All new categories return only en fields in the taxonomy output, avoiding multilingual field bloat
   - Apparel keeps bilingual output (zh + en), preserving existing business compatibility

2. Core refactoring
   - `indexer/product_enrich.py`
     - Added the `TAXONOMY_PROFILES` registry, defining each category's output
       languages, prompt mapping, and taxonomy field set in a data-driven way
     - Rewrote `_enrich_taxonomy_batch`: LLM calls are batched per profile group,
       avoiding a separate branch per category
     - Introduced `_infer_profile_from_category()`, which infers the profile from the
       SPU's category field (used by the internal indexing path; fixes mixed catalogs
       falling back to apparel by default)
   - `indexer/product_enrich_prompts.py`
     - Refactored the single apparel prompt into a `PROMPT_TEMPLATES` dict storing
       prompts per profile
     - All non-apparel categories share one simplified prompt template that asks for en fields only
   - `indexer/document_transformer.py`
     - Passes category information when building enrichment requests, for downstream profile routing
     - Adjusted `_build_enrich_batch` so batch requests support mixed categories and group them correctly
   - `indexer/indexer.py` (API layer)
     - The `/indexer/enrich-content` request model gains an optional
       `category_profile` field so callers can specify the category explicitly;
       when omitted, the server infers it
     - Updated parameter validation and error handling, including support for fallback categories such as `others`

3. Documentation updates
   - `docs/搜索API对接指南-05-索引接口(Indexer).md`: adds the category profile
     parameter description and notes that non-apparel taxonomy returns en fields only
   - `docs/搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md`: updates the
     enrichment microservice call examples to reflect multi-category grouped batching
   - `taxonomy.md`: adds per-category field lists and states that en is the
     only output for all non-apparel categories

Technical details:

- **Registry design**:
  ```python
  TAXONOMY_PROFILES = {
      "apparel": {"lang": ["zh", "en"], "prompt_key": "apparel",
  "fields": [...]},
      "3c": {"lang": ["en"], "prompt_key": "default", "fields": [...]},
      # ...
  }
  ```
  Adding a category only requires a new registry entry plus a matching
  prompt_key in `PROMPT_TEMPLATES`; the control flow is untouched.

- **Batching by profile**:
  - Previous implementation: all products were mixed together under the single apparel
    prompt, so non-apparel products were filled incorrectly.
  - After the refactor: `_enrich_taxonomy_batch` first groups products by profile,
    builds an independent LLM request per group, and merges the responses back in the
    original order. Group granularity is configurable, avoiding excessive request overhead from tiny groups.

- **Automatic category inference**:
  - For internal indexing (when the enrichment endpoint is not called explicitly),
    `_infer_profile_from_category` parses the SPU's `category_l1/l2/l3`
    fields and maps them to the best-matching
    profile. Mapping is keyword-based (e.g. "手机" -> "3c", "狗粮" -> "pet_supplies"); unmatched categories
    fall back to `apparel` for a smooth transition.
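The keyword-based inference can be sketched as below; the keyword table is illustrative and far smaller than any real mapping, and the function name only mirrors `_infer_profile_from_category`:

```python
# Sketch of keyword-based profile inference over the concatenated
# category_l1/l2/l3 path. The keyword table is an illustrative assumption.
_CATEGORY_KEYWORDS = {
    "3c": ["手机", "phone", "laptop"],
    "pet_supplies": ["狗粮", "cat litter", "pet"],
}

def infer_profile_from_category(category_path: str, default: str = "apparel") -> str:
    text = category_path.lower()
    for profile, keywords in _CATEGORY_KEYWORDS.items():
        if any(kw.lower() in text for kw in keywords):
            return profile
    # Unmatched categories fall back to apparel for a smooth transition.
    return default
```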
    
- **Output field trimming**:
  - Because the `enriched_taxonomy_attributes.value` field in the Elasticsearch
    mapping stores a single value (not split by language), non-apparel LLM
    output is written directly to that field; apparel uses the dynamic template fields
    `value.zh` and `value.en`. `_apply_lang_output` handles both cases uniformly.

- **Code size and maintainability**:
  - Total line count grows slightly (~+180 lines) because of the many new category
    definitions, but conditional branches drop from 5 to 1 (the profile
    lookup). The average cost of a new category is 3 registry lines plus a 10-line
    prompt template, with no changes to the core enrichment loop.

Affected files:
- `indexer/product_enrich.py`
- `indexer/product_enrich_prompts.py`
- `indexer/document_transformer.py`
- `indexer/indexer.py`
- `docs/搜索API对接指南-05-索引接口(Indexer).md`
- `docs/搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md`
- `taxonomy.md`
- `tests/test_product_enrich_partial_mode.py` (adapted for multi-profile test cases)
- `tests/test_llm_enrichment_batch_fill.py`
- `tests/test_process_products_batching.py`

Test validation:
- Ran unit and integration tests: `pytest
  tests/test_product_enrich_partial_mode.py
  tests/test_llm_enrichment_batch_fill.py
  tests/test_process_products_batching.py
  tests/ci/test_service_api_contracts.py`, all passing (52 passed)
- Manually verified the mixed-catalog scenario: submitting apparel and 3c products
  together, the enrichment response returns bilingual output for apparel and en-only
  for 3c, with taxonomy fields filled correctly.
- Compile check: `py_compile` reports no syntax errors in the modified modules.

Notes:
- This refactor does not change existing apparel behavior, and the API is backward
  compatible (requests without a profile are still treated as apparel).
- To add bilingual support for a category later, only the `lang` list in the registry
  and a prompt template need changing; no other logic is touched.
    tangwang
     
2. Removed the automatic taxonomy-profile inference logic; build_index_content_fields()
   now only accepts an explicitly passed category_taxonomy_profile
3. All taxonomy profiles now output both zh and en; the per-industry language-switching
   logic was removed
    tangwang
     
Background:
- The scripts/ directory mixed service startup, data conversion, performance load
  testing, ad-hoc scripts, and historical backup directories
- Leftovers from intermediate iterations made it hard to maintain and for newcomers to understand
- Service orchestration has stabilized on the `service_ctl up all` set: tei / cnclip /
  embedding / embedding-image / translator / reranker / backend /
  indexer / frontend / eval-web; the reranker-fine default slot is no longer kept

Changes:
1. The root scripts/ directory is narrowed to runtime, ops, environment, and
   data-processing scripts, with a new scripts/README.md
2. Performance/load-testing/tuning scripts moved wholesale to benchmarks/, with
   benchmarks/README.md updated accordingly
3. Manual trial-run scripts moved to tests/manual/, with tests/manual/README.md updated accordingly
4. Clearly obsolete content deleted:
   - scripts/indexer__old_2025_11/
   - scripts/start.sh
   - scripts/install_server_deps.sh
5. Paths and stale descriptions fixed in the following docs:
   - the root README.md
   - performance-report documents
   - reranker/translation module docs

Technical details:
- Why performance tests do not live under the regular tests/: these scripts depend on
  real services, GPUs, models, and environmental noise, so they are unsuitable as a
  stable regression gate; benchmarks/ fits their purpose better
- tests/manual/ holds only trial-run scripts that need manually started dependencies and human inspection of results
- All migrated Python scripts pass py_compile syntax checks
- All migrated shell scripts pass bash -n syntax checks

Verification:
- py_compile: pass
- bash -n: pass
    tangwang
     
- Data conversion moved to scripts/data_import/README.md
  - Diagnostics/inspection moved to scripts/inspect/README.md
  - Ops helpers moved to scripts/ops/README.md
  - The frontend helper service moved to scripts/frontend/frontend_server.py
  - Translation model download moved to scripts/translation/download_translation_models.py
  - The ad-hoc image-embedding backfill script consolidated into
    scripts/maintenance/embed_tenant_image_urls.py
  - The Redis monitoring script merged under redis/, now scripts/redis/monitor_eviction.py

  All real call sites were updated to the new locations:

  - scripts/start_frontend.sh
  - scripts/start_cnclip_service.sh
  - scripts/service_ctl.sh
  - scripts/setup_translator_venv.sh
  - scripts/README.md

  Paths referencing these scripts in the docs were fixed as well, mainly docs/QUICKSTART.md and
  translation/README.md.
    tangwang
     
  • tangwang
     
2. Added `service_enabled_by_config()`:
   if reranker | reranker-fine | translator is disabled in config, `run.sh all` does not start that service
    tangwang
     
  • tangwang
     
  • tangwang
     
 Background and problem
- The existing coarse/fine reranking relies on the `knn_query` and `image_knn_query` scores, but both come from ANN recall; not every document entering the rerank_window (160) was recalled by both the text and image vector paths, so some documents get a score of 0, destabilizing the fusion formula.
- Simply increasing the ANN k cannot guarantee that documents brought in by lexical recall also carry both vector scores; a second query, or pulling vectors back for local computation, adds overhead and implementation complexity.

 Solution
Use the ES rescore mechanism: within the first search's `window_size`, run an exact vector script_score for each document and attach the scores as named queries in `matched_queries`, to be used preferentially by the subsequent coarse/rerank stages.

**Design decisions**:
- **Fill in scores without changing order**: rescore uses `score_mode: total` with `rescore_query_weight: 0.0`, so the original `_score` is unchanged; the existing ranking logic is untouched and the risk is minimal.
- **Named exact scores**: `exact_text_knn_query` and `exact_image_knn_query`, easy for clients to recognize and fall back from.
- **Configurable**: the `exact_knn_rescore_enabled` switch and `exact_knn_rescore_window` control the window size, default 160.

 Implementation details

 1. Config extension (`config/config.yaml`, `config/loader.py`)
```yaml
exact_knn_rescore_enabled: true
exact_knn_rescore_window: 160
```
New config entries, injected into `RerankConfig`.

 2. Searcher builds the rescore query (`search/searcher.py`)
- In `_build_es_search_request`, when `enable_rerank=True` and the config switch is on, construct the rescore object:
  - `window_size` = `exact_knn_rescore_window`
  - `query` is a `bool` query embedding two `script_score` sub-queries that compute the dot-product similarity of the text and image vectors respectively:
    ```painless
    // exact_text_knn_query
    (dotProduct(params.query_vector, 'title_embedding') + 1.0) / 2.0
    // exact_image_knn_query
    (dotProduct(params.image_query_vector, 'image_embedding.vector') + 1.0) / 2.0
    ```
  - Each `script_score` sets `_name` to the corresponding named query.
- Note: the current script scores are **not yet multiplied by knn_text_boost / knn_image_boost**; aligning their scale with the original ANN scores is a follow-up item.

 3. RerankClient prefers exact scores (`search/rerank_client.py`)
- In `_extract_coarse_signals`, read the `exact_text_knn_query` and `exact_image_knn_query` scores from the document's `matched_queries`.
- If present and valid, use them as `text_knn_score` / `image_knn_score` and mark `text_knn_source='exact_text_knn_query'`.
- If absent, fall back to the original `knn_query` / `image_knn_query` (ANN scores).
- The original ANN scores are kept as `approx_text_knn_score` / `approx_image_knn_score` for debugging comparisons.
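The exact-first fallback rule can be sketched as a small helper; `pick_knn_score` is a hypothetical simplification of the `_extract_coarse_signals` logic, treating `matched_queries` as a name-to-score dict:

```python
# Sketch: prefer the exact rescore score, fall back to the ANN recall score.
# Returns the chosen score plus its source name for debug output.
def pick_knn_score(matched_queries: dict, exact_name: str, ann_name: str):
    exact = matched_queries.get(exact_name)
    if exact is not None:
        return exact, exact_name
    # Documents outside both recall paths still get a defined 0.0 score.
    return matched_queries.get(ann_name, 0.0), ann_name

score, source = pick_knn_score(
    {"exact_text_knn_query": 0.82, "knn_query": 0.74},
    "exact_text_knn_query", "knn_query",
)
```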
    
 4. Debug info enhancement
- `debug_info.per_result[*].ranking_funnel.coarse_rank.signals` now outputs the exact scores, fallback scores, and source markers, making online coverage and value distribution easy to observe.

 Verification
- Unit tests `tests/test_rerank_client.py` and `tests/test_search_rerank_window.py` pass, covering exact-score priority, config parsing, and the ES request body structure.
- Sampling real online queries (6 queries, top160) shows:
  - **Exact coverage reaches 100%** (both text and image scored), fixing the partial-miss problem of ANN.
  - However, exact scores differ in magnitude from the original ANN scores (median ANN/exact ratio around 4.1x), because the exact script does not multiply in the boost factors.
- Current ranking impact: coarse top10 overlap drops to as low as 1/10, with maximum rank drift above 100.

 Follow-up plan
1. Align the exact and ANN score scales: multiply `knn_text_boost` / `knn_image_boost` into the script_score, with an extra 1.4x for long queries.
2. Re-evaluate top10 overlap and drift; if they converge, the coarse fusion formula can be migrated wholesale into the ES rescore stage.
3. The current version keeps the safe "fill in scores without changing order" strategy and already resolves the core missing-score problem.

 Files involved
- `config/config.yaml`
- `config/loader.py`
- `search/searcher.py`
- `search/rerank_client.py`
- `tests/test_rerank_client.py`
- `tests/test_search_rerank_window.py`
    tangwang
     
 Changes

1. **New config entries** (`config/config.yaml`)
   - `exact_knn_rescore_enabled`: whether exact vector rescoring is enabled, default true
   - `exact_knn_rescore_window`: rescoring window size, default 160 (decoupled from rerank_window, independently configurable)

2. **ES query layer changes** (`search/searcher.py`)
   - In the first ES search, inject a rescore phase for documents within window_size according to the config
   - rescore_query contains two named script_score clauses:
     - `exact_text_knn_query`: exact dot product on the text vector
     - `exact_image_knn_query`: exact dot product on the image vector
   - Currently uses `score_mode=total` with `rescore_query_weight=0.0`: **scores are filled in without changing the order**, and the exact scores only appear in `matched_queries`

3. **Unified vector-score boost logic** (`search/es_query_builder.py`)
   - Added a `_get_knn_plan()` method that centralizes the boost rules for text/image KNN
   - For long queries (token count above the threshold), the text boost is multiplied by an extra 1.4x
   - Exact rescore and ANN recall **share the same boost rules**, keeping score scales consistent
   - The existing ANN query-building logic was migrated to this unified entry point

4. **Score priority in the fusion stage** (`search/rerank_client.py`)
   - `_build_hit_signal_bundle()` now handles vector-score reading in one place
   - Prefer `exact_text_knn_query` / `exact_image_knn_query` from `matched_queries`
   - Fall back to the original `knn_query` / `image_knn_query` (ANN scores) if absent
   - Covers the coarse_rank, fine_rank, and rerank stages, avoiding duplicated patches

5. **Test coverage**
   - `tests/test_es_query_builder.py`: verifies ANN and exact share the boost rules
   - `tests/test_search_rerank_window.py`: verifies the rescore window and named queries are injected correctly
   - `tests/test_rerank_client.py`: verifies exact-first with ANN fallback
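The rescore request described in item 2 can be sketched as a body builder; the inner `match_all` wrapper and clause layout are assumptions consistent with the description, not the literal `searcher.py` code:

```python
# Sketch of the score-only rescore body: score_mode "total" with
# rescore_query_weight 0.0 leaves _score unchanged while the named
# script_score clauses surface their values in matched_queries.
def build_exact_knn_rescore(window_size, text_vector, image_vector):
    def clause(name, field, param, vector):
        return {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": f"(dotProduct(params.{param}, '{field}') + 1.0) / 2.0",
                    "params": {param: vector},
                },
                "_name": name,  # surfaces the score as a named query
            }
        }

    return {
        "window_size": window_size,
        "query": {
            "rescore_query": {
                "bool": {
                    "should": [
                        clause("exact_text_knn_query", "title_embedding",
                               "query_vector", text_vector),
                        clause("exact_image_knn_query", "image_embedding.vector",
                               "image_query_vector", image_vector),
                    ]
                }
            },
            "score_mode": "total",
            "rescore_query_weight": 0.0,  # add scores without touching _score
        },
    }
```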
    
 Technical details

- **Exact vector scoring script** (Painless)
  ```painless
  // text: (dotProduct + 1.0) / 2.0
  (dotProduct(params.query_vector, 'title_embedding') + 1.0) / 2.0
  // image is analogous, with field 'image_embedding.vector'
  ```
  Multiplied by the unified boost (from `knn_text_boost` / `knn_image_boost` in the config plus the long-query amplification factor).
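The unified boost rule can be sketched as follows; the function only mirrors `_get_knn_plan`, and the defaults are taken from the sample configuration in this commit message:

```python
# Sketch of a single boost plan shared by ANN recall and exact rescore,
# so both score paths stay on the same scale. Defaults mirror the sample
# configuration; the real method lives in es_query_builder.py.
def knn_boost_plan(query_tokens,
                   knn_text_boost=4.0,
                   knn_image_boost=4.0,
                   long_query_token_threshold=8,
                   long_query_text_boost_factor=1.4):
    text_boost = knn_text_boost
    if len(query_tokens) > long_query_token_threshold:
        # Long queries lean harder on the text vector signal.
        text_boost *= long_query_text_boost_factor
    return {"text_boost": text_boost, "image_boost": knn_image_boost}
```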
    
- **Named-query retention mechanism**
  - The main query already enables `include_named_queries_score: true`
  - Scores from the named rescore scripts are merged into each hit's `matched_queries`
  - `_extract_named_score()` extracts them by name, with exactly the same access pattern as the original ANN scores

- **Performance impact** (top160, 6 real queries, average of 3 rounds after warm-up)
  - `elasticsearch_search_primary` latency: 124.71ms → 136.60ms (+11.89ms, +9.53%)
  - `total_search` fluctuates with other components and is not a primary reference
  - The overhead is acceptable; no timeouts or resource bottlenecks observed

 Configuration example

```yaml
search:
  exact_knn_rescore_enabled: true
  exact_knn_rescore_window: 160
  knn_text_boost: 4.0
  knn_image_boost: 4.0
  long_query_token_threshold: 8
  long_query_text_boost_factor: 1.4
```

 Known issues and follow-up plan

- Tuning experiments show that with exact rescore enabled, the main metric for some queries (strong type constraints plus many similar styles/colors) drops by about 0.031 versus the baseline (exact=false): 0.6009 → 0.5697
- Root cause: exact turns KNN from a sparse auxiliary signal into a dense ranking factor, changing the coarse-stage ranking semantics; merely adjusting the existing `knn_bias/exponent` cannot fully recover it
- Next iteration: **do not force exact in the coarse stage**; prefer exact only in fine/rerank, or have coarse use an "ANN first, exact only fills gaps" strategy, then re-evaluate

 Related files

- `config/config.yaml`
- `search/searcher.py`
- `search/es_query_builder.py`
- `search/rerank_client.py`
- `tests/test_es_query_builder.py`
- `tests/test_search_rerank_window.py`
- `tests/test_rerank_client.py`
- `scripts/evaluation/exact_rescore_coarse_tuning_round2.json` (tuning experiment record)
    tangwang
     
Showing 137 changed files
@@ -4,6 +4,7 @@
 ES_HOST=http://localhost:9200
 ES_USERNAME=saas
 ES_PASSWORD=4hOaLaf41y2VuI8y
+ES_AUTH="${ES_USERNAME}:${ES_PASSWORD}"

 # Redis Configuration (Optional) - AI 生产 10.200.16.14:6479
 REDIS_HOST=10.200.16.14
@@ -8,6 +8,7 @@
 ES_HOST=http://localhost:9200
 ES_USERNAME=saas
 ES_PASSWORD=
+ES_AUTH="${ES_USERNAME}:${ES_PASSWORD}"

 # Redis (生产默认 10.200.16.14:6479,密码见 docs/QUICKSTART.md §1.6)
 REDIS_HOST=10.200.16.14
@@ -77,9 +77,11 @@ source activate.sh
 # Generate test data (Tenant1 Mock + Tenant2 CSV)
 ./scripts/mock_data.sh

-# Ingest data to Elasticsearch
-./scripts/ingest.sh <tenant_id> [recreate]   # e.g., ./scripts/ingest.sh 1 true
-python main.py ingest data.csv --limit 1000 --batch-size 50
+# Create tenant index structure
+./scripts/create_tenant_index.sh <tenant_id>
+
+# Build / refresh suggestion index
+./scripts/build_suggestions.sh <tenant_id> --mode incremental
 ```

 ### Running Services
@@ -100,10 +102,10 @@ python main.py serve --host 0.0.0.0 --port 6002 --reload
 # Run all tests
 pytest tests/

-# Run specific test types
-pytest tests/unit/          # Unit tests
-pytest tests/integration/   # Integration tests
-pytest -m "api"             # API tests only
+# Run focused regression sets
+python -m pytest tests/ci -q
+pytest tests/test_rerank_client.py
+pytest tests/test_query_parser_mixed_language.py

 # Test search from command line
 python main.py search "query" --tenant-id 1 --size 10
@@ -114,12 +116,8 @@ python main.py search "query" --tenant-id 1 --size 10
 # Stop all services
 ./scripts/stop.sh

-# Test environment (for CI/development)
-./scripts/start_test_environment.sh
-./scripts/stop_test_environment.sh
-
-# Install server dependencies
-./scripts/install_server_deps.sh
+# Run CI contract tests
+./scripts/run_ci_tests.sh
 ```

 ## Architecture Overview
@@ -585,7 +583,7 @@ GET /admin/stats                # Index statistics
 ./scripts/start_frontend.sh     # Frontend UI (port 6003)

 # Data Operations
-./scripts/ingest.sh <tenant_id> [recreate]    # Index data
+./scripts/create_tenant_index.sh <tenant_id>  # Create tenant index
 ./scripts/mock_data.sh          # Generate test data

 # Testing
@@ -154,7 +154,8 @@ class SearchRequest(BaseModel):
     enable_rerank: Optional[bool] = Field(
         None,
         description=(
-            "是否开启重排(调用外部重排服务对 ES 结果进行二次排序)。"
+            "是否开启最终重排(调用外部 rerank 服务改写上一阶段顺序)。"
+            "关闭时仍保留 coarse/fine 流程,仅在 rerank 阶段保序透传。"
             "不传则使用服务端配置 rerank.enabled(默认开启)。"
         )
     )
api/routes/indexer.py
@@ -7,7 +7,7 @@
 import asyncio
 import re
 from fastapi import APIRouter, HTTPException
-from typing import Any, Dict, List, Optional
+from typing import Any, Dict, List, Literal, Optional
 from pydantic import BaseModel, Field
 import logging
 from sqlalchemy import text
@@ -19,6 +19,11 @@ logger = logging.getLogger(__name__)

 router = APIRouter(prefix="/indexer", tags=["indexer"])

+SUPPORTED_CATEGORY_TAXONOMY_PROFILES = (
+    "apparel, 3c, bags, pet_supplies, electronics, outdoor, "
+    "home_appliances, home_living, wigs, beauty, accessories, toys, shoes, sports, others"
+)
+

 class ReindexRequest(BaseModel):
     """全量重建索引请求"""
@@ -88,11 +93,42 @@ class EnrichContentItem(BaseModel):

 class EnrichContentRequest(BaseModel):
     """
-    内容理解字段生成请求:根据商品标题批量生成 qanchors、enriched_attributes、tags
+    内容理解字段生成请求:根据商品标题批量生成通用增强字段与品类 taxonomy 字段
     供外部 indexer 在自行组织 doc 时调用,与翻译、向量化等微服务并列。
     """
     tenant_id: str = Field(..., description="租户 ID,用于请求路由与结果归属,不参与缓存键")
     items: List[EnrichContentItem] = Field(..., description="待分析的 SPU 列表(spu_id + title,可附带 brief/description/image_url)")
+    enrichment_scopes: Optional[List[Literal["generic", "category_taxonomy"]]] = Field(
+        default=None,
+        description=(
+            "要执行的增强范围。"
+            "`generic` 返回 qanchors/enriched_tags/enriched_attributes;"
+            "`category_taxonomy` 返回 enriched_taxonomy_attributes。"
+            "默认两者都执行。"
+        ),
+    )
+    category_taxonomy_profile: str = Field(
+        "apparel",
+        description=(
+            "品类 taxonomy profile。默认 `apparel`。"
+            f"当前支持:{SUPPORTED_CATEGORY_TAXONOMY_PROFILES}。"
+            "其中除 `apparel` 外,其余 profile 的 taxonomy 输出仅返回 `en`。"
+        ),
+    )
+    analysis_kinds: Optional[List[Literal["content", "taxonomy"]]] = Field(
+        default=None,
+        description="Deprecated alias of enrichment_scopes. `content` -> `generic`, `taxonomy` -> `category_taxonomy`.",
+    )
+
+    def resolved_enrichment_scopes(self) -> List[str]:
+        if self.enrichment_scopes:
+            return list(self.enrichment_scopes)
+        if self.analysis_kinds:
+            mapped = []
+            for item in self.analysis_kinds:
+                mapped.append("generic" if item == "content" else "category_taxonomy")
+            return mapped
+        return ["generic", "category_taxonomy"]


 @router.post("/reindex")
@@ -440,20 +476,31 @@ async def build_docs_from_db(request: BuildDocsFromDbRequest):
         raise HTTPException(status_code=500, detail=f"Internal server error: {str(e)}")


-def _run_enrich_content(tenant_id: str, items: List[Dict[str, str]]) -> List[Dict[str, Any]]:
+def _run_enrich_content(
+    tenant_id: str,
+    items: List[Dict[str, str]],
+    enrichment_scopes: Optional[List[str]] = None,
+    category_taxonomy_profile: str = "apparel",
+) -> List[Dict[str, Any]]:
     """
     同步执行内容理解,返回与 ES mapping 对齐的字段结构。
     语言策略由 product_enrich 内部统一决定,路由层不参与。
     """
     from indexer.product_enrich import build_index_content_fields

-    results = build_index_content_fields(items=items, tenant_id=tenant_id)
+    results = build_index_content_fields(
+        items=items,
+        tenant_id=tenant_id,
+        enrichment_scopes=enrichment_scopes,
+        category_taxonomy_profile=category_taxonomy_profile,
+    )
     return [
         {
             "spu_id": item["id"],
             "qanchors": item["qanchors"],
             "enriched_attributes": item["enriched_attributes"],
             "enriched_tags": item["enriched_tags"],
+            "enriched_taxonomy_attributes": item["enriched_taxonomy_attributes"],
             **({"error": item["error"]} if item.get("error") else {}),
         }
         for item in results
@@ -463,15 +510,15 @@ def _run_enrich_content(tenant_id: str, items: List[Dict[str, str]]) -> List[Dic
 @router.post("/enrich-content")
 async def enrich_content(request: EnrichContentRequest):
     """
-    内容理解字段生成接口:根据商品标题批量生成 qanchors、enriched_attributes、tags
+    内容理解字段生成接口:根据商品标题批量生成通用增强字段与品类 taxonomy 字段

     使用场景:
     - 外部 indexer 采用「微服务组合」方式自己组织 doc 时,可调用本接口获取 LLM 生成的
       锚文本与语义属性,再与翻译、向量化结果合并写入 ES。
     - 与 /indexer/build-docs 解耦,避免 build-docs 因 LLM 耗时过长而阻塞;调用方可
-      先拿不含 qanchors/enriched_tags 的 doc,再异步或离线补齐本接口结果后更新 ES。
+      先拿不含 qanchors/enriched_tags/taxonomy attributes 的 doc,再异步或离线补齐本接口结果后更新 ES。

-    实现逻辑与 indexer.product_enrich.analyze_products 一致,支持多语言与 Redis 缓存。
+    实现逻辑与 indexer.product_enrich.build_index_content_fields 一致,支持多语言与 Redis 缓存。
     """
     try:
         if not request.items:
@@ -493,15 +540,20 @@ async def enrich_content(request: EnrichContentRequest):
             for it in request.items
         ]
         loop = asyncio.get_event_loop()
+        enrichment_scopes = request.resolved_enrichment_scopes()
         result = await loop.run_in_executor(
             None,
             lambda: _run_enrich_content(
                 tenant_id=request.tenant_id,
-                items=items_payload
+                items=items_payload,
+                enrichment_scopes=enrichment_scopes,
+                category_taxonomy_profile=request.category_taxonomy_profile,
             ),
         )
         return {
             "tenant_id": request.tenant_id,
+            "enrichment_scopes": enrichment_scopes,
+            "category_taxonomy_profile": request.category_taxonomy_profile,
             "results": result,
             "total": len(result),
         }
api/translator_app.py
@@ -271,16 +271,20 @@ async def lifespan(_: FastAPI): @@ -271,16 +271,20 @@ async def lifespan(_: FastAPI):
271 """Initialize all enabled translation backends on process startup.""" 271 """Initialize all enabled translation backends on process startup."""
272 logger.info("Starting Translation Service API") 272 logger.info("Starting Translation Service API")
273 service = get_translation_service() 273 service = get_translation_service()
  274 + failed_models = list(getattr(service, "failed_models", []))
  275 + backend_errors = dict(getattr(service, "backend_errors", {}))
274 logger.info( 276 logger.info(
275 - "Translation service ready | default_model=%s default_scene=%s available_models=%s loaded_models=%s", 277 + "Translation service ready | default_model=%s default_scene=%s available_models=%s loaded_models=%s failed_models=%s",
276 service.config["default_model"], 278 service.config["default_model"],
277 service.config["default_scene"], 279 service.config["default_scene"],
278 service.available_models, 280 service.available_models,
279 service.loaded_models, 281 service.loaded_models,
  282 + failed_models,
280 ) 283 )
281 logger.info( 284 logger.info(
282 - "Translation backends initialized on startup | models=%s", 285 + "Translation backends initialized on startup | loaded=%s failed=%s",
283 service.loaded_models, 286 service.loaded_models,
  287 + backend_errors,
284 ) 288 )
285 verbose_logger.info( 289 verbose_logger.info(
286 "Translation startup detail | capabilities=%s cache_ttl_seconds=%s cache_sliding_expiration=%s", 290 "Translation startup detail | capabilities=%s cache_ttl_seconds=%s cache_sliding_expiration=%s",
@@ -316,11 +320,14 @@ async def health_check(): @@ -316,11 +320,14 @@ async def health_check():
316 """Health check endpoint.""" 320 """Health check endpoint."""
317 try: 321 try:
318 service = get_translation_service() 322 service = get_translation_service()
  323 + failed_models = list(getattr(service, "failed_models", []))
  324 + backend_errors = dict(getattr(service, "backend_errors", {}))
319 logger.info( 325 logger.info(
320 - "Health check | default_model=%s default_scene=%s loaded_models=%s", 326 + "Health check | default_model=%s default_scene=%s loaded_models=%s failed_models=%s",
321 service.config["default_model"], 327 service.config["default_model"],
322 service.config["default_scene"], 328 service.config["default_scene"],
323 service.loaded_models, 329 service.loaded_models,
  330 + failed_models,
324 ) 331 )
325 return { 332 return {
326 "status": "healthy", 333 "status": "healthy",
@@ -330,6 +337,8 @@ async def health_check(): @@ -330,6 +337,8 @@ async def health_check():
330 "available_models": service.available_models, 337 "available_models": service.available_models,
331 "enabled_capabilities": get_enabled_translation_models(service.config), 338 "enabled_capabilities": get_enabled_translation_models(service.config),
332 "loaded_models": service.loaded_models, 339 "loaded_models": service.loaded_models,
  340 + "failed_models": failed_models,
  341 + "backend_errors": backend_errors,
333 } 342 }
334 except Exception as e: 343 except Exception as e:
335 logger.error(f"Health check failed: {e}") 344 logger.error(f"Health check failed: {e}")
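The `getattr(..., default)` reads keep `/health` compatible with service objects that predate the new `failed_models` / `backend_errors` attributes. A minimal sketch of that defensive pattern (the `Service` class here is a stand-in, not the real translation service):

```python
# Defensive reads for optional attributes (Service is a stand-in for
# the real translation service object).
class Service:
    loaded_models = ["nllb-200-distilled-600m"]
    # deliberately no failed_models / backend_errors attributes

svc = Service()
failed_models = list(getattr(svc, "failed_models", []))
backend_errors = dict(getattr(svc, "backend_errors", {}))
payload = {
    "status": "healthy",
    "loaded_models": svc.loaded_models,
    "failed_models": failed_models,    # [] when the attribute is absent
    "backend_errors": backend_errors,  # {} when the attribute is absent
}
```

`list(...)`/`dict(...)` also copy the containers, so the response payload cannot alias mutable service state.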
@@ -463,6 +472,10 @@ async def translate(request: TranslationRequest, http_request: Request): @@ -463,6 +472,10 @@ async def translate(request: TranslationRequest, http_request: Request):
463 latency_ms = (time.perf_counter() - request_started) * 1000 472 latency_ms = (time.perf_counter() - request_started) * 1000
464 logger.warning("Translation validation error | error=%s latency_ms=%.2f", e, latency_ms) 473 logger.warning("Translation validation error | error=%s latency_ms=%.2f", e, latency_ms)
465 raise HTTPException(status_code=400, detail=str(e)) from e 474 raise HTTPException(status_code=400, detail=str(e)) from e
  475 + except RuntimeError as e:
  476 + latency_ms = (time.perf_counter() - request_started) * 1000
  477 + logger.warning("Translation backend unavailable | error=%s latency_ms=%.2f", e, latency_ms)
  478 + raise HTTPException(status_code=503, detail=str(e)) from e
466 except Exception as e: 479 except Exception as e:
467 latency_ms = (time.perf_counter() - request_started) * 1000 480 latency_ms = (time.perf_counter() - request_started) * 1000
468 logger.error("Translation error | error=%s latency_ms=%.2f", e, latency_ms, exc_info=True) 481 logger.error("Translation error | error=%s latency_ms=%.2f", e, latency_ms, exc_info=True)
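With the new `except RuntimeError` branch, clients can tell a retryable backend outage (503) apart from a bad request (400) and an unexpected failure (500). A minimal sketch of that except-clause ordering as a plain mapping (helper name hypothetical; the real endpoint raises `HTTPException` inside FastAPI):

```python
# Sketch of the endpoint's error classification (helper name hypothetical).
def classify_translation_error(exc: Exception) -> int:
    if isinstance(exc, ValueError):    # validation error -> 400 Bad Request
        return 400
    if isinstance(exc, RuntimeError):  # backend unavailable -> 503, retryable
        return 503
    return 500                         # anything else -> 500 Internal Server Error
```

Ordering matters: the `RuntimeError` check must sit before the catch-all, exactly as the new `except` clause is placed before `except Exception` in the diff.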
benchmarks/README.md 0 → 100644
@@ -0,0 +1,17 @@ @@ -0,0 +1,17 @@
  1 +# Benchmarks
  2 +
  3 +Benchmark and load-test scripts live under `benchmarks/`, no longer mixed in with the service startup/ops scripts in `scripts/`.
  4 +
  5 +Directory layout:
  6 +
  7 +- `benchmarks/perf_api_benchmark.py`: general-purpose HTTP API load-test entry point
  8 +- `benchmarks/reranker/`: reranker-specific benchmarks, smoke tests, and manual comparison scripts
  9 +- `benchmarks/translation/`: local translation-model benchmarks
  10 +
  11 +These scripts are not part of CI by default, because they typically:
  12 +
  13 +- depend on live services, GPUs, models, or specific datasets
  14 +- produce results that vary with machine specs and runtime load, making them unsuitable as stable regression gates
  15 +- target capacity planning, tuning, and issue reproduction rather than functional-correctness checks
  16 +
  17 +If a performance scenario needs automated regression coverage, add it under `tests/` with pinned inputs, environment, and pass/fail thresholds instead of reusing these benchmark scripts directly.

scripts/perf_api_benchmark.py renamed to benchmarks/perf_api_benchmark.py
@@ -11,13 +11,13 @@ Default scenarios (aligned with docs/搜索API对接指南 分册,如 -01 / -0 @@ -11,13 +11,13 @@ Default scenarios (aligned with docs/搜索API对接指南 分册,如 -01 / -0
11 - rerank POST /rerank 11 - rerank POST /rerank
12 12
13 Examples: 13 Examples:
14 - python scripts/perf_api_benchmark.py --scenario backend_search --duration 30 --concurrency 20 --tenant-id 162  
15 - python scripts/perf_api_benchmark.py --scenario backend_suggest --duration 30 --concurrency 50 --tenant-id 162  
16 - python scripts/perf_api_benchmark.py --scenario all --duration 60 --concurrency 80 --tenant-id 162  
17 - python scripts/perf_api_benchmark.py --scenario all --cases-file scripts/perf_cases.json.example --output perf_result.json 14 + python benchmarks/perf_api_benchmark.py --scenario backend_search --duration 30 --concurrency 20 --tenant-id 162
  15 + python benchmarks/perf_api_benchmark.py --scenario backend_suggest --duration 30 --concurrency 50 --tenant-id 162
  16 + python benchmarks/perf_api_benchmark.py --scenario all --duration 60 --concurrency 80 --tenant-id 162
  17 + python benchmarks/perf_api_benchmark.py --scenario all --cases-file benchmarks/perf_cases.json.example --output perf_result.json
18 # Embedding admission / priority (query param `priority`; same semantics as embedding service): 18 # Embedding admission / priority (query param `priority`; same semantics as embedding service):
19 - python scripts/perf_api_benchmark.py --scenario embed_text --embed-text-priority 1 --duration 30 --concurrency 20  
20 - python scripts/perf_api_benchmark.py --scenario embed_image --embed-image-priority 1 --duration 30 --concurrency 10 19 + python benchmarks/perf_api_benchmark.py --scenario embed_text --embed-text-priority 1 --duration 30 --concurrency 20
  20 + python benchmarks/perf_api_benchmark.py --scenario embed_image --embed-image-priority 1 --duration 30 --concurrency 10
21 """ 21 """
22 22
23 from __future__ import annotations 23 from __future__ import annotations
@@ -229,7 +229,7 @@ def apply_embed_priority_params( @@ -229,7 +229,7 @@ def apply_embed_priority_params(
229 ) -> None: 229 ) -> None:
230 """ 230 """
231 Merge default `priority` query param into embed templates when absent. 231 Merge default `priority` query param into embed templates when absent.
232 - `scripts/perf_cases.json` may set per-request `params.priority` to override. 232 + `benchmarks/perf_cases.json` may set per-request `params.priority` to override.
233 """ 233 """
234 mapping = { 234 mapping = {
235 "embed_text": max(0, int(embed_text_priority)), 235 "embed_text": max(0, int(embed_text_priority)),
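The docstring above describes merging a default `priority` query param only when a case file has not already set one. A minimal sketch of that merge rule (helper name hypothetical), assuming the clamp-to-non-negative behavior shown in the `mapping` dict:

```python
# Sketch of "merge default query param when absent" (helper name hypothetical).
def merge_default_priority(params: dict, default_priority: int) -> dict:
    # A per-request params.priority from the cases file wins; otherwise fall
    # back to the CLI default, clamped to >= 0 like the mapping above.
    params.setdefault("priority", max(0, int(default_priority)))
    return params
```

`dict.setdefault` only writes when the key is missing, which is exactly the "may set per-request `params.priority` to override" contract.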
scripts/perf_cases.json.example renamed to benchmarks/perf_cases.json.example
scripts/benchmark_reranker_1000docs.sh renamed to benchmarks/reranker/benchmark_reranker_1000docs.sh
@@ -8,7 +8,7 @@ @@ -8,7 +8,7 @@
8 # Outputs JSON reports under perf_reports/<date>/reranker_1000docs/ 8 # Outputs JSON reports under perf_reports/<date>/reranker_1000docs/
9 # 9 #
10 # Usage: 10 # Usage:
11 -# ./scripts/benchmark_reranker_1000docs.sh 11 +# ./benchmarks/reranker/benchmark_reranker_1000docs.sh
12 # Optional env: 12 # Optional env:
13 # BATCH_SIZES="24 32 48 64" 13 # BATCH_SIZES="24 32 48 64"
14 # C1_REQUESTS=4 14 # C1_REQUESTS=4
@@ -85,7 +85,7 @@ run_bench() { @@ -85,7 +85,7 @@ run_bench() {
85 local c="$2" 85 local c="$2"
86 local req="$3" 86 local req="$3"
87 local out="${OUT_DIR}/rerank_bs${bs}_c${c}_r${req}.json" 87 local out="${OUT_DIR}/rerank_bs${bs}_c${c}_r${req}.json"
88 - .venv/bin/python scripts/perf_api_benchmark.py \ 88 + .venv/bin/python benchmarks/perf_api_benchmark.py \
89 --scenario rerank \ 89 --scenario rerank \
90 --tenant-id "${TENANT_ID}" \ 90 --tenant-id "${TENANT_ID}" \
91 --reranker-base "${RERANK_BASE}" \ 91 --reranker-base "${RERANK_BASE}" \
scripts/benchmark_reranker_gguf_local.py renamed to benchmarks/reranker/benchmark_reranker_gguf_local.py
@@ -8,8 +8,8 @@ Runs the backend directly in a fresh process per config to measure: @@ -8,8 +8,8 @@ Runs the backend directly in a fresh process per config to measure:
8 - single-request rerank latency 8 - single-request rerank latency
9 9
10 Example: 10 Example:
11 - ./.venv-reranker-gguf/bin/python scripts/benchmark_reranker_gguf_local.py  
12 - ./.venv-reranker-gguf-06b/bin/python scripts/benchmark_reranker_gguf_local.py --backend-name qwen3_gguf_06b --docs 400 11 + ./.venv-reranker-gguf/bin/python benchmarks/reranker/benchmark_reranker_gguf_local.py
  12 + ./.venv-reranker-gguf-06b/bin/python benchmarks/reranker/benchmark_reranker_gguf_local.py --backend-name qwen3_gguf_06b --docs 400
13 """ 13 """
14 14
15 from __future__ import annotations 15 from __future__ import annotations
scripts/benchmark_reranker_random_titles.py renamed to benchmarks/reranker/benchmark_reranker_random_titles.py
@@ -10,10 +10,10 @@ Each invocation runs 3 warmup requests with n=400 first; those are not timed for @@ -10,10 +10,10 @@ Each invocation runs 3 warmup requests with n=400 first; those are not timed for
10 10
11 Example: 11 Example:
12 source activate.sh 12 source activate.sh
13 - python scripts/benchmark_reranker_random_titles.py 386  
14 - python scripts/benchmark_reranker_random_titles.py 40,80,100  
15 - python scripts/benchmark_reranker_random_titles.py 40,80,100 --repeat 3 --seed 42  
16 - RERANK_BASE=http://127.0.0.1:6007 python scripts/benchmark_reranker_random_titles.py 200 13 + python benchmarks/reranker/benchmark_reranker_random_titles.py 386
  14 + python benchmarks/reranker/benchmark_reranker_random_titles.py 40,80,100
  15 + python benchmarks/reranker/benchmark_reranker_random_titles.py 40,80,100 --repeat 3 --seed 42
  16 + RERANK_BASE=http://127.0.0.1:6007 python benchmarks/reranker/benchmark_reranker_random_titles.py 200
17 """ 17 """
18 18
19 from __future__ import annotations 19 from __future__ import annotations
tests/reranker_performance/curl1.sh renamed to benchmarks/reranker/manual/curl1.sh
tests/reranker_performance/curl1_simple.sh renamed to benchmarks/reranker/manual/curl1_simple.sh
tests/reranker_performance/curl2.sh renamed to benchmarks/reranker/manual/curl2.sh
tests/reranker_performance/rerank_performance_compare.sh renamed to benchmarks/reranker/manual/rerank_performance_compare.sh
scripts/patch_rerank_vllm_benchmark_config.py renamed to benchmarks/reranker/patch_rerank_vllm_benchmark_config.py
@@ -73,7 +73,7 @@ def main() -> int: @@ -73,7 +73,7 @@ def main() -> int:
73 p.add_argument( 73 p.add_argument(
74 "--config", 74 "--config",
75 type=Path, 75 type=Path,
76 - default=Path(__file__).resolve().parent.parent / "config" / "config.yaml", 76 + default=Path(__file__).resolve().parents[2] / "config" / "config.yaml",
77 ) 77 )
78 p.add_argument("--backend", choices=("qwen3_vllm", "qwen3_vllm_score"), required=True) 78 p.add_argument("--backend", choices=("qwen3_vllm", "qwen3_vllm_score"), required=True)
79 p.add_argument( 79 p.add_argument(
scripts/run_reranker_vllm_instruction_benchmark.sh renamed to benchmarks/reranker/run_reranker_vllm_instruction_benchmark.sh
@@ -55,13 +55,13 @@ run_one() { @@ -55,13 +55,13 @@ run_one() {
55 local jf="${OUT_DIR}/${backend}_${fmt}.json" 55 local jf="${OUT_DIR}/${backend}_${fmt}.json"
56 56
57 echo "========== ${tag} ==========" 57 echo "========== ${tag} =========="
58 - "$PYTHON" "${ROOT}/scripts/patch_rerank_vllm_benchmark_config.py" \ 58 + "$PYTHON" "${ROOT}/benchmarks/reranker/patch_rerank_vllm_benchmark_config.py" \
59 --backend "$backend" --instruction-format "$fmt" 59 --backend "$backend" --instruction-format "$fmt"
60 60
61 "${ROOT}/restart.sh" reranker 61 "${ROOT}/restart.sh" reranker
62 wait_health "$backend" "$fmt" 62 wait_health "$backend" "$fmt"
63 63
64 - if ! "$PYTHON" "${ROOT}/scripts/benchmark_reranker_random_titles.py" \ 64 + if ! "$PYTHON" "${ROOT}/benchmarks/reranker/benchmark_reranker_random_titles.py" \
65 100,200,400,600,800,1000 \ 65 100,200,400,600,800,1000 \
66 --repeat 5 \ 66 --repeat 5 \
67 --seed 42 \ 67 --seed 42 \
@@ -82,7 +82,7 @@ run_one qwen3_vllm_score compact @@ -82,7 +82,7 @@ run_one qwen3_vllm_score compact
82 run_one qwen3_vllm_score standard 82 run_one qwen3_vllm_score standard
83 83
84 # Restore repo-default-style rerank settings (score + compact). 84 # Restore repo-default-style rerank settings (score + compact).
85 -"$PYTHON" "${ROOT}/scripts/patch_rerank_vllm_benchmark_config.py" \ 85 +"$PYTHON" "${ROOT}/benchmarks/reranker/patch_rerank_vllm_benchmark_config.py" \
86 --backend qwen3_vllm_score --instruction-format compact 86 --backend qwen3_vllm_score --instruction-format compact
87 "${ROOT}/restart.sh" reranker 87 "${ROOT}/restart.sh" reranker
88 wait_health qwen3_vllm_score compact 88 wait_health qwen3_vllm_score compact
scripts/smoke_qwen3_vllm_score_backend.py renamed to benchmarks/reranker/smoke_qwen3_vllm_score_backend.py
@@ -3,7 +3,7 @@ @@ -3,7 +3,7 @@
3 Smoke test: load Qwen3VLLMScoreRerankerBackend (must run as a file, not stdin — vLLM spawn). 3 Smoke test: load Qwen3VLLMScoreRerankerBackend (must run as a file, not stdin — vLLM spawn).
4 4
5 Usage (from repo root, score venv): 5 Usage (from repo root, score venv):
6 - PYTHONPATH=. ./.venv-reranker-score/bin/python scripts/smoke_qwen3_vllm_score_backend.py 6 + PYTHONPATH=. ./.venv-reranker-score/bin/python benchmarks/reranker/smoke_qwen3_vllm_score_backend.py
7 7
8 Same as production: vLLM child processes need the venv's ``bin`` on PATH (for pip's ``ninja`` when 8 Same as production: vLLM child processes need the venv's ``bin`` on PATH (for pip's ``ninja`` when
9 vLLM auto-selects FLASHINFER on T4/Turing). ``start_reranker.sh`` exports that; this script prepends 9 vLLM auto-selects FLASHINFER on T4/Turing). ``start_reranker.sh`` exports that; this script prepends
@@ -20,8 +20,8 @@ import sys @@ -20,8 +20,8 @@ import sys
20 import sysconfig 20 import sysconfig
21 from pathlib import Path 21 from pathlib import Path
22 22
23 -# Repo root on sys.path when run as scripts/smoke_*.py  
24 -_ROOT = Path(__file__).resolve().parents[1] 23 +# Repo root on sys.path when run from benchmarks/reranker/.
  24 +_ROOT = Path(__file__).resolve().parents[2]
25 if str(_ROOT) not in sys.path: 25 if str(_ROOT) not in sys.path:
26 sys.path.insert(0, str(_ROOT)) 26 sys.path.insert(0, str(_ROOT))
27 27
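The recurring `parent.parent` to `parents[2]` change follows from the scripts moving one directory deeper (`benchmarks/<area>/…` instead of `scripts/…`); `Path.parents` is zero-indexed starting at the containing directory. Illustrative paths:

```python
from pathlib import PurePosixPath

# A script at <repo>/benchmarks/reranker/smoke.py (illustrative path):
f = PurePosixPath("/repo/benchmarks/reranker/smoke.py")
assert f.parents[0] == PurePosixPath("/repo/benchmarks/reranker")  # containing dir
assert f.parents[1] == PurePosixPath("/repo/benchmarks")
assert f.parents[2] == PurePosixPath("/repo")                      # repo root
# Under the old layout <repo>/scripts/smoke.py, parent.parent (== parents[1]) sufficed:
assert PurePosixPath("/repo/scripts/smoke.py").parents[1] == PurePosixPath("/repo")
```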
scripts/benchmark_nllb_t4_tuning.py renamed to benchmarks/translation/benchmark_nllb_t4_tuning.py
@@ -11,12 +11,12 @@ from datetime import datetime @@ -11,12 +11,12 @@ from datetime import datetime
11 from pathlib import Path 11 from pathlib import Path
12 from typing import Any, Dict, List, Tuple 12 from typing import Any, Dict, List, Tuple
13 13
14 -PROJECT_ROOT = Path(__file__).resolve().parent.parent 14 +PROJECT_ROOT = Path(__file__).resolve().parents[2]
15 if str(PROJECT_ROOT) not in sys.path: 15 if str(PROJECT_ROOT) not in sys.path:
16 sys.path.insert(0, str(PROJECT_ROOT)) 16 sys.path.insert(0, str(PROJECT_ROOT))
17 17
18 from config.services_config import get_translation_config 18 from config.services_config import get_translation_config
19 -from scripts.benchmark_translation_local_models import ( 19 +from benchmarks.translation.benchmark_translation_local_models import (
20 benchmark_concurrency_case, 20 benchmark_concurrency_case,
21 benchmark_serial_case, 21 benchmark_serial_case,
22 build_environment_info, 22 build_environment_info,
scripts/benchmark_translation_local_models.py renamed to benchmarks/translation/benchmark_translation_local_models.py
@@ -22,7 +22,7 @@ from typing import Any, Dict, Iterable, List, Sequence @@ -22,7 +22,7 @@ from typing import Any, Dict, Iterable, List, Sequence
22 import torch 22 import torch
23 import transformers 23 import transformers
24 24
25 -PROJECT_ROOT = Path(__file__).resolve().parent.parent 25 +PROJECT_ROOT = Path(__file__).resolve().parents[2]
26 if str(PROJECT_ROOT) not in sys.path: 26 if str(PROJECT_ROOT) not in sys.path:
27 sys.path.insert(0, str(PROJECT_ROOT)) 27 sys.path.insert(0, str(PROJECT_ROOT))
28 28
scripts/benchmark_translation_local_models_focus.py renamed to benchmarks/translation/benchmark_translation_local_models_focus.py
@@ -11,12 +11,12 @@ from datetime import datetime @@ -11,12 +11,12 @@ from datetime import datetime
11 from pathlib import Path 11 from pathlib import Path
12 from typing import Any, Dict, List 12 from typing import Any, Dict, List
13 13
14 -PROJECT_ROOT = Path(__file__).resolve().parent.parent 14 +PROJECT_ROOT = Path(__file__).resolve().parents[2]
15 if str(PROJECT_ROOT) not in sys.path: 15 if str(PROJECT_ROOT) not in sys.path:
16 sys.path.insert(0, str(PROJECT_ROOT)) 16 sys.path.insert(0, str(PROJECT_ROOT))
17 17
18 from config.services_config import get_translation_config 18 from config.services_config import get_translation_config
19 -from scripts.benchmark_translation_local_models import ( 19 +from benchmarks.translation.benchmark_translation_local_models import (
20 SCENARIOS, 20 SCENARIOS,
21 benchmark_concurrency_case, 21 benchmark_concurrency_case,
22 benchmark_serial_case, 22 benchmark_serial_case,
scripts/benchmark_translation_longtext_single.py renamed to benchmarks/translation/benchmark_translation_longtext_single.py
@@ -13,7 +13,7 @@ from pathlib import Path @@ -13,7 +13,7 @@ from pathlib import Path
13 13
14 import torch 14 import torch
15 15
16 -PROJECT_ROOT = Path(__file__).resolve().parent.parent 16 +PROJECT_ROOT = Path(__file__).resolve().parents[2]
17 17
18 import sys 18 import sys
19 19
config/config.yaml
1 -# Unified Configuration for Multi-Tenant Search Engine  
2 -# 统一配置文件,所有租户共用一套配置  
3 -# 注意:索引结构由 mappings/search_products.json 定义,此文件只配置搜索行为  
4 -#  
5 -# 约定:下列键为必填;进程环境变量可覆盖 infrastructure / runtime 中同名语义项  
6 -#(如 ES_HOST、API_PORT 等),未设置环境变量时使用本文件中的值。  
7 -  
8 -# Process / bind addresses (环境变量 APP_ENV、RUNTIME_ENV、ES_INDEX_NAMESPACE 可覆盖前两者的语义)  
9 runtime: 1 runtime:
10 environment: prod 2 environment: prod
11 index_namespace: '' 3 index_namespace: ''
@@ -21,8 +13,6 @@ runtime: @@ -21,8 +13,6 @@ runtime:
21 translator_port: 6006 13 translator_port: 6006
22 reranker_host: 0.0.0.0 14 reranker_host: 0.0.0.0
23 reranker_port: 6007 15 reranker_port: 6007
24 -  
25 -# 基础设施连接(敏感项优先读环境变量:ES_*、REDIS_*、DB_*、DASHSCOPE_API_KEY、DEEPL_AUTH_KEY)  
26 infrastructure: 16 infrastructure:
27 elasticsearch: 17 elasticsearch:
28 host: http://localhost:9200 18 host: http://localhost:9200
@@ -49,23 +39,12 @@ infrastructure: @@ -49,23 +39,12 @@ infrastructure:
49 secrets: 39 secrets:
50 dashscope_api_key: null 40 dashscope_api_key: null
51 deepl_auth_key: null 41 deepl_auth_key: null
52 -  
53 -# Elasticsearch Index  
54 es_index_name: search_products 42 es_index_name: search_products
55 -  
56 -# 检索域 / 索引列表(可为空列表;每项字段均需显式给出)  
57 indexes: [] 43 indexes: []
58 -  
59 -# Config assets  
60 assets: 44 assets:
61 query_rewrite_dictionary_path: config/dictionaries/query_rewrite.dict 45 query_rewrite_dictionary_path: config/dictionaries/query_rewrite.dict
62 -  
63 -# Product content understanding (LLM enrich-content) configuration  
64 product_enrich: 46 product_enrich:
65 max_workers: 40 47 max_workers: 40
66 -  
67 -# 离线 / Web 相关性评估(scripts/evaluation、eval-web)  
68 -# CLI 未显式传参时使用此处默认值;search_base_url 未配置时自动为 http://127.0.0.1:{runtime.api_port}  
69 search_evaluation: 48 search_evaluation:
70 artifact_root: artifacts/search_evaluation 49 artifact_root: artifacts/search_evaluation
71 queries_file: scripts/evaluation/queries/queries.txt 50 queries_file: scripts/evaluation/queries/queries.txt
@@ -74,10 +53,10 @@ search_evaluation: @@ -74,10 +53,10 @@ search_evaluation:
74 search_base_url: '' 53 search_base_url: ''
75 web_host: 0.0.0.0 54 web_host: 0.0.0.0
76 web_port: 6010 55 web_port: 6010
77 - judge_model: qwen3.5-plus 56 + judge_model: qwen3.6-plus
78 judge_enable_thinking: false 57 judge_enable_thinking: false
79 judge_dashscope_batch: false 58 judge_dashscope_batch: false
80 - intent_model: qwen3-max 59 + intent_model: qwen3.6-plus
81 intent_enable_thinking: true 60 intent_enable_thinking: true
82 judge_batch_completion_window: 24h 61 judge_batch_completion_window: 24h
83 judge_batch_poll_interval_sec: 10.0 62 judge_batch_poll_interval_sec: 10.0
@@ -98,20 +77,17 @@ search_evaluation: @@ -98,20 +77,17 @@ search_evaluation:
98 rebuild_irrelevant_stop_ratio: 0.799 77 rebuild_irrelevant_stop_ratio: 0.799
99 rebuild_irrel_low_combined_stop_ratio: 0.959 78 rebuild_irrel_low_combined_stop_ratio: 0.959
100 rebuild_irrelevant_stop_streak: 3 79 rebuild_irrelevant_stop_streak: 3
101 -  
102 -# ES Index Settings (基础设置)  
103 es_settings: 80 es_settings:
104 number_of_shards: 1 81 number_of_shards: 1
105 number_of_replicas: 0 82 number_of_replicas: 0
106 refresh_interval: 30s 83 refresh_interval: 30s
107 84
108 -# 字段权重配置(用于搜索时的字段boost)  
109 -# 统一按“字段基名”配置;查询时按实际检索语言动态拼接 .{lang}。  
110 -# To tune a specific language separately, an explicit key can also be added (e.g. title.de: 3.2). 85 +# Configured by field base name; the search-language suffix .{lang} is appended at query time
111 field_boosts: 86 field_boosts:
112 title: 3.0 87 title: 3.0
113 - qanchors: 1.8  
114 - enriched_tags: 1.8 88 + # qanchors and enriched_tags also appear in enriched_attributes.value, so their effective weight is their own boost plus the enriched_attributes.value boost
  89 + qanchors: 1.0
  90 + enriched_tags: 1.0
115 enriched_attributes.value: 1.5 91 enriched_attributes.value: 1.5
116 category_name_text: 2.0 92 category_name_text: 2.0
117 category_path: 2.0 93 category_path: 2.0
@@ -124,38 +100,25 @@ field_boosts: @@ -124,38 +100,25 @@ field_boosts:
124 description: 1.0 100 description: 1.0
125 vendor: 1.0 101 vendor: 1.0
126 102
127 -# Query Configuration(查询配置)  
128 query_config: 103 query_config:
129 - # 支持的语言  
130 supported_languages: 104 supported_languages:
131 - zh 105 - zh
132 - en 106 - en
133 default_language: en 107 default_language: en
134 -  
135 - # 功能开关(翻译开关由tenant_config控制)  
136 enable_text_embedding: true 108 enable_text_embedding: true
137 enable_query_rewrite: true 109 enable_query_rewrite: true
138 110
139 - # 查询翻译模型(须与 services.translation.capabilities 中某项一致)  
140 - # 源语种在租户 index_languages 内:主召回可打在源语种字段,用下面三项。  
141 - zh_to_en_model: nllb-200-distilled-600m # "opus-mt-zh-en"  
142 - en_to_zh_model: nllb-200-distilled-600m # "opus-mt-en-zh"  
143 - default_translation_model: nllb-200-distilled-600m  
144 - # zh_to_en_model: deepl  
145 - # en_to_zh_model: deepl  
146 - # default_translation_model: deepl  
147 - # 源语种不在 index_languages:翻译对可检索文本更关键,可单独指定(缺省则与上一组相同)  
148 - zh_to_en_model__source_not_in_index: nllb-200-distilled-600m  
149 - en_to_zh_model__source_not_in_index: nllb-200-distilled-600m  
150 - default_translation_model__source_not_in_index: nllb-200-distilled-600m  
151 - # zh_to_en_model__source_not_in_index: deepl  
152 - # en_to_zh_model__source_not_in_index: deepl  
153 - # default_translation_model__source_not_in_index: deepl 111 + zh_to_en_model: deepl # nllb-200-distilled-600m
  112 + en_to_zh_model: deepl
  113 + default_translation_model: deepl
  114 + # When the source language is not in index_languages, translation quality matters more, so these are configured separately
  115 + zh_to_en_model__source_not_in_index: deepl
  116 + en_to_zh_model__source_not_in_index: deepl
  117 + default_translation_model__source_not_in_index: deepl
154 118
155 - # 查询解析阶段:翻译与 query 向量并发执行,共用同一等待预算(毫秒)。  
156 - # 检测语言已在租户 index_languages 内:较短;不在索引语言内:较长(翻译对召回更关键)。  
157 - translation_embedding_wait_budget_ms_source_in_index: 300 # 80  
157 - translation_embedding_wait_budget_ms_source_in_index: 300 # 80 119 + # Query-parsing stage: translation and the query embedding run concurrently under one shared wait budget (ms)
  120 + translation_embedding_wait_budget_ms_source_in_index: 300
  121 + translation_embedding_wait_budget_ms_source_not_in_index: 400
159 style_intent: 122 style_intent:
160 enabled: true 123 enabled: true
161 selected_sku_boost: 1.2 124 selected_sku_boost: 1.2
@@ -182,17 +145,15 @@ query_config: @@ -182,17 +145,15 @@ query_config:
182 product_title_exclusion: 145 product_title_exclusion:
183 enabled: true 146 enabled: true
184 dictionary_path: config/dictionaries/product_title_exclusion.tsv 147 dictionary_path: config/dictionaries/product_title_exclusion.tsv
185 -  
186 - # 动态多语言检索字段配置  
187 - # multilingual_fields 会被拼成 title.{lang}/brief.{lang}/... 形式;  
188 - # shared_fields 为无语言后缀字段。  
189 search_fields: 148 search_fields:
  149 + # Configured by field base name; the search-language suffix .{lang} is appended at query time
190 multilingual_fields: 150 multilingual_fields:
191 - title 151 - title
192 - keywords 152 - keywords
193 - qanchors 153 - qanchors
194 - enriched_tags 154 - enriched_tags
195 - enriched_attributes.value 155 - enriched_attributes.value
  156 + # - enriched_taxonomy_attributes.value
196 - option1_values 157 - option1_values
197 - option2_values 158 - option2_values
198 - option3_values 159 - option3_values
@@ -202,13 +163,14 @@ query_config: @@ -202,13 +163,14 @@ query_config:
202 # - description 163 # - description
203 # - vendor 164 # - vendor
204 # shared_fields: fields without a language suffix; e.g. tags, option1_values, option2_values, option3_values 165 # shared_fields: fields without a language suffix; e.g. tags, option1_values, option2_values, option3_values
  166 +
205 shared_fields: null 167 shared_fields: null
206 core_multilingual_fields: 168 core_multilingual_fields:
207 - title 169 - title
208 - qanchors 170 - qanchors
209 - category_name_text 171 - category_name_text
210 172
211 - # Unified text-recall strategy (main query + translated query) 173 + # Text recall (main query + translated query)
212 text_query_strategy: 174 text_query_strategy:
213 base_minimum_should_match: 60% 175 base_minimum_should_match: 60%
214 translation_minimum_should_match: 60% 176 translation_minimum_should_match: 60%
@@ -223,14 +185,10 @@ query_config: @@ -223,14 +185,10 @@ query_config:
223 title: 5.0 185 title: 5.0
224 qanchors: 4.0 186 qanchors: 4.0
225 phrase_match_boost: 3.0 187 phrase_match_boost: 3.0
226 -  
227 - # Embedding字段名称  
228 text_embedding_field: title_embedding 188 text_embedding_field: title_embedding
229 image_embedding_field: image_embedding.vector 189 image_embedding_field: image_embedding.vector
230 190
231 - # 返回字段配置(_source includes)  
232 - # null表示返回所有字段,[]表示不返回任何字段,列表表示只返回指定字段  
233 - # The fields below match api/result_formatter.py (SpuResult population) and search/searcher.py (SKU sorting / main-image replacement) 191 + # null returns all fields; [] returns none
234 source_fields: 192 source_fields:
235 - spu_id 193 - spu_id
236 - handle 194 - handle
@@ -251,6 +209,8 @@ query_config: @@ -251,6 +209,8 @@ query_config:
251 # - qanchors 209 # - qanchors
252 # - enriched_tags 210 # - enriched_tags
253 # - enriched_attributes 211 # - enriched_attributes
  212 + # - enriched_taxonomy_attributes.value
  213 +
254 - min_price 214 - min_price
255 - compare_at_price 215 - compare_at_price
256 - image_url 216 - image_url
@@ -270,26 +230,21 @@ query_config: @@ -270,26 +230,21 @@ query_config:
270 # KNN: separate boost and recall settings (k / num_candidates) for text and multimodal (image) vectors 230 # KNN: separate boost and recall settings (k / num_candidates) for text and multimodal (image) vectors
271 knn_text_boost: 4 231 knn_text_boost: 4
272 knn_image_boost: 4 232 knn_image_boost: 4
273 -  
274 - # knn_text_num_candidates = k * 3.4  
275 knn_text_k: 160 233 knn_text_k: 160
276 - knn_text_num_candidates: 560 234 + knn_text_num_candidates: 560 # k * 3.4
277 knn_text_k_long: 400 235 knn_text_k_long: 400
278 knn_text_num_candidates_long: 1200 236 knn_text_num_candidates_long: 1200
279 knn_image_k: 400 237 knn_image_k: 400
280 knn_image_num_candidates: 1200 238 knn_image_num_candidates: 1200
281 239
282 -# Function Score配置(ES层打分规则)  
283 function_score: 240 function_score:
284 score_mode: sum 241 score_mode: sum
285 boost_mode: multiply 242 boost_mode: multiply
286 functions: [] 243 functions: []
287 -  
288 -# 粗排配置(仅融合 ES 文本/向量信号,不调用模型)  
289 coarse_rank: 244 coarse_rank:
290 enabled: true 245 enabled: true
291 - input_window: 700  
292 - output_window: 240 246 + input_window: 480
  247 + output_window: 160
293 fusion: 248 fusion:
294 es_bias: 10.0 249 es_bias: 10.0
295 es_exponent: 0.05 250 es_exponent: 0.05
@@ -301,30 +256,29 @@ coarse_rank: @@ -301,30 +256,29 @@ coarse_rank:
301 knn_text_weight: 1.0 256 knn_text_weight: 1.0
302 knn_image_weight: 2.0 257 knn_image_weight: 2.0
303 knn_tie_breaker: 0.3 258 knn_tie_breaker: 0.3
304 - knn_bias: 0.6  
305 - knn_exponent: 0.4  
306 -  
307 -# 精排配置(轻量 reranker) 259 + knn_bias: 0.0
  260 + knn_exponent: 5.6
  261 + knn_text_exponent: 0.0
  262 + knn_image_exponent: 0.0
308 fine_rank: 263 fine_rank:
309 - enabled: false 264 + enabled: false # when false, results pass through with order preserved
310 input_window: 160 265 input_window: 160
311 output_window: 80 266 output_window: 80
312 timeout_sec: 10.0 267 timeout_sec: 10.0
313 rerank_query_template: '{query}' 268 rerank_query_template: '{query}'
314 rerank_doc_template: '{title}' 269 rerank_doc_template: '{title}'
315 service_profile: fine 270 service_profile: fine
316 -  
317 -# 重排配置(provider/URL 在 services.rerank)  
318 rerank: 271 rerank:
319 - enabled: true 272 + enabled: false # when false, results pass through with order preserved
320 rerank_window: 160 273 rerank_window: 160
  274 + exact_knn_rescore_enabled: true
  275 + exact_knn_rescore_window: 160
321 timeout_sec: 15.0 276 timeout_sec: 15.0
322 weight_es: 0.4 277 weight_es: 0.4
323 weight_ai: 0.6 278 weight_ai: 0.6
324 rerank_query_template: '{query}' 279 rerank_query_template: '{query}'
325 rerank_doc_template: '{title}' 280 rerank_doc_template: '{title}'
326 service_profile: default 281 service_profile: default
327 -  
328 # Multiplicative fusion: fused = Π (max(score, 0) + bias) ** exponent (es / rerank / fine / text / knn) 282 # Multiplicative fusion: fused = Π (max(score, 0) + bias) ** exponent (es / rerank / fine / text / knn)
329 # where knn_score first goes through a dis_max layer: 283 # where knn_score first goes through a dis_max layer:
330 # max(knn_text_weight * text_knn, knn_image_weight * image_knn) 284 # max(knn_text_weight * text_knn, knn_image_weight * image_knn)
@@ -337,30 +291,28 @@ rerank: @@ -337,30 +291,28 @@ rerank:
337 fine_bias: 0.1 291 fine_bias: 0.1
338 fine_exponent: 1.0 292 fine_exponent: 1.0
339 text_bias: 0.1 293 text_bias: 0.1
340 - text_exponent: 0.25  
341 # weight of base_query_trans_* relative to base_query (see the text dismax fusion in search/rerank_client) 294 # weight of base_query_trans_* relative to base_query (see the text dismax fusion in search/rerank_client)
  295 + text_exponent: 0.25
342 text_translation_weight: 0.8 296 text_translation_weight: 0.8
343 knn_text_weight: 1.0 297 knn_text_weight: 1.0
344 knn_image_weight: 2.0 298 knn_image_weight: 2.0
345 knn_tie_breaker: 0.3 299 knn_tie_breaker: 0.3
346 - knn_bias: 0.6  
347 - knn_exponent: 0.4 300 + knn_bias: 0.0
  301 + knn_exponent: 5.6
348 302
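The multiplicative-fusion comment above (`fused = Π (max(score, 0) + bias) ** exponent`, with `knn_score` first reduced via dis_max over the weighted text/image KNN scores) can be sketched as follows; the function name and `cfg` layout are hypothetical, and only the es / text / knn terms are shown (the real config also carries rerank / fine):

```python
# Sketch of the multiplicative fusion (names hypothetical).
def fuse(scores: dict, cfg: dict) -> float:
    # dis_max per the config comment:
    # knn_score = max(knn_text_weight * text_knn, knn_image_weight * image_knn)
    knn = max(cfg["knn_text_weight"] * scores.get("text_knn", 0.0),
              cfg["knn_image_weight"] * scores.get("image_knn", 0.0))
    fused = 1.0
    for name, value in (("es", scores.get("es", 0.0)),
                        ("text", scores.get("text", 0.0)),
                        ("knn", knn)):
        # fused = Π (max(score, 0) + bias) ** exponent
        fused *= (max(value, 0.0) + cfg[f"{name}_bias"]) ** cfg[f"{name}_exponent"]
    return fused
```

Note that with `knn_bias: 0.0`, a missing or zero KNN score drives the whole product to zero, so that setting presumes every candidate carries a KNN signal.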
349 -# 可扩展服务/provider 注册表(单一配置源)  
350 services: 303 services:
351 translation: 304 translation:
352 service_url: http://127.0.0.1:6006 305 service_url: http://127.0.0.1:6006
353 - # default_model: nllb-200-distilled-600m  
354 default_model: nllb-200-distilled-600m 306 default_model: nllb-200-distilled-600m
355 default_scene: general 307 default_scene: general
356 timeout_sec: 10.0 308 timeout_sec: 10.0
357 cache: 309 cache:
358 ttl_seconds: 62208000 310 ttl_seconds: 62208000
359 sliding_expiration: true 311 sliding_expiration: true
360 - # When false, cache keys are exact-match per request model only (ignores model_quality_tiers for lookups).  
361 - enable_model_quality_tier_cache: true 312 + # When false, cache keys are exact-match per request model only (ignores model_quality_tiers for lookups)
362 # Higher tier = better quality. Multiple models may share one tier. 313 # Higher tier = better quality. Multiple models may share one tier.
363 # A request may reuse Redis keys from models with tier > A or tier == A (not from lower tiers). 314 # A request may reuse Redis keys from models with tier > A or tier == A (not from lower tiers).
  315 + enable_model_quality_tier_cache: true
364 model_quality_tiers: 316 model_quality_tiers:
365 deepl: 30 317 deepl: 30
366 qwen-mt: 30 318 qwen-mt: 30
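The tier comment above says a request may reuse Redis keys only from models whose quality tier is greater than or equal to its own. A minimal sketch of that rule (helper name hypothetical; the nllb tier value in the test is illustrative, not taken from this hunk):

```python
# Sketch of the cache tier-reuse rule (helper name hypothetical).
def reusable_models(request_model: str, tiers: dict) -> list:
    # Reuse cache keys only from models at the same or a higher quality tier;
    # lower-tier caches are never reused.
    own = tiers[request_model]
    return sorted(model for model, tier in tiers.items() if tier >= own)
```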
@@ -454,13 +406,12 @@ services: @@ -454,13 +406,12 @@ services:
454 num_beams: 1 406 num_beams: 1
455 use_cache: true 407 use_cache: true
456 embedding: 408 embedding:
457 - provider: http # http 409 + provider: http
458 providers: 410 providers:
459 http: 411 http:
460 text_base_url: http://127.0.0.1:6005 412 text_base_url: http://127.0.0.1:6005
461 image_base_url: http://127.0.0.1:6008 413 image_base_url: http://127.0.0.1:6008
462 - # 服务内文本后端(embedding 进程启动时读取)  
463 - backend: tei # tei | local_st 414 + backend: tei
464 backends: 415 backends:
465 tei: 416 tei:
466 base_url: http://127.0.0.1:8080 417 base_url: http://127.0.0.1:8080
@@ -500,13 +451,13 @@ services: @@ -500,13 +451,13 @@ services:
500 request: 451 request:
501 max_docs: 1000 452 max_docs: 1000
502 normalize: true 453 normalize: true
503 - default_instance: default  
504 # Named instances: the same reranker code reads a different port / backend / runtime dir per instance name. 454 # Named instances: the same reranker code reads a different port / backend / runtime dir per instance name.
  455 + default_instance: default
505 instances: 456 instances:
506 default: 457 default:
507 host: 0.0.0.0 458 host: 0.0.0.0
508 port: 6007 459 port: 6007
509 - backend: qwen3_vllm_score 460 + backend: bge
510 runtime_dir: ./.runtime/reranker/default 461 runtime_dir: ./.runtime/reranker/default
511 fine: 462 fine:
512 host: 0.0.0.0 463 host: 0.0.0.0
@@ -543,6 +494,7 @@ services: @@ -543,6 +494,7 @@ services:
543 enforce_eager: false 494 enforce_eager: false
544 infer_batch_size: 100 495 infer_batch_size: 100
545 sort_by_doc_length: true 496 sort_by_doc_length: true
  497 +
546 # standard=_format_instruction__standard (fixed yes/no system); compact=_format_instruction (instruction as system, with Instruct repeated in the user turn) 498 # standard=_format_instruction__standard (fixed yes/no system); compact=_format_instruction (instruction as system, with Instruct repeated in the user turn)
547 instruction_format: standard # compact standard 499 instruction_format: standard # compact standard
548 # instruction: "Given a query, score the product for relevance" 500 # instruction: "Given a query, score the product for relevance"
@@ -556,6 +508,7 @@ services: @@ -556,6 +508,7 @@ services:
556 # instruction: "Rank products by query with category & style match prioritized" 508 # instruction: "Rank products by query with category & style match prioritized"
557 # instruction: "Given a fashion shopping query, retrieve relevant products that answer the query" 509 # instruction: "Given a fashion shopping query, retrieve relevant products that answer the query"
558 instruction: rank products by given query 510 instruction: rank products by given query
  511 +
559 # vLLM LLM.score()(跨编码打分)。独立高性能环境 .venv-reranker-score(vllm 0.18 固定版):./scripts/setup_reranker_venv.sh qwen3_vllm_score 512 # vLLM LLM.score()(跨编码打分)。独立高性能环境 .venv-reranker-score(vllm 0.18 固定版):./scripts/setup_reranker_venv.sh qwen3_vllm_score
560 # 与 qwen3_vllm 可共用同一 model_name / HF 缓存;venv 分离以便升级 vLLM 而不影响 generate 后端。 513 # 与 qwen3_vllm 可共用同一 model_name / HF 缓存;venv 分离以便升级 vLLM 而不影响 generate 后端。
561 qwen3_vllm_score: 514 qwen3_vllm_score:
@@ -583,15 +536,10 @@ services: @@ -583,15 +536,10 @@ services:
583 qwen3_transformers: 536 qwen3_transformers:
584 model_name: Qwen/Qwen3-Reranker-0.6B 537 model_name: Qwen/Qwen3-Reranker-0.6B
585 instruction: rank products by given query 538 instruction: rank products by given query
586 - # instruction: "Score the product’s relevance to the given query"  
587 max_length: 8192 539 max_length: 8192
588 batch_size: 64 540 batch_size: 64
589 use_fp16: true 541 use_fp16: true
590 - # sdpa:默认无需 flash-attn;若已安装 flash_attn 可改为 flash_attention_2  
591 attn_implementation: sdpa 542 attn_implementation: sdpa
592 - # Packed Transformers backend: shared query prefix + custom position_ids/attention_mask.  
593 - # For 1 query + many short docs (for example 400 product titles), this usually reduces  
594 - # repeated prefix work and padding waste compared with pairwise batching.  
595 qwen3_transformers_packed: 543 qwen3_transformers_packed:
596 model_name: Qwen/Qwen3-Reranker-0.6B 544 model_name: Qwen/Qwen3-Reranker-0.6B
597 instruction: Rank products by query with category & style match prioritized 545 instruction: Rank products by query with category & style match prioritized
@@ -600,8 +548,6 @@ services: @@ -600,8 +548,6 @@ services:
600 max_docs_per_pack: 0 548 max_docs_per_pack: 0
601 use_fp16: true 549 use_fp16: true
602 sort_by_doc_length: true 550 sort_by_doc_length: true
603 - # Packed mode relies on a custom 4D attention mask. "eager" is the safest default.  
604 - # If your torch/transformers stack validates it, you can benchmark "sdpa".  
605 attn_implementation: eager 551 attn_implementation: eager
606 qwen3_gguf: 552 qwen3_gguf:
607 repo_id: DevQuasar/Qwen.Qwen3-Reranker-4B-GGUF 553 repo_id: DevQuasar/Qwen.Qwen3-Reranker-4B-GGUF
@@ -609,7 +555,6 @@ services: @@ -609,7 +555,6 @@ services:
609 cache_dir: ./model_cache 555 cache_dir: ./model_cache
610 local_dir: ./models/reranker/qwen3-reranker-4b-gguf 556 local_dir: ./models/reranker/qwen3-reranker-4b-gguf
611 instruction: Rank products by query with category & style match prioritized 557 instruction: Rank products by query with category & style match prioritized
612 - # T4 16GB / 性能优先配置:全量层 offload,实测比保守配置明显更快  
613 n_ctx: 512 558 n_ctx: 512
614 n_batch: 512 559 n_batch: 512
615 n_ubatch: 512 560 n_ubatch: 512
@@ -632,8 +577,6 @@ services: @@ -632,8 +577,6 @@ services:
632 cache_dir: ./model_cache 577 cache_dir: ./model_cache
633 local_dir: ./models/reranker/qwen3-reranker-0.6b-q8_0-gguf 578 local_dir: ./models/reranker/qwen3-reranker-0.6b-q8_0-gguf
634 instruction: Rank products by query with category & style match prioritized 579 instruction: Rank products by query with category & style match prioritized
635 - # 0.6B GGUF / online rerank baseline:  
636 - # 实测 400 titles 单请求约 265s,因此它更适合作为低显存功能后备,不适合在线低延迟主路由。  
637 n_ctx: 256 580 n_ctx: 256
638 n_batch: 256 581 n_batch: 256
639 n_ubatch: 256 582 n_ubatch: 256
@@ -653,20 +596,15 @@ services: @@ -653,20 +596,15 @@ services:
653 verbose: false 596 verbose: false
654 dashscope_rerank: 597 dashscope_rerank:
655 model_name: qwen3-rerank 598 model_name: qwen3-rerank
656 - # 按地域选择 endpoint:  
657 - # 中国: https://dashscope.aliyuncs.com/compatible-api/v1/reranks  
658 - # 新加坡: https://dashscope-intl.aliyuncs.com/compatible-api/v1/reranks  
659 - # 美国: https://dashscope-us.aliyuncs.com/compatible-api/v1/reranks  
660 endpoint: https://dashscope.aliyuncs.com/compatible-api/v1/reranks 599 endpoint: https://dashscope.aliyuncs.com/compatible-api/v1/reranks
661 api_key_env: RERANK_DASHSCOPE_API_KEY_CN 600 api_key_env: RERANK_DASHSCOPE_API_KEY_CN
662 timeout_sec: 10.0 601 timeout_sec: 10.0
663 - top_n_cap: 0 # 0 表示 top_n=当前请求文档数;>0 则限制 top_n 上限  
664 - batchsize: 64 # 0 关闭;>0 启用并发小包调度(top_n/top_n_cap 仍生效,分包后全局截断) 602 + top_n_cap: 0 # 0 表示 top_n=当前请求文档数
  603 + batchsize: 64 # 0 关闭;>0 启用并发小包调度(top_n/top_n_cap 仍生效,分包后全局截断)
665 instruct: Given a shopping query, rank product titles by relevance 604 instruct: Given a shopping query, rank product titles by relevance
666 max_retries: 2 605 max_retries: 2
667 retry_backoff_sec: 0.2 606 retry_backoff_sec: 0.2
668 607
669 -# SPU配置(已启用,使用嵌套skus)  
670 spu_config: 608 spu_config:
671 enabled: true 609 enabled: true
672 spu_field: spu_id 610 spu_field: spu_id
@@ -678,7 +616,6 @@ spu_config: @@ -678,7 +616,6 @@ spu_config:
678 - option2 616 - option2
679 - option3 617 - option3
680 618
681 -# 租户配置(Tenant Configuration)  
682 # 每个租户可配置主语言 primary_language 与索引语言 index_languages(主市场语言,商家可勾选) 619 # 每个租户可配置主语言 primary_language 与索引语言 index_languages(主市场语言,商家可勾选)
683 # 默认 index_languages: [en, zh],可配置为任意 SOURCE_LANG_CODE_MAP.keys() 的子集 620 # 默认 index_languages: [en, zh],可配置为任意 SOURCE_LANG_CODE_MAP.keys() 的子集
684 tenant_config: 621 tenant_config:
@@ -587,6 +587,14 @@ class AppConfigLoader: @@ -587,6 +587,14 @@ class AppConfigLoader:
587 knn_tie_breaker=float(coarse_fusion_raw.get("knn_tie_breaker", 0.0)), 587 knn_tie_breaker=float(coarse_fusion_raw.get("knn_tie_breaker", 0.0)),
588 knn_bias=float(coarse_fusion_raw.get("knn_bias", 0.6)), 588 knn_bias=float(coarse_fusion_raw.get("knn_bias", 0.6)),
589 knn_exponent=float(coarse_fusion_raw.get("knn_exponent", 0.2)), 589 knn_exponent=float(coarse_fusion_raw.get("knn_exponent", 0.2)),
  590 + knn_text_bias=float(
  591 + coarse_fusion_raw.get("knn_text_bias", coarse_fusion_raw.get("knn_bias", 0.6))
  592 + ),
  593 + knn_text_exponent=float(coarse_fusion_raw.get("knn_text_exponent", 0.0)),
  594 + knn_image_bias=float(
  595 + coarse_fusion_raw.get("knn_image_bias", coarse_fusion_raw.get("knn_bias", 0.6))
  596 + ),
  597 + knn_image_exponent=float(coarse_fusion_raw.get("knn_image_exponent", 0.0)),
590 text_translation_weight=float( 598 text_translation_weight=float(
591 coarse_fusion_raw.get("text_translation_weight", 0.8) 599 coarse_fusion_raw.get("text_translation_weight", 0.8)
592 ), 600 ),
@@ -608,6 +616,12 @@ class AppConfigLoader: @@ -608,6 +616,12 @@ class AppConfigLoader:
608 rerank=RerankConfig( 616 rerank=RerankConfig(
609 enabled=bool(rerank_cfg.get("enabled", True)), 617 enabled=bool(rerank_cfg.get("enabled", True)),
610 rerank_window=int(rerank_cfg.get("rerank_window", 384)), 618 rerank_window=int(rerank_cfg.get("rerank_window", 384)),
  619 + exact_knn_rescore_enabled=bool(
  620 + rerank_cfg.get("exact_knn_rescore_enabled", False)
  621 + ),
  622 + exact_knn_rescore_window=int(
  623 + rerank_cfg.get("exact_knn_rescore_window", 0)
  624 + ),
611 timeout_sec=float(rerank_cfg.get("timeout_sec", 15.0)), 625 timeout_sec=float(rerank_cfg.get("timeout_sec", 15.0)),
612 weight_es=float(rerank_cfg.get("weight_es", 0.4)), 626 weight_es=float(rerank_cfg.get("weight_es", 0.4)),
613 weight_ai=float(rerank_cfg.get("weight_ai", 0.6)), 627 weight_ai=float(rerank_cfg.get("weight_ai", 0.6)),
@@ -630,6 +644,14 @@ class AppConfigLoader: @@ -630,6 +644,14 @@ class AppConfigLoader:
630 knn_tie_breaker=float(fusion_raw.get("knn_tie_breaker", 0.0)), 644 knn_tie_breaker=float(fusion_raw.get("knn_tie_breaker", 0.0)),
631 knn_bias=float(fusion_raw.get("knn_bias", 0.6)), 645 knn_bias=float(fusion_raw.get("knn_bias", 0.6)),
632 knn_exponent=float(fusion_raw.get("knn_exponent", 0.2)), 646 knn_exponent=float(fusion_raw.get("knn_exponent", 0.2)),
  647 + knn_text_bias=float(
  648 + fusion_raw.get("knn_text_bias", fusion_raw.get("knn_bias", 0.6))
  649 + ),
  650 + knn_text_exponent=float(fusion_raw.get("knn_text_exponent", 0.0)),
  651 + knn_image_bias=float(
  652 + fusion_raw.get("knn_image_bias", fusion_raw.get("knn_bias", 0.6))
  653 + ),
  654 + knn_image_exponent=float(fusion_raw.get("knn_image_exponent", 0.0)),
633 fine_bias=float(fusion_raw.get("fine_bias", 0.00001)), 655 fine_bias=float(fusion_raw.get("fine_bias", 0.00001)),
634 fine_exponent=float(fusion_raw.get("fine_exponent", 1.0)), 656 fine_exponent=float(fusion_raw.get("fine_exponent", 1.0)),
635 text_translation_weight=float( 657 text_translation_weight=float(
@@ -655,6 +677,14 @@ class AppConfigLoader: @@ -655,6 +677,14 @@ class AppConfigLoader:
655 677
656 translation_raw = raw.get("translation") if isinstance(raw.get("translation"), dict) else {} 678 translation_raw = raw.get("translation") if isinstance(raw.get("translation"), dict) else {}
657 normalized_translation = build_translation_config(translation_raw) 679 normalized_translation = build_translation_config(translation_raw)
  680 + local_translation_backends = {"local_nllb", "local_marian"}
  681 + for capability_name, capability_cfg in normalized_translation["capabilities"].items():
  682 + backend_name = str(capability_cfg.get("backend") or "").strip().lower()
  683 + if backend_name not in local_translation_backends:
  684 + continue
  685 + for path_key in ("model_dir", "ct2_model_dir"):
  686 + if capability_cfg.get(path_key) not in (None, ""):
  687 + capability_cfg[path_key] = str(self._resolve_project_path_value(capability_cfg[path_key]).resolve())
658 translation_config = TranslationServiceConfig( 688 translation_config = TranslationServiceConfig(
659 endpoint=str(normalized_translation["service_url"]).rstrip("/"), 689 endpoint=str(normalized_translation["service_url"]).rstrip("/"),
660 timeout_sec=float(normalized_translation["timeout_sec"]), 690 timeout_sec=float(normalized_translation["timeout_sec"]),
@@ -749,7 +779,7 @@ class AppConfigLoader: @@ -749,7 +779,7 @@ class AppConfigLoader:
749 port=port, 779 port=port,
750 backend=backend_name, 780 backend=backend_name,
751 runtime_dir=( 781 runtime_dir=(
752 - str(v) 782 + str(self._resolve_project_path_value(v).resolve())
753 if (v := instance_raw.get("runtime_dir")) not in (None, "") 783 if (v := instance_raw.get("runtime_dir")) not in (None, "")
754 else None 784 else None
755 ), 785 ),
@@ -787,6 +817,12 @@ class AppConfigLoader: @@ -787,6 +817,12 @@ class AppConfigLoader:
787 rerank=rerank_config, 817 rerank=rerank_config,
788 ) 818 )
789 819
  820 + def _resolve_project_path_value(self, value: Any) -> Path:
  821 + candidate = Path(str(value)).expanduser()
  822 + if candidate.is_absolute():
  823 + return candidate
  824 + return self.project_root / candidate
  825 +
790 def _build_tenants_config(self, raw: Dict[str, Any]) -> TenantCatalogConfig: 826 def _build_tenants_config(self, raw: Dict[str, Any]) -> TenantCatalogConfig:
791 if not isinstance(raw, dict): 827 if not isinstance(raw, dict):
792 raise ConfigurationError("tenant_config must be a mapping") 828 raise ConfigurationError("tenant_config must be a mapping")
@@ -119,6 +119,18 @@ class RerankFusionConfig: @@ -119,6 +119,18 @@ class RerankFusionConfig:
119 knn_tie_breaker: float = 0.0 119 knn_tie_breaker: float = 0.0
120 knn_bias: float = 0.6 120 knn_bias: float = 0.6
121 knn_exponent: float = 0.2 121 knn_exponent: float = 0.2
  122 + #: Optional additive floor for the weighted text KNN term.
  123 + #: Falls back to knn_bias when omitted in config loading.
  124 + knn_text_bias: float = 0.6
  125 + #: Optional extra multiplicative term on weighted text KNN.
  126 + #: Uses knn_text_bias as the additive floor.
  127 + knn_text_exponent: float = 0.0
  128 + #: Optional additive floor for the weighted image KNN term.
  129 + #: Falls back to knn_bias when omitted in config loading.
  130 + knn_image_bias: float = 0.6
  131 + #: Optional extra multiplicative term on weighted image KNN.
  132 + #: Uses knn_image_bias as the additive floor.
  133 + knn_image_exponent: float = 0.0
122 fine_bias: float = 0.00001 134 fine_bias: float = 0.00001
123 fine_exponent: float = 1.0 135 fine_exponent: float = 1.0
124 #: 翻译子句 named query 分数相对原文 base_query 的权重(加权后再与原文做 dismax 融合) 136 #: 翻译子句 named query 分数相对原文 base_query 的权重(加权后再与原文做 dismax 融合)
@@ -143,6 +155,18 @@ class CoarseRankFusionConfig: @@ -143,6 +155,18 @@ class CoarseRankFusionConfig:
143 knn_tie_breaker: float = 0.0 155 knn_tie_breaker: float = 0.0
144 knn_bias: float = 0.6 156 knn_bias: float = 0.6
145 knn_exponent: float = 0.2 157 knn_exponent: float = 0.2
  158 + #: Optional additive floor for the weighted text KNN term.
  159 + #: Falls back to knn_bias when omitted in config loading.
  160 + knn_text_bias: float = 0.6
  161 + #: Optional extra multiplicative term on weighted text KNN.
  162 + #: Uses knn_text_bias as the additive floor.
  163 + knn_text_exponent: float = 0.0
  164 + #: Optional additive floor for the weighted image KNN term.
  165 + #: Falls back to knn_bias when omitted in config loading.
  166 + knn_image_bias: float = 0.6
  167 + #: Optional extra multiplicative term on weighted image KNN.
  168 + #: Uses knn_image_bias as the additive floor.
  169 + knn_image_exponent: float = 0.0
146 #: 翻译子句 named query 分数相对原文 base_query 的权重(加权后再与原文做 dismax 融合) 170 #: 翻译子句 named query 分数相对原文 base_query 的权重(加权后再与原文做 dismax 融合)
147 text_translation_weight: float = 0.8 171 text_translation_weight: float = 0.8
148 172
@@ -176,6 +200,9 @@ class RerankConfig: @@ -176,6 +200,9 @@ class RerankConfig:
176 200
177 enabled: bool = True 201 enabled: bool = True
178 rerank_window: int = 384 202 rerank_window: int = 384
  203 + exact_knn_rescore_enabled: bool = False
  204 + #: topN exact vector scoring window; <=0 means "follow rerank_window"
  205 + exact_knn_rescore_window: int = 0
179 timeout_sec: float = 15.0 206 timeout_sec: float = 15.0
180 weight_es: float = 0.4 207 weight_es: float = 0.4
181 weight_ai: float = 0.6 208 weight_ai: float = 0.6
docs/DEVELOPER_GUIDE.md
@@ -389,7 +389,7 @@ services: @@ -389,7 +389,7 @@ services:
389 - **位置**:`tests/`,可按 `unit/`、`integration/` 或按模块划分子目录;公共 fixture 在 `conftest.py`。 389 - **位置**:`tests/`,可按 `unit/`、`integration/` 或按模块划分子目录;公共 fixture 在 `conftest.py`。
390 - **标记**:使用 `@pytest.mark.unit`、`@pytest.mark.integration`、`@pytest.mark.api` 等区分用例类型,便于按需运行。 390 - **标记**:使用 `@pytest.mark.unit`、`@pytest.mark.integration`、`@pytest.mark.api` 等区分用例类型,便于按需运行。
391 - **依赖**:单元测试通过 mock(如 `mock_es_client`、`sample_search_config`)不依赖真实 ES/DB;集成测试需在说明中注明依赖服务。 391 - **依赖**:单元测试通过 mock(如 `mock_es_client`、`sample_search_config`)不依赖真实 ES/DB;集成测试需在说明中注明依赖服务。
392 -- **运行**:`python -m pytest tests/`;仅单元:`python -m pytest tests/unit/` 或 `-m unit` 392 +- **运行**:`python -m pytest tests/`;推荐最小回归:`python -m pytest tests/ci -q`;按模块聚焦可直接指定具体测试文件
393 - **原则**:新增逻辑应有对应测试;修改协议或配置契约时更新相关测试与 fixture。 393 - **原则**:新增逻辑应有对应测试;修改协议或配置契约时更新相关测试与 fixture。
394 394
395 ### 8.3 配置与环境 395 ### 8.3 配置与环境
docs/QUICKSTART.md
@@ -69,7 +69,7 @@ source activate.sh @@ -69,7 +69,7 @@ source activate.sh
69 ./run.sh all 69 ./run.sh all
70 # 仅为薄封装:等价于 ./scripts/service_ctl.sh up all 70 # 仅为薄封装:等价于 ./scripts/service_ctl.sh up all
71 # 说明: 71 # 说明:
72 -# - all = tei cnclip embedding embedding-image translator reranker reranker-fine backend indexer frontend eval-web 72 +# - all = tei cnclip embedding embedding-image translator reranker backend indexer frontend eval-web
73 # - up 会同时启动 monitor daemon(运行期连续失败自动重启) 73 # - up 会同时启动 monitor daemon(运行期连续失败自动重启)
74 # - reranker 为 GPU 强制模式(资源不足会直接启动失败) 74 # - reranker 为 GPU 强制模式(资源不足会直接启动失败)
75 # - TEI 默认使用 GPU;当 TEI_DEVICE=cuda 且 GPU 不可用时会直接失败(不会自动降级到 CPU) 75 # - TEI 默认使用 GPU;当 TEI_DEVICE=cuda 且 GPU 不可用时会直接失败(不会自动降级到 CPU)
@@ -166,7 +166,7 @@ curl -X POST http://localhost:6008/embed/image \ @@ -166,7 +166,7 @@ curl -X POST http://localhost:6008/embed/image \
166 166
167 ```bash 167 ```bash
168 ./scripts/setup_translator_venv.sh 168 ./scripts/setup_translator_venv.sh
169 -./.venv-translator/bin/python scripts/download_translation_models.py --all-local # 如需本地模型 169 +./.venv-translator/bin/python scripts/translation/download_translation_models.py --all-local # 如需本地模型
170 ./scripts/start_translator.sh 170 ./scripts/start_translator.sh
171 171
172 curl -X POST http://localhost:6006/translate \ 172 curl -X POST http://localhost:6006/translate \
docs/Usage-Guide.md
@@ -126,7 +126,7 @@ cd /data/saas-search @@ -126,7 +126,7 @@ cd /data/saas-search
126 126
127 这个脚本会自动: 127 这个脚本会自动:
128 1. 创建日志目录 128 1. 创建日志目录
129 -2. 按目标启动服务(`all`:`tei cnclip embedding embedding-image translator reranker reranker-fine backend indexer frontend eval-web`) 129 +2. 按目标启动服务(`all`:`tei cnclip embedding embedding-image translator reranker backend indexer frontend eval-web`)
130 3. 写入 PID 到 `logs/*.pid` 130 3. 写入 PID 到 `logs/*.pid`
131 4. 执行健康检查 131 4. 执行健康检查
132 5. 启动 monitor daemon(运行期连续失败自动重启) 132 5. 启动 monitor daemon(运行期连续失败自动重启)
@@ -202,7 +202,7 @@ python -m pytest -q tests/test_rerank_client.py tests/test_es_query_builder.py t @@ -202,7 +202,7 @@ python -m pytest -q tests/test_rerank_client.py tests/test_es_query_builder.py t
202 ./scripts/service_ctl.sh restart backend 202 ./scripts/service_ctl.sh restart backend
203 sleep 3 203 sleep 3
204 ./scripts/service_ctl.sh status backend 204 ./scripts/service_ctl.sh status backend
205 -./scripts/evaluation/start_eval.sh.sh batch 205 +./scripts/evaluation/start_eval.sh batch
206 ``` 206 ```
207 207
208 离线批量评估会把标注与报表写到 `artifacts/search_evaluation/`(SQLite、`batch_reports/` 下的 JSON/Markdown 等)。说明与命令见 [scripts/evaluation/README.md](../scripts/evaluation/README.md)。 208 离线批量评估会把标注与报表写到 `artifacts/search_evaluation/`(SQLite、`batch_reports/` 下的 JSON/Markdown 等)。说明与命令见 [scripts/evaluation/README.md](../scripts/evaluation/README.md)。
docs/caches-inventory.md 0 → 100644
@@ -0,0 +1,133 @@ @@ -0,0 +1,133 @@
  1 +# 本项目缓存一览
  2 +
  3 +本文档梳理仓库内**与业务相关的各类缓存**:说明用途、键与过期策略,并汇总运维脚本。按「分布式(Redis)→ 进程内 → 磁盘/模型 → 第三方」组织。
  4 +
  5 +---
  6 +
  7 +## 一、Redis 集中式缓存(生产主路径)
  8 +
  9 +所有下列缓存默认连接 **`infrastructure.redis`**(`config/config.yaml` 与 `REDIS_*` 环境变量),**数据库编号一般为 `db=0`**(脚本可通过参数覆盖)。`snapshot_db` 仅在配置中存在,供快照/运维场景选用,应用代码未按该字段切换业务缓存的 DB。
  10 +
  11 +### 1. 文本 / 图像向量缓存(Embedding)
  12 +
  13 +- **作用**:缓存 BGE/TEI 文本向量与 CN-CLIP 图像向量、CLIP 文本塔向量,避免重复推理。
  14 +- **实现**:`embeddings/redis_embedding_cache.py` 的 `RedisEmbeddingCache`;键构造见 `embeddings/cache_keys.py`。
  15 +- **Key 形态**(最终 Redis 键 = `前缀` + `可选 namespace` + `逻辑键`):
  16 + - **前缀**:`infrastructure.redis.embedding_cache_prefix`(默认 `embedding`,可用 `REDIS_EMBEDDING_CACHE_PREFIX` 覆盖)。
  17 + - **命名空间**:`embeddings/server.py` 与客户端中分为:
  18 + - 文本:`namespace=""` → `{prefix}:{embed:norm0|1:...}`
  19 + - 图像:`namespace="image"` → `{prefix}:image:{embed:模型名:txt:norm0|1:...}`
  20 + - CLIP 文本:`namespace="clip_text"` → `{prefix}:clip_text:{embed:模型名:img:norm0|1:...}`
  21 + - 逻辑键段含 `embed:`、`norm0/1`、模型名(多模态)、过长文本/URL 时用 `h:sha256:...` 摘要(见 `cache_keys.py` 注释)。
  22 +- **值格式**:BF16 压缩后的字节(`embeddings/bf16.py`),非 JSON。
  23 +- **TTL**:`infrastructure.redis.cache_expire_days`(默认 **720 天**,`REDIS_CACHE_EXPIRE_DAYS`)。写入用 `SETEX`;**命中时滑动续期**(`EXPIRE` 刷新为同一时长)。
  24 +- **Redis 客户端**:`decode_responses=False`(二进制)。
  25 +
  26 +**主要代码**:`embeddings/server.py`、`embeddings/text_encoder.py`、`embeddings/image_encoder.py`。
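The SETEX write plus refresh-on-hit behavior described above can be sketched as follows. This is a minimal illustration, not the repo's `RedisEmbeddingCache`; the `client` is only assumed to expose redis-py style `get` / `expire` / `setex`:

```python
from typing import Optional

CACHE_TTL_SECONDS = 720 * 24 * 3600  # cache_expire_days = 720, as in config

def cache_put(client, key: str, value: bytes, ttl: int = CACHE_TTL_SECONDS) -> None:
    # SETEX writes the value and its TTL in one call
    client.setex(key, ttl, value)

def cache_get(client, key: str, ttl: int = CACHE_TTL_SECONDS) -> Optional[bytes]:
    value = client.get(key)
    if value is not None:
        # sliding expiration: every hit resets the TTL to the full duration
        client.expire(key, ttl)
    return value
```

Frequently hit keys therefore effectively never expire, while cold keys age out after the configured TTL.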
  27 +
  28 +---
  29 +
  30 +### 2. 翻译结果缓存(Translation)
  31 +
  32 +- **作用**:按「翻译模型 + 目标语言 + 原文」缓存译文;支持**模型质量分层探测**(高 tier 模型写入的缓存可被同 tier 或更高 tier 的请求命中,见 `translation/settings.py` 中 `translation_cache_probe_models`)。
  33 +- **Key 形态**:`trans:{model}:{target_lang}:{text前4字符}{sha256全文}`(`translation/cache.py` 的 `build_key`)。
  34 +- **值格式**:UTF-8 译文字符串。
  35 +- **TTL**:`services.translation.cache.ttl_seconds`(默认 **62208000 秒 = 720 天**)。若 `sliding_expiration: true`,命中时刷新 TTL。
  36 +- **能力级开关**:各 `capabilities.*.use_cache` 为 `false` 时该后端不落 Redis。
  37 +- **Redis 客户端**:`decode_responses=True`。
  38 +
  39 +**主要代码**:`translation/cache.py`、`translation/service.py`;翻译 HTTP 服务:`api/translator_app.py`(`get_translation_service()` 使用 `lru_cache` 单例,见下文进程内缓存)。
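Based on the documented key shape, `build_key` can be approximated like this (a sketch only — `translation/cache.py` is authoritative; the 4-character prefix and sha256-over-full-text are taken from the format above):

```python
import hashlib

def build_translation_cache_key(model: str, target_lang: str, text: str) -> str:
    """Key shape: trans:{model}:{target_lang}:{first 4 chars of text}{sha256 of full text}."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"trans:{model}:{target_lang}:{text[:4]}{digest}"
```

The readable 4-character prefix makes keys easy to eyeball in `redis-cli`, while the full-text digest keeps them unique and bounded in length.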
  40 +
  41 +---
  42 +
  43 +### 3. 商品内容理解 / Anchors 与语义分析缓存(Indexer)
  44 +
  45 +- **作用**:缓存 LLM 对商品标题等拼出的 **prompt 输入** 所做的分析结果(anchors、语义属性等),避免重复调用大模型。键与 `analysis_kind`、`prompt` 契约版本、`target_lang` 及输入摘要相关。
  46 +- **Key 形态**:`{anchor_cache_prefix}:{analysis_kind}:{prompt_contract_hash[:12]}:{target_lang}:{prompt_input[:4]}{md5}`(`indexer/product_enrich.py` 中 `_make_analysis_cache_key`)。
  47 +- **前缀**:`infrastructure.redis.anchor_cache_prefix`(默认 `product_anchors`,`REDIS_ANCHOR_CACHE_PREFIX`)。
  48 +- **值格式**:JSON 字符串(规范化后的分析结果)。
  49 +- **TTL**:`anchor_cache_expire_days`(默认 **30 天**),以秒写入 `SETEX`(**非滑动**,与向量/翻译不同)。
  50 +- **读逻辑**:无 TTL 刷新;仅校验内容是否「有意义」再返回。
  51 +
  52 +**主要代码**:`indexer/product_enrich.py`;与 HTTP 侧对齐说明见 `api/routes/indexer.py` 注释。
  53 +
  54 +---
  55 +
  56 +## 二、进程内缓存(非共享、随进程重启失效)
  57 +
  58 +| 名称 | 用途 | 范围/生命周期 |
  59 +|------|------|----------------|
  60 +| **`get_app_config()`** | 解析并缓存全局 `AppConfig` | `config/loader.py`:`@lru_cache(maxsize=1)`;`reload_app_config()` 可 `cache_clear()` |
  61 +| **`TranslationService` 单例** | 翻译服务进程内复用后端与 Redis 客户端 | `api/translator_app.py`:`get_translation_service()` |
  62 +| **`_nllb_tokenizer_code_by_normalized_key`** | NLLB tokenizer 语言码映射 | `translation/languages.py`:`@lru_cache(maxsize=1)` |
  63 +| **`QueryTextAnalysisCache`** | 单次查询解析内复用分词、tokenizer 结果 | `query/tokenization.py`,随 `QueryParser` 一次 parse |
  64 +| **`_SelectionContext`(SKU 意图)** | 归一化文本、分词、匹配布尔等小字典 | `search/sku_intent_selector.py`,单次选择流程 |
  65 +| **`incremental_service` transformer 缓存** | 按 `tenant_id` 缓存文档转换器 | `indexer/incremental_service.py`,**无界**、多租户进程长期存活时需注意内存 |
  66 +| **NLLB batch 内 `token_count_cache`** | 同一 batch 内避免重复计 token | `translation/backends/local_ctranslate2.py` |
  67 +| **CLIP 分词器 `@lru_cache`**(第三方) | 简单 tokenizer 缓存 | `third-party/clip-as-service/.../simple_tokenizer.py` |
  68 +
  69 +**说明**:`utils/cache.py` 中的 **`DictCache`**(文件 JSON:默认 `.cache/dict_cache.json`)已导出,但仓库内**无直接 `DictCache(` 调用**,视为可复用工具/预留,非当前主路径。
  70 +
  71 +---
  72 +
  73 +## 三、磁盘与模型相关「缓存」(非 Redis)
  74 +
  75 +| 名称 | 用途 | 配置/位置 |
  76 +|------|------|-----------|
  77 +| **Hugging Face / 本地模型目录** | 重排器、翻译本地模型等权重下载与缓存 | `services.rerank.backends.*.cache_dir` 等,常见默认 **`./model_cache`**(`config/config.yaml`) |
  78 +| **vLLM `enable_prefix_caching`** | 重排服务内 **Prefix KV 缓存**(加速同前缀批推理) | `services.rerank.backends.qwen3_vllm*`、`reranker/backends/qwen3_vllm*.py` |
  79 +| **运行时目录** | 重排服务状态/引擎文件 | `services.rerank.instances.*.runtime_dir`(如 `./.runtime/reranker/...`) |
  80 +
  81 +翻译能力里的 **`use_cache: true`**(如 NLLB、Marian)在多数后端指 **推理时的 KV cache(Transformer)**,与 Redis 译文缓存是不同层次;Redis 译文缓存仍由 `TranslationCache` 控制。
  82 +
  83 +---
  84 +
  85 +## 四、Elasticsearch 内部缓存
  86 +
  87 +索引设置中的 `refresh_interval` 等影响近实时可见性,但**不属于应用层键值缓存**。若需调优 ES 查询缓存、节点堆等,见运维文档与集群配置,此处不展开。
  88 +
  89 +---
  90 +
  91 +## 五、运维与巡检脚本(Redis)
  92 +
  93 +| 脚本 | 作用 |
  94 +|------|------|
  95 +| `scripts/redis/redis_cache_health_check.py` | 按 **embedding / translation / anchors** 三类前缀巡检:key 数量估算、TTL 采样、`IDLETIME` 等 |
  96 +| `scripts/redis/redis_cache_prefix_stats.py` | 按前缀统计 key 数量与 **MEMORY USAGE**(可多 DB) |
  97 +| `scripts/redis/redis_memory_heavy_keys.py` | 扫描占用内存最大的 key,辅助排查「统计与总内存不一致」 |
  98 +| `scripts/redis/monitor_eviction.py` | 实时监控 **eviction** 相关事件,用于容量与驱逐策略排查 |
  99 +
  100 +使用前需加载项目配置(如 `source activate.sh`)以保证 `REDIS_CONFIG` 与生产一致。脚本注释中给出了 **`redis-cli` 手工统计**示例(按前缀 `wc -l`、`MEMORY STATS` 等)。
  101 +
  102 +---
  103 +
  104 +## 六、总表(Redis 与各层缓存)
  105 +
  106 +| 缓存名称 | 业务模块 | 存储 | Key 前缀 / 命名模式 | 过期时间 | 过期策略 | 值摘要 | 配置键 / 环境变量 |
  107 +|----------|----------|------|---------------------|----------|----------|--------|-------------------|
  108 +| 文本向量 | 检索 / 索引 / Embedding 服务 | Redis db≈0 | `{embedding_cache_prefix}:*`(逻辑键以 `embed:norm…` 开头) | `cache_expire_days`(默认 720 天) | 写入 TTL + 命中滑动续期 | BF16 字节向量 | `infrastructure.redis.*`;`REDIS_EMBEDDING_CACHE_PREFIX`、`REDIS_CACHE_EXPIRE_DAYS` |
  109 +| 图像向量(CLIP 图) | 图搜 / 多模态 | 同上 | `{prefix}:image:*` | 同上 | 同上 | BF16 字节 | 同上 |
  110 +| CLIP 文本塔向量 | 图搜文本侧 | 同上 | `{prefix}:clip_text:*` | 同上 | 同上 | BF16 字节 | 同上 |
  111 +| 翻译译文 | 查询翻译、翻译服务 | 同上 | `trans:{model}:{lang}:*` | `services.translation.cache.ttl_seconds`(默认 720 天) | 可配置滑动(`sliding_expiration`) | UTF-8 字符串 | `services.translation.cache.*`;各能力 `use_cache` |
  112 +| 商品分析 / Anchors | 索引富化、LLM 内容理解 | 同上 | `{anchor_cache_prefix}:{kind}:{hash}:{lang}:*` | `anchor_cache_expire_days`(默认 30 天) | 固定 TTL,不滑动 | JSON 字符串 | `anchor_cache_prefix`、`anchor_cache_expire_days`;`REDIS_ANCHOR_*` |
  113 +| 应用配置 | 全栈 | 进程内存 | N/A(单例) | 进程生命周期 | `reload_app_config` 清除 | `AppConfig` 对象 | `config/loader.py` |
  114 +| 翻译服务实例 | 翻译 API | 进程内存 | N/A | 进程生命周期 | 单例 | `TranslationService` | `api/translator_app.py` |
  115 +| 查询分词缓存 | 查询解析 | 单次请求内 | N/A | 单次 parse | — | 分词与中间结果 | `query/tokenization.py` |
  116 +| SKU 意图辅助字典 | 搜索排序辅助 | 单次请求内 | N/A | 单次选择 | — | 小 dict | `search/sku_intent_selector.py` |
  117 +| 增量索引 Transformer | 索引管道 | 进程内存 | `tenant_id` 字符串键 | 长期(无界) | 无自动淘汰 | Transformer 元组 | `indexer/incremental_service.py` |
  118 +| 重排 / 翻译模型权重 | 推理服务 | 本地磁盘 | 目录路径 | 无自动删除(人工清理) | — | 模型文件 | `cache_dir: ./model_cache` 等 |
  119 +| vLLM Prefix 缓存 | 重排(Qwen3 等) | GPU/引擎内 | 引擎内部 | 引擎管理 | — | KV Cache | `enable_prefix_caching` |
  120 +| 文件 Dict 缓存(可选) | 通用 | `.cache/dict_cache.json` | 分类 + 自定义 key | 持久直至删除 | — | JSON 可序列化值 | `utils/cache.py`(当前无调用方) |
  121 +
  122 +---
  123 +
  124 +## 七、维护建议(简要)
  125 +
  126 +1. **容量**:三类 Redis 缓存(embedding / trans / anchors)可共用同一实例;大租户或图搜多时 **embedding** 与 **trans** 往往占主要内存,可用 `redis_cache_prefix_stats.py` 分前缀观察。
  127 +2. **键迁移**:变更 `embedding_cache_prefix`、CLIP `model_name` 或 prompt 契约会自然**隔离新键空间**;旧键依赖 TTL 或人工批量删除。
  128 +3. **一致性**:向量缓存对异常向量会 **delete key**(`RedisEmbeddingCache.get`);anchors 依赖 `cache_version` 与契约 hash 防止错误复用。
  129 +4. **监控**:除脚本外,Embedding HTTP 服务健康检查会报告各 lane 的 **`cache_enabled`**(`embeddings/server.py`)。
  130 +
  131 +---
  132 +
  133 +*文档随代码扫描生成;若新增 Redis 用途,请同步更新本文件与 `scripts/redis/redis_cache_health_check.py` 中的 `_load_known_cache_types()`。*
docs/issues/issue-2026-04-08-eval框架主指标ERR的问题以及bm25调参-done-0408.md 0 → 100644
@@ -0,0 +1,120 @@ @@ -0,0 +1,120 @@
1 +1. The main evaluation metrics for the retrieval system are currently:
2 + "NDCG@20, NDCG@50, ERR@10, Strong_Precision@10, Strong_Precision@20, "
3 +Judging from `_err_at_k`, the computation logic seems fine.
4 +The problem now is that the ERR metric often seems to trend in the opposite direction from the other metrics. Please re-analyze whether it is suitable as one of the primary metrics and what its current issues are.
5 +
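For reference, the standard ERR@k definition (Chapelle et al., 2009) looks like the sketch below; the repo's `_err_at_k` may differ in its grade-to-probability mapping. It also illustrates why ERR can move against NDCG: a single highly relevant result at rank 1 nearly saturates the metric, so improvements further down the list barely register.

```python
def err_at_k(grades, k, g_max=None):
    """Expected Reciprocal Rank: sum over ranks r of (1/r) * P(user stops at r)."""
    grades = list(grades)[:k]
    if g_max is None:
        g_max = max(grades) if grades else 1
    err, p_continue = 0.0, 1.0
    for r, g in enumerate(grades, start=1):
        p_stop = (2 ** g - 1) / (2 ** g_max)  # map relevance grade to stop probability
        err += p_continue * p_stop / r
        p_continue *= 1.0 - p_stop
    return err
```

With a single grade-3 result at rank 1 (g_max=3), the score is already 0.875, so everything below rank 1 can contribute at most 0.125 — one plausible reason ERR@10 disagrees with NDCG@20/50 and Strong_Precision, which weight the whole window more evenly.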
6 +2. The current BM25 parameters are:
7 +"b": 0.1,
8 +"k1": 0.3
9 +The corresponding baseline is /data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260408T055948Z_00b6a8aa3d.md (Primary_Metric_Score: 0.604555
10 +
11 +)
12 +
13 +(much better than when b and k1 were both set to 0; the all-zero run: /data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260407T150946Z_00b6a8aa3d.md
14 + Primary_Metric_Score: 0.602598
15 +
16 +)
17 +
18 +The background for changing these two parameters from 0 to 0.1/0.3:
  19 +This change adjusts the BM25 parameters used by the combined query.
  20 +
  21 +Previously, both `b` and `k1` were set to `0.0`. The original intention was to avoid two common issues in e-commerce search relevance:
  22 +
  23 +1. Over-penalizing longer product titles
  24 + In product search, a shorter title should not automatically rank higher just because BM25 favors shorter fields. For example, for a query like “遥控车”, a product whose title is simply “遥控车” is not necessarily a better candidate than a product with a slightly longer but more descriptive title. In practice, extremely short titles may even indicate lower-quality catalog data.
  25 +
  26 +2. Over-rewarding repeated occurrences of the same term
  27 + For longer queries such as “遥控喷雾翻滚多功能车玩具车”, the default BM25 behavior may give too much weight to a term that appears multiple times (for example “遥控”), even when other important query terms such as “喷雾” or “翻滚” are missing. This can cause products with repeated partial matches to outrank products that actually cover more of the user intent.
  28 +
  29 +Setting both parameters to zero was an intentional way to suppress length normalization and term-frequency amplification. However, after introducing a `combined_fields` query, this configuration becomes too aggressive. Since `combined_fields` scores multiple fields as a unified relevance signal, completely disabling both effects may also remove useful ranking information, especially when we still want documents matching more query terms across fields to be distinguishable from weaker matches.
  30 +
  31 +This update therefore relaxes the previous setting and reintroduces a controlled amount of BM25 normalization/scoring behavior. The goal is to keep the original intent — avoiding short-title bias and excessive repeated-term gain — while allowing the combined query to better preserve meaningful relevance differences across candidates.
  32 +
  33 +Expected effect:
  34 +- reduce the bias toward unnaturally short product titles
  35 +- limit score inflation caused by repeated occurrences of the same term
  36 +- improve ranking stability for `combined_fields` queries
  37 +- better reward candidates that cover more of the overall query intent, instead of those that only repeat a subset of terms
  38 +
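For intuition, the per-term BM25 contribution in its standard Lucene-style form (an illustration of the textbook formula, not this repo's code) shows exactly what the two parameters control:

```python
def bm25_term_score(tf: float, doc_len: float, avg_doc_len: float,
                    idf: float, k1: float, b: float) -> float:
    """Single-term BM25 contribution with saturation and length normalization."""
    norm = 1.0 - b + b * (doc_len / avg_doc_len)      # b=0 -> document length is ignored
    return idf * tf * (k1 + 1.0) / (tf + k1 * norm)   # k1=0 -> collapses to plain idf
```

With `b=0, k1=0` every matching term contributes exactly its idf (a pure coverage signal), which is why the `combined_fields` query loses all tf and length information; `b=0.1, k1=0.3` reintroduces a small, controlled amount of both (the Elasticsearch defaults are `k1=1.2`, `b=0.75`).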
  39 +
40 +Since the experiment was effective, please continue with the experiments.
41 +
42 +Please run these four rounds of experiments, compare the results, and tune the BM25 parameters:
43 +{ "b": 0.10, "k1": 0.30 }
44 +{ "b": 0.20, "k1": 0.60 }
45 +{ "b": 0.50, "k1": 1.0 }
46 +{ "b": 0.10, "k1": 0.75 }
47 +
48 +How to change the index-level setting (BM25 `similarity.default`):
49 +
50 +`settings.similarity` in `mappings/search_products.json` only takes effect when **creating a new index**; for an **existing index**, close the index first, then `PUT _settings`, and finally reopen it.
51 +
52 +**Use case**: adjusting the default BM25 `b` and `k1` (for example, aligning with the repo mapping: `b: 0.1`, `k1: 0.3`).
53 +
  54 +```bash
55 +# Replace as needed: index name, credentials, ES address
  56 +INDEX="search_products_tenant_163"
  57 +AUTH='saas:4hOaLaf41y2VuI8y'
  58 +ES="http://localhost:9200"
  59 +
60 +# 1) Close the index (write requests will fail; mind the maintenance window)
  61 +curl -s -u "$AUTH" -X POST "$ES/${INDEX}/_close"
  62 +
63 +# 2) Update settings (example only; when identical to the mappings default you can copy it verbatim)
  64 +curl -s -u "$AUTH" -X PUT "$ES/${INDEX}/_settings" \
  65 + -H 'Content-Type: application/json' \
  66 + -d '{
  67 + "index": {
  68 + "similarity": {
  69 + "default": {
  70 + "type": "BM25",
  71 + "b": 0.1,
  72 + "k1": 0.3
  73 + }
  74 + }
  75 + }
  76 +}'
  77 +
78 +# 3) Reopen the index
  79 +curl -s -u "$AUTH" -X POST "$ES/${INDEX}/_open"
  80 +```
  81 +
82 +**Check that the change took effect**:
  83 +
  84 +```bash
  85 +curl -s -u "$AUTH" -X GET "$ES/${INDEX}/_settings?filter_path=**.similarity&pretty"
  86 +```
  87 +
88 +Expect to see `type`, `b`, `k1` under `similarity.default` in the response (the API may return the numbers as strings; that is normal).
  89 +
90 +**Multi-tenant batch**: list the indices first, then repeat the close → settings → open sequence above for each `search_products_tenant_*` index.
  91 +
  92 +```bash
  93 +curl -s -u "$AUTH" -X GET "$ES/_cat/indices/search_products_tenant_*?h=index&v"
  94 +```
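The per-index close → settings → open loop can also be scripted. A minimal Python sketch of the request sequence (auth and HTTP transport omitted; `b=0.5, k1=1.0` used as the default, per the winning combination — adjust as needed):

```python
import json

def bm25_settings_payload(b: float, k1: float) -> str:
    """JSON body for PUT {index}/_settings (the index must be closed first)."""
    return json.dumps(
        {"index": {"similarity": {"default": {"type": "BM25", "b": b, "k1": k1}}}}
    )

def bm25_update_steps(index: str, b: float = 0.5, k1: float = 1.0):
    """Request sequence for one index: close -> update settings -> open."""
    return [
        ("POST", f"/{index}/_close", None),
        ("PUT", f"/{index}/_settings", bm25_settings_payload(b, k1)),
        ("POST", f"/{index}/_open", None),
    ]
```

Iterate `bm25_update_steps(idx)` over every index returned by `_cat/indices/search_products_tenant_*` and issue the three requests in order with your HTTP client of choice.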
  95 +
  96 +
  97 +
  98 +Experiment results:
  99 +Four rounds of experiments are complete, and the current settings on `search_products_tenant_163` have been left at the best combination, `b=0.5, k1=1.0`; I have just verified that they are in effect.
  100 +
  101 +The comparison is as follows:
  102 +
  103 +- Current production baseline `b=0.10, k1=0.30`: `0.604555`
  104 + Report: [batch_20260408T055948Z_00b6a8aa3d.md](/data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260408T055948Z_00b6a8aa3d.md)
  105 +- Rerun of `b=0.10, k1=0.30`: `0.604298`
  106 + Report: [batch_20260408T064011Z_00b6a8aa3d.json](/data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260408T064011Z_00b6a8aa3d.json)
  107 +- Experiment `b=0.20, k1=0.60`: `0.604274`
  108 + Report: [batch_20260408T064123Z_00b6a8aa3d.json](/data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260408T064123Z_00b6a8aa3d.json)
  109 +- Experiment `b=0.50, k1=1.0`: `0.606041`
  110 + Report: [batch_20260408T064216Z_00b6a8aa3d.json](/data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260408T064216Z_00b6a8aa3d.json)
  111 +- Experiment `b=0.10, k1=0.75`: `0.603438`
  112 + Report: [batch_20260408T064304Z_00b6a8aa3d.json](/data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260408T064304Z_00b6a8aa3d.json)
  113 +- Old baseline `b=0.0, k1=0.0`: `0.602598`
  114 + Report: [batch_20260407T150946Z_00b6a8aa3d.md](/data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260407T150946Z_00b6a8aa3d.md)
  115 +
  116 +Conclusion: of these four combinations, `b=0.50, k1=1.0` performs best, improving on the current baseline of `0.604555` by `+0.001486` and on `0/0` by `+0.003443`. `0.20/0.60` is essentially on par with `0.10/0.30`, and `0.10/0.75` is actually worse, which suggests that in this round raising `k1` alone is inferior to relaxing `b` and `k1` together.
  117 +
  118 +One note: I only changed the index-level setting on the existing `search_products_tenant_163` index; the default mapping in the repository is unchanged. [search_products.json](/data/saas-search/mappings/search_products.json) is still `0.1/0.3`. If you want, the next step is to change the mapping default to `0.5/1.0` as well and batch-sync the other `search_products_tenant_*` indices.
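The conclusion above can be sanity-checked with a standalone recomputation of the classic BM25 term score (a sketch only; Lucene's actual implementation differs in details such as IDF smoothing, but the roles of `b` and `k1` are the same):

```python
import math

def bm25_term_score(tf, doc_len, avg_len, idf, k1, b):
    # Classic BM25 contribution of one query term:
    # idf * tf*(k1+1) / (tf + k1*(1 - b + b*doc_len/avg_len))
    norm = k1 * (1.0 - b + b * doc_len / avg_len)
    return idf * (tf * (k1 + 1.0)) / (tf + norm)

idf = math.log(1000 / 10)  # toy IDF for a term appearing in 10 of 1000 docs

# Old setting b=0, k1=0: term frequency and title length are ignored
# entirely, so every matching title gets the same idf-only contribution.
old_short = bm25_term_score(tf=1, doc_len=4,  avg_len=12, idf=idf, k1=0.0, b=0.0)
old_spam  = bm25_term_score(tf=5, doc_len=30, avg_len=12, idf=idf, k1=0.0, b=0.0)

# New setting b=0.5, k1=1.0: repeated terms help with diminishing returns,
# and longer-than-average titles are mildly (not harshly) penalized.
new_tf1 = bm25_term_score(tf=1, doc_len=12, avg_len=12, idf=idf, k1=1.0, b=0.5)
new_tf5 = bm25_term_score(tf=5, doc_len=12, avg_len=12, idf=idf, k1=1.0, b=0.5)
print(old_short, old_spam, new_tf1, new_tf5)
```

With `k1=0` the length-norm factor is multiplied away, which is why `b` had no effect under the old `0/0` setting; relaxing both together is what restores the two signals independently.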
  119 +
  120 +
docs/issues/issue-2026-04-12-test-env.md 0 → 100644
@@ -0,0 +1,43 @@
  1 +120.76.41.98, port 22, username and password:
  2 +tw twtw@123 (has sudo privileges)
  3 +The directory /home/tw/saas-search on this machine already has this project deployed.
  4 +Please get the project running:
  5 +1. Check out a test-environment branch for me. On that branch, disable the reranker and translation models, because this machine's GPU has limited memory (the embedding model can stay).
  6 +2. On that branch, bring up all the services.
  7 +3. Install an ES with Docker, following this project's ES9*.md docs. This machine already has a system-level elasticsearch, so to avoid mutual interference, install the ES9 this project depends on into Docker, and adapt the ES address in the test-environment config accordingly.
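A single-node ES-in-Docker setup for step 3 could be sketched as below (the image tag, container name, volume name, and security flag are assumptions to be adapted from the project's ES9*.md doc; 19200 matches the test-config port mentioned later and avoids the system ES on 9200):

```bash
docker run -d --name saas-es9 \
  -p 19200:9200 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  -v saas-es9-data:/usr/share/elasticsearch/data \
  docker.elastic.co/elasticsearch/elasticsearch:9.0.0
```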
  8 +
  9 +
  10 +1. The point is not to disable 6005; port 6005 already has the corresponding text service running, so just use it directly.
  11 +2. The service on 6005 was in fact started from an early historical version of this project, in another directory: /home/tw/SearchEngine. Look at its startup configuration:
  12 +nohup bash scripts/start_embedding_service.sh > log.start_embedding_service.0412 2>&1 &
  13 +That is how it was started.
  14 +Check which text-embedding scheme and which model it is configured with, and align with it (I mean the current test branch).
  15 +
  16 +
  17 +
  18 +
  19 +
  20 +
  21 +
  22 +I deployed a test environment on this machine:
  23 +120.76.41.98, port 22, username and password:
  24 +tw twtw@123 (has sudo privileges)
  25 +cd /home/tw/saas-search
  26 +$ git branch
  27 +  master
  28 +* test/small-gpu-es9
  29 +
  30 +I want the only differences to be:
  31 +1. Different ES config (the test environment connects to a Dockerized ES on that machine, port 19200) and different Redis config.
  32 +2. Reranker disabled; do not start the reranker service.
  33 +
  34 +Nothing else should differ.
  35 +
  36 +But startup has problems: translation is currently erroring out.
  37 +This shows the project's portability is poor. Please investigate the cause of the failure, fix it locally first (on this machine, the master branch in the current directory) to improve portability, then update the test machine, keeping the test branch different from master only in a small set of config-level changes; then get translation running on the test machine, and finally bring up the whole service stack.
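One portability pattern that keeps the test branch to config-only differences is resolving endpoints from the environment with local defaults (a sketch; the variable names `ES_HOST`/`ES_PORT` are hypothetical, not this project's actual config keys):

```python
import os

def es_url(env=os.environ):
    # Resolve the ES endpoint from the environment, falling back to the
    # local default, so test/prod differ only in .env, not in code.
    host = env.get("ES_HOST", "127.0.0.1")
    port = env.get("ES_PORT", "9200")
    return f"http://{host}:{port}"

print(es_url({}))                    # default: http://127.0.0.1:9200
print(es_url({"ES_PORT": "19200"}))  # test box: http://127.0.0.1:19200
```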
  38 +
  39 +
  40 +
  41 +
  42 +
  43 +
docs/issues/issue-2026-04-14-粗排流程放入ES-TODO-env 0 → 100644
@@ -0,0 +1,25 @@
  1 +Requirement:
  2 +Currently 160 results (rerank_window: 160) enter reranking. In reranking, text-vector and image-vector relevance both act as factors in the fusion formula (in both coarse ranking and the reranker):
  3 +knn_score
  4 +text_knn
  5 +image_knn
  6 +text_factor
  7 +knn_factor
  8 +However, text-vector recall and image-vector recall use KNN index retrieval, so not every result carries these two scores; either of them can be 0.
  9 +To solve this, one approach is: for the 160 results that actually enter reranking, find which ones are missing the text and image vector-recall scores, then either have ES compute them somehow, or pull the vectors back from ES and compute them ourselves, or set up the recall request so that the top N results are guaranteed to carry both scores. I am not sure what options exist, and none of these feel great; please think it through.
  10 +
  11 +One approach under consideration:
  12 +In the "first ES search", compute exact vector scores only for the top N, using rescore or retriever.rescorer (the official docs explicitly support multi-stage rescore and score_mode: multiply, and the examples even show function_score / script_score inside rescore).
  13 +That means you can:
  14 +keep the current lexical + text knn + image knn retrieval for the initial pass
  15 +rescore over window_size=160
  16 +use an exact script_score to fill in text/image vector scores for the top 160
  17 +and, while at it, move the current local coarse fusion back into ES
  18 +
  19 +export ES_AUTH="saas:4hOaLaf41y2VuI8y"
  20 +export ES="http://127.0.0.1:9200"
  21 +"index":"search_products_tenant_163"
  22 +
  23 +One detail has surfaced: vector functions such as dotProduct() work in the script_score scoring context but are not recognized in the script_fields field-fetching context. So if we want to pass the exact scores back to rerank via script_fields, we have to write the array loop ourselves rather than call the built-in vector functions.
  24 +
  25 +Can the base_query, base_query_trans_zh, knn_query, and image_knn_query scores needed by the rerank formula still be obtained? Please consider this and try hard to find ways to get these scores; if they really cannot be obtained, think of alternatives, such as simplifying the scoring formula.
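As a reference point for the "pull the vectors back and compute them ourselves" option, a minimal client-side backfill could look like the sketch below (field names such as `text_vector`/`text_knn` are hypothetical; the inner loop is the same arithmetic a Painless script_fields script would have to spell out, since dotProduct() is unavailable in that context):

```python
def dot(a, b):
    # Manual dot product, element by element.
    return sum(x * y for x, y in zip(a, b))

def backfill_knn_scores(hits, query_vec, vec_field="text_vector",
                        score_field="text_knn"):
    # For hits that came back without a KNN score (e.g. recalled only
    # lexically), compute it from the stored vector; leave others untouched.
    for h in hits:
        if h.get(score_field, 0.0) == 0.0 and vec_field in h:
            h[score_field] = dot(query_vec, h[vec_field])
    return hits

hits = [
    {"spu_id": "a", "text_knn": 0.92},                            # scored by KNN recall
    {"spu_id": "b", "text_knn": 0.0, "text_vector": [0.6, 0.8]},  # lexical-only hit
]
backfill_knn_scores(hits, query_vec=[0.6, 0.8])
print(hits[1]["text_knn"])
```

If the vectors are indexed with cosine similarity, both sides would need to be normalized before the dot product for the backfilled scores to be comparable with the ES-computed ones.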
docs/工作总结-微服务性能优化与架构.md
@@ -98,7 +98,7 @@ instruction: "Given a shopping query, rank product titles by relevance"
98 **能力**:支持根据商品标题批量生成 **qanchors**(锚文本)、**enriched_attributes**、**tags**,供索引与 suggest 使用。 98 **能力**:支持根据商品标题批量生成 **qanchors**(锚文本)、**enriched_attributes**、**tags**,供索引与 suggest 使用。
99 99
100 **具体内容**: 100 **具体内容**:
101 -- **接口**:`POST /indexer/enrich-content`(Indexer 服务端口 **6004**)。请求体为 `items` 数组,每项含 `spu_id`、`title`(必填)及可选多语言标题等;单次请求最多 **50 条**,建议批量调用。响应 `results` 与 `items` 一一对应,每项含 `spu_id`、`qanchors`(按语言键,如 `qanchors.zh`、`qanchors.en`,逗号分隔短语)、`enriched_attributes`、`tags`。 101 +- **接口**:`POST /indexer/enrich-content`(FacetAwareMatching 服务端口 **6001**)。请求体为 `items` 数组,每项含 `spu_id`、`title`(必填)及可选多语言标题等;单次请求最多 **50 条**,建议批量调用。响应 `results` 与 `items` 一一对应,每项含 `spu_id`、`qanchors`(按语言键,如 `qanchors.zh`、`qanchors.en`,逗号分隔短语)、`enriched_attributes`、`tags`。
102 - **索引侧**:微服务组合方式下,调用方先拿不含 qanchors/tags 的 doc,再调用本接口补齐后写入 ES 的 `qanchors.{lang}` 等字段;索引 transformer(`indexer/document_transformer.py`、`indexer/product_enrich.py`)内也可在构建 doc 时调用内容理解逻辑,写入 `qanchors.{lang}`。 102 - **索引侧**:微服务组合方式下,调用方先拿不含 qanchors/tags 的 doc,再调用本接口补齐后写入 ES 的 `qanchors.{lang}` 等字段;索引 transformer(`indexer/document_transformer.py`、`indexer/product_enrich.py`)内也可在构建 doc 时调用内容理解逻辑,写入 `qanchors.{lang}`。
103 - **Suggest 侧**:`suggestion/builder.py` 从 ES 商品索引读取 `_source: ["id", "spu_id", "title", "qanchors"]`,对 `qanchors.{lang}` 用 `_split_qanchors` 拆成词条,以 `source="qanchor"` 加入候选,排序时 `qanchor` 权重大于纯 title(`add_product("qanchor", ...)`);suggest 配置中 `sources: ["query_log", "qanchor"]` 表示候选来源包含 qanchor。 103 - **Suggest 侧**:`suggestion/builder.py` 从 ES 商品索引读取 `_source: ["id", "spu_id", "title", "qanchors"]`,对 `qanchors.{lang}` 用 `_split_qanchors` 拆成词条,以 `source="qanchor"` 加入候选,排序时 `qanchor` 权重大于纯 title(`add_product("qanchor", ...)`);suggest 配置中 `sources: ["query_log", "qanchor"]` 表示候选来源包含 qanchor。
104 - **实现与依赖**:内容理解内部使用大模型(需 `DASHSCOPE_API_KEY`),支持多语言与 Redis 缓存(如 `product_anchors`);逻辑与 `indexer/product_enrich` 一致。 104 - **实现与依赖**:内容理解内部使用大模型(需 `DASHSCOPE_API_KEY`),支持多语言与 Redis 缓存(如 `product_anchors`);逻辑与 `indexer/product_enrich` 一致。
@@ -129,12 +129,12 @@ instruction: "Given a shopping query, rank product titles by relevance"
129 - 可选:embedding(text) **6005**、embedding-image **6008**、translator **6006**、reranker **6007**、tei **8080**、cnclip **51000**。 129 - 可选:embedding(text) **6005**、embedding-image **6008**、translator **6006**、reranker **6007**、tei **8080**、cnclip **51000**。
130 - 端口可由环境变量覆盖:`API_PORT`、`INDEXER_PORT`、`FRONTEND_PORT`、`EVAL_WEB_PORT`、`EMBEDDING_TEXT_PORT`、`EMBEDDING_IMAGE_PORT`、`TRANSLATION_PORT`、`RERANKER_PORT`、`TEI_PORT`、`CNCLIP_PORT`。 130 - 端口可由环境变量覆盖:`API_PORT`、`INDEXER_PORT`、`FRONTEND_PORT`、`EVAL_WEB_PORT`、`EMBEDDING_TEXT_PORT`、`EMBEDDING_IMAGE_PORT`、`TRANSLATION_PORT`、`RERANKER_PORT`、`TEI_PORT`、`CNCLIP_PORT`。
131 - **命令**: 131 - **命令**:
132 - - `./scripts/service_ctl.sh start [service...]` 或 `up all` / `start all`(all 含 tei、cnclip、embedding、embedding-image、translator、reranker、reranker-fine、backend、indexer、frontend、eval-web,按依赖顺序);`stop`、`restart`、`down` 同参数;`status` 默认列出所有服务。 132 + - `./scripts/service_ctl.sh start [service...]` 或 `up all` / `start all`(all 含 tei、cnclip、embedding、embedding-image、translator、reranker、backend、indexer、frontend、eval-web,按依赖顺序);`stop`、`restart`、`down` 同参数;`status` 默认列出所有服务。
133 - 启动时:backend/indexer/frontend/embedding/translator/reranker 会写 pid 到 `logs/<service>.pid`,并执行 `wait_for_health`(GET `http://127.0.0.1:<port>/health`);reranker 健康重试 90 次,其余 30 次;TEI 校验 Docker 容器存在且 `/health` 成功;cnclip 无 HTTP 健康则仅校验进程/端口。 133 - 启动时:backend/indexer/frontend/embedding/translator/reranker 会写 pid 到 `logs/<service>.pid`,并执行 `wait_for_health`(GET `http://127.0.0.1:<port>/health`);reranker 健康重试 90 次,其余 30 次;TEI 校验 Docker 容器存在且 `/health` 成功;cnclip 无 HTTP 健康则仅校验进程/端口。
134 - **监控常驻**: 134 - **监控常驻**:
135 - `./scripts/service_ctl.sh monitor-start <targets>` 启动后台监控进程,将 targets 写入 `logs/service-monitor.targets`,pid 写入 `logs/service-monitor.pid`,日志追加到 `logs/service-monitor.log`。 135 - `./scripts/service_ctl.sh monitor-start <targets>` 启动后台监控进程,将 targets 写入 `logs/service-monitor.targets`,pid 写入 `logs/service-monitor.pid`,日志追加到 `logs/service-monitor.log`。
136 - - 轮询间隔 `MONITOR_INTERVAL_SEC` 默认 **10** 秒;连续 **3** 次(`MONITOR_FAIL_THRESHOLD`)健康失败则触发重启;重启冷却 `MONITOR_RESTART_COOLDOWN_SEC` 默认 **30** 秒;每小时最多重启 `MONITOR_MAX_RESTARTS_PER_HOUR` 默认 **6** 次;超限时调用 `scripts/wechat_alert.py` 告警(若存在)。  
137 -- **日志**:各服务按日滚动到 `logs/<service>-<date>.log`,通过 `scripts/daily_log_router.sh` 与 `LOG_RETENTION_DAYS`(默认 30)控制保留。 136 + - 轮询间隔 `MONITOR_INTERVAL_SEC` 默认 **10** 秒;连续 **3** 次(`MONITOR_FAIL_THRESHOLD`)健康失败则触发重启;重启冷却 `MONITOR_RESTART_COOLDOWN_SEC` 默认 **30** 秒;每小时最多重启 `MONITOR_MAX_RESTARTS_PER_HOUR` 默认 **6** 次;超限时调用 `scripts/ops/wechat_alert.py` 告警(若存在)。
  137 +- **日志**:各服务按日滚动到 `logs/<service>-<date>.log`,通过 `scripts/ops/daily_log_router.sh` 与 `LOG_RETENTION_DAYS`(默认 30)控制保留。
138 138
139 详见:`scripts/service_ctl.sh` 内注释及 `docs/Usage-Guide.md`。 139 详见:`scripts/service_ctl.sh` 内注释及 `docs/Usage-Guide.md`。
140 140
@@ -153,12 +153,12 @@ instruction: "Given a shopping query, rank product titles by relevance"
153 153
154 ## 三、性能测试报告摘要 154 ## 三、性能测试报告摘要
155 155
156 -以下数据来自 `docs/性能测试报告.md`,测试时间 **2026-03-12**,环境:**8 vCPU**(Intel Xeon Platinum 8255C @ 2.50GHz)、**约 15Gi 可用内存**;租户 **162** 文档数约 **53**(search/search/suggestions/rerank 与文档规模相关)。压测工具:`scripts/perf_api_benchmark.py`,场景×并发矩阵,每档 **20s**。 156 +以下数据来自 `docs/性能测试报告.md`,测试时间 **2026-03-12**,环境:**8 vCPU**(Intel Xeon Platinum 8255C @ 2.50GHz)、**约 15Gi 可用内存**;租户 **162** 文档数约 **53**(search/search/suggestions/rerank 与文档规模相关)。压测工具:`benchmarks/perf_api_benchmark.py`,场景×并发矩阵,每档 **20s**。
157 157
158 **复现命令(四场景×四并发)**: 158 **复现命令(四场景×四并发)**:
159 ```bash 159 ```bash
160 cd /data/saas-search 160 cd /data/saas-search
161 -.venv/bin/python scripts/perf_api_benchmark.py \ 161 +.venv/bin/python benchmarks/perf_api_benchmark.py \
162 --scenario backend_search,backend_suggest,embed_text,rerank \ 162 --scenario backend_search,backend_suggest,embed_text,rerank \
163 --concurrency-list 1,5,10,20 \ 163 --concurrency-list 1,5,10,20 \
164 --duration 20 \ 164 --duration 20 \
@@ -188,7 +188,7 @@ cd /data/saas-search
188 188
189 口径:query 固定 `wireless mouse`,每次请求 **386 docs**,句长 15–40 词随机(从 1000 词池采样);配置 `rerank_window=384`。复现命令: 189 口径:query 固定 `wireless mouse`,每次请求 **386 docs**,句长 15–40 词随机(从 1000 词池采样);配置 `rerank_window=384`。复现命令:
190 ```bash 190 ```bash
191 -.venv/bin/python scripts/perf_api_benchmark.py \ 191 +.venv/bin/python benchmarks/perf_api_benchmark.py \
192 --scenario rerank --duration 20 --concurrency-list 1,5,10,20 --timeout 60 \ 192 --scenario rerank --duration 20 --concurrency-list 1,5,10,20 --timeout 60 \
193 --rerank-dynamic-docs --rerank-doc-count 386 --rerank-vocab-size 1000 \ 193 --rerank-dynamic-docs --rerank-doc-count 386 --rerank-vocab-size 1000 \
194 --rerank-sentence-min-words 15 --rerank-sentence-max-words 40 \ 194 --rerank-sentence-min-words 15 --rerank-sentence-max-words 40 \
@@ -217,7 +217,7 @@ cd /data/saas-search
217 | 10 | 181 | 100% | 8.78 | 1129.23| 1295.88| 1330.96| 217 | 10 | 181 | 100% | 8.78 | 1129.23| 1295.88| 1330.96|
218 | 20 | 161 | 100% | 7.63 | 2594.00| 4706.44| 4783.05| 218 | 20 | 161 | 100% | 7.63 | 2594.00| 4706.44| 4783.05|
219 219
220 -**结论**:吞吐约 **8 rps** 平台化,延迟随并发上升明显,符合“检索 + 向量 + 重排”重链路特征。多租户补测(文档数 500–10000,见报告 §12)表明:文档数越大,RPS 下降、延迟升高;tenant 0(10000 doc)在并发 20 出现部分 ReadTimeout(成功率 59.02%),需注意 timeout 与容量规划;补测命令示例:`for t in 0 1 2 3 4; do .venv/bin/python scripts/perf_api_benchmark.py --scenario backend_search --concurrency-list 1,5,10,20 --duration 20 --tenant-id $t --output perf_reports/2026-03-12/search_tenant_matrix/tenant_${t}.json; done`。 220 +**结论**:吞吐约 **8 rps** 平台化,延迟随并发上升明显,符合“检索 + 向量 + 重排”重链路特征。多租户补测(文档数 500–10000,见报告 §12)表明:文档数越大,RPS 下降、延迟升高;tenant 0(10000 doc)在并发 20 出现部分 ReadTimeout(成功率 59.02%),需注意 timeout 与容量规划;补测命令示例:`for t in 0 1 2 3 4; do .venv/bin/python benchmarks/perf_api_benchmark.py --scenario backend_search --concurrency-list 1,5,10,20 --duration 20 --tenant-id $t --output perf_reports/2026-03-12/search_tenant_matrix/tenant_${t}.json; done`。
221 221
222 --- 222 ---
223 223
@@ -247,5 +247,5 @@ cd /data/saas-search
247 247
248 **关键文件与复现**: 248 **关键文件与复现**:
249 - 配置:`config/config.yaml`(services、rerank、query_config)、`.env`(端口与 API Key)。 249 - 配置:`config/config.yaml`(services、rerank、query_config)、`.env`(端口与 API Key)。
250 -- 脚本:`scripts/service_ctl.sh`(启停与监控)、`scripts/perf_api_benchmark.py`(压测)、`scripts/build_suggestions.sh`(suggest 构建)。 250 +- 脚本:`scripts/service_ctl.sh`(启停与监控)、`benchmarks/perf_api_benchmark.py`(压测)、`scripts/build_suggestions.sh`(suggest 构建)。
251 - 完整步骤与多租户/rerank 对比见:`docs/性能测试报告.md`。 251 - 完整步骤与多租户/rerank 对比见:`docs/性能测试报告.md`。
docs/常用查询 - ES.md
1 -  
2 -  
3 ## Elasticsearch 排查流程 1 ## Elasticsearch 排查流程
4 2
  3 +使用前加载环境变量:
  4 +```bash
  5 +set -a; source .env; set +a
  6 +# 或直接 export
  7 +export ES_AUTH="saas:4hOaLaf41y2VuI8y"
  8 +export ES="http://127.0.0.1:9200"
  9 +```
  10 +
5 ### 1. 集群健康状态 11 ### 1. 集群健康状态
6 12
7 ```bash 13 ```bash
8 # 集群整体健康(green / yellow / red) 14 # 集群整体健康(green / yellow / red)
9 -curl -s -u 'saas:4hOaLaf41y2VuI8y' 'http://127.0.0.1:9200/_cluster/health?pretty' 15 +curl -s -u "$ES_AUTH" 'http://127.0.0.1:9200/_cluster/health?pretty'
10 ``` 16 ```
11 17
12 ### 2. 索引概览 18 ### 2. 索引概览
13 19
14 ```bash 20 ```bash
15 # 查看所有租户索引状态与体积 21 # 查看所有租户索引状态与体积
16 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/_cat/indices/search_products_tenant_*?v' 22 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/_cat/indices/search_products_tenant_*?v'
17 23
18 # 或查看全部索引 24 # 或查看全部索引
19 -curl -s -u 'saas:4hOaLaf41y2VuI8y' 'http://127.0.0.1:9200/_cat/indices?v' 25 +curl -s -u "$ES_AUTH" 'http://127.0.0.1:9200/_cat/indices?v'
20 ``` 26 ```
21 27
22 ### 3. 分片分布 28 ### 3. 分片分布
23 29
24 ```bash 30 ```bash
25 # 查看分片在各节点的分布情况 31 # 查看分片在各节点的分布情况
26 -curl -s -u 'saas:4hOaLaf41y2VuI8y' 'http://127.0.0.1:9200/_cat/shards?v' 32 +curl -s -u "$ES_AUTH" 'http://127.0.0.1:9200/_cat/shards?v'
27 ``` 33 ```
28 34
29 ### 4. 分配诊断(如有异常) 35 ### 4. 分配诊断(如有异常)
30 36
31 ```bash 37 ```bash
32 # 当 health 非 green 或 shards 状态异常时,定位具体原因 38 # 当 health 非 green 或 shards 状态异常时,定位具体原因
33 -curl -s -u 'saas:4hOaLaf41y2VuI8y' -X POST 'http://127.0.0.1:9200/_cluster/allocation/explain?pretty' \ 39 +curl -s -u "$ES_AUTH" -X POST 'http://127.0.0.1:9200/_cluster/allocation/explain?pretty' \
34 -H 'Content-Type: application/json' \ 40 -H 'Content-Type: application/json' \
35 -d '{"index":"search_products_tenant_163","shard":0,"primary":true}' 41 -d '{"index":"search_products_tenant_163","shard":0,"primary":true}'
36 ``` 42 ```
@@ -60,6 +66,54 @@ cat /etc/elasticsearch/elasticsearch.yml
60 journalctl -u elasticsearch -f 66 journalctl -u elasticsearch -f
61 ``` 67 ```
62 68
  69 +### 7. 修改索引级设置(如 BM25 `similarity.default`)
  70 +
  71 +`mappings/search_products.json` 里的 `settings.similarity` 只在**创建新索引**时生效;**已有索引**需先关闭索引,再 `PUT _settings`,最后重新打开。
  72 +
  73 +**适用场景**:调整默认 BM25 的 `b`、`k1`(例如与仓库映射对齐:`b: 0.1`、`k1: 0.3`)。
  74 +
  75 +```bash
  76 +# 按需替换:索引名、账号密码、ES 地址
  77 +INDEX="search_products_tenant_163"
  78 +AUTH="$ES_AUTH"
  79 +ES="http://localhost:9200"
  80 +
  81 +# 1) 关闭索引(写入类请求会失败,注意维护窗口)
  82 +curl -s -u "$AUTH" -X POST "$ES/${INDEX}/_close"
  83 +
  84 +# 2) 更新设置(仅示例:与 mappings 中 default 一致时可照抄)
  85 +curl -s -u "$AUTH" -X PUT "$ES/${INDEX}/_settings" \
  86 + -H 'Content-Type: application/json' \
  87 + -d '{
  88 + "index": {
  89 + "similarity": {
  90 + "default": {
  91 + "type": "BM25",
  92 + "b": 0.1,
  93 + "k1": 0.3
  94 + }
  95 + }
  96 + }
  97 +}'
  98 +
  99 +# 3) 重新打开索引
  100 +curl -s -u "$AUTH" -X POST "$ES/${INDEX}/_open"
  101 +```
  102 +
  103 +**检查是否生效**:
  104 +
  105 +```bash
  106 +curl -s -u "$AUTH" -X GET "$ES/${INDEX}/_settings?filter_path=**.similarity&pretty"
  107 +```
  108 +
  109 +期望在响应中看到 `similarity.default` 的 `type`、`b`、`k1`(API 可能将数值以字符串形式返回,属正常)。
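参考:过滤后的响应通常形如下例(仅为示意;ES 常把数值以字符串返回):

```json
{
  "search_products_tenant_163": {
    "settings": {
      "index": {
        "similarity": {
          "default": {
            "type": "BM25",
            "b": "0.1",
            "k1": "0.3"
          }
        }
      }
    }
  }
}
```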
  110 +
  111 +**多租户批量**:先列出索引,再对每个 `search_products_tenant_*` 重复上述 close → settings → open。
  112 +
  113 +```bash
  114 +curl -s -u "$AUTH" -X GET "$ES/_cat/indices/search_products_tenant_*?h=index&v"
  115 +```
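基于上面列出的索引,可用如下循环批量执行(仅为示意:请在维护窗口执行,并逐个核对每次响应再继续):

```bash
for idx in $(curl -s -u "$AUTH" "$ES/_cat/indices/search_products_tenant_*?h=index"); do
  curl -s -u "$AUTH" -X POST "$ES/${idx}/_close"
  curl -s -u "$AUTH" -X PUT "$ES/${idx}/_settings" \
    -H 'Content-Type: application/json' \
    -d '{"index":{"similarity":{"default":{"type":"BM25","b":0.1,"k1":0.3}}}}'
  curl -s -u "$AUTH" -X POST "$ES/${idx}/_open"
done
```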
  116 +
63 --- 117 ---
64 118
65 ### 快速排查路径 119 ### 快速排查路径
@@ -93,7 +147,7 @@ systemctl / df / 日志 → 系统层验证
93 147
94 #### 查询指定 spu_id 的商品(返回 title) 148 #### 查询指定 spu_id 的商品(返回 title)
95 ```bash 149 ```bash
96 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ 150 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
97 "size": 11, 151 "size": 11,
98 "_source": ["title"], 152 "_source": ["title"],
99 "query": { 153 "query": {
@@ -108,7 +162,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
108 162
109 #### 查询所有商品(返回 title) 163 #### 查询所有商品(返回 title)
110 ```bash 164 ```bash
111 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ 165 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
112 "size": 100, 166 "size": 100,
113 "_source": ["title"], 167 "_source": ["title"],
114 "query": { 168 "query": {
@@ -119,7 +173,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
119 173
120 #### 查询指定 spu_id 的商品(返回 title、keywords、tags) 174 #### 查询指定 spu_id 的商品(返回 title、keywords、tags)
121 ```bash 175 ```bash
122 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ 176 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
123 "size": 5, 177 "size": 5,
124 "_source": ["title", "keywords", "tags"], 178 "_source": ["title", "keywords", "tags"],
125 "query": { 179 "query": {
@@ -134,7 +188,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
134 188
135 #### 组合查询:匹配标题 + 过滤标签 189 #### 组合查询:匹配标题 + 过滤标签
136 ```bash 190 ```bash
137 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ 191 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
138 "size": 1, 192 "size": 1,
139 "_source": ["title", "keywords", "tags"], 193 "_source": ["title", "keywords", "tags"],
140 "query": { 194 "query": {
@@ -158,7 +212,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
158 212
159 #### 组合查询:匹配标题 + 过滤租户(冗余示例) 213 #### 组合查询:匹配标题 + 过滤租户(冗余示例)
160 ```bash 214 ```bash
161 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ 215 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
162 "size": 1, 216 "size": 1,
163 "_source": ["title"], 217 "_source": ["title"],
164 "query": { 218 "query": {
@@ -186,7 +240,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
186 240
187 #### 测试 index_ik 分析器 241 #### 测试 index_ik 分析器
188 ```bash 242 ```bash
189 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_analyze' -H 'Content-Type: application/json' -d '{ 243 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_analyze' -H 'Content-Type: application/json' -d '{
190 "analyzer": "index_ik", 244 "analyzer": "index_ik",
191 "text": "14寸第4代-眼珠实身冰雪公仔带手动大推车,搪胶雪宝宝" 245 "text": "14寸第4代-眼珠实身冰雪公仔带手动大推车,搪胶雪宝宝"
192 }' 246 }'
@@ -194,7 +248,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
194 248
195 #### 测试 query_ik 分析器 249 #### 测试 query_ik 分析器
196 ```bash 250 ```bash
197 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_analyze' -H 'Content-Type: application/json' -d '{ 251 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_analyze' -H 'Content-Type: application/json' -d '{
198 "analyzer": "query_ik", 252 "analyzer": "query_ik",
199 "text": "14寸第4代-眼珠实身冰雪公仔带手动大推车,搪胶雪宝宝" 253 "text": "14寸第4代-眼珠实身冰雪公仔带手动大推车,搪胶雪宝宝"
200 }' 254 }'
@@ -206,7 +260,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
206 260
207 #### 多字段匹配 + 聚合(category1、color、size、material) 261 #### 多字段匹配 + 聚合(category1、color、size、material)
208 ```bash 262 ```bash
209 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ 263 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
210 "size": 1, 264 "size": 1,
211 "from": 0, 265 "from": 0,
212 "query": { 266 "query": {
@@ -316,7 +370,7 @@ GET /search_products_tenant_2/_search
316 370
317 #### 按 spu_id 查询(通用索引) 371 #### 按 spu_id 查询(通用索引)
318 ```bash 372 ```bash
319 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products/_search?pretty' -H 'Content-Type: application/json' -d '{ 373 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products/_search?pretty' -H 'Content-Type: application/json' -d '{
320 "size": 5, 374 "size": 5,
321 "query": { 375 "query": {
322 "bool": { 376 "bool": {
@@ -333,7 +387,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products/_s
333 ### 5. 统计租户总文档数 387 ### 5. 统计租户总文档数
334 388
335 ```bash 389 ```bash
336 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_count?pretty' -H 'Content-Type: application/json' -d '{ 390 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_count?pretty' -H 'Content-Type: application/json' -d '{
337 "query": { 391 "query": {
338 "match_all": {} 392 "match_all": {}
339 } 393 }
@@ -348,7 +402,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
348 402
349 #### 1.1 查询特定租户的商品,显示分面相关字段 403 #### 1.1 查询特定租户的商品,显示分面相关字段
350 ```bash 404 ```bash
351 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ 405 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
352 "query": { 406 "query": {
353 "term": { "tenant_id": "162" } 407 "term": { "tenant_id": "162" }
354 }, 408 },
@@ -363,7 +417,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
363 417
364 #### 1.2 验证 category1_name 字段是否有数据 418 #### 1.2 验证 category1_name 字段是否有数据
365 ```bash 419 ```bash
366 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ 420 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
367 "query": { 421 "query": {
368 "bool": { 422 "bool": {
369 "filter": [ 423 "filter": [
@@ -378,7 +432,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
378 432
379 #### 1.3 验证 specifications 字段是否有数据 433 #### 1.3 验证 specifications 字段是否有数据
380 ```bash 434 ```bash
381 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ 435 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
382 "query": { 436 "query": {
383 "bool": { 437 "bool": {
384 "filter": [ 438 "filter": [
@@ -397,7 +451,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
397 451
398 #### 2.1 category1_name 分面聚合 452 #### 2.1 category1_name 分面聚合
399 ```bash 453 ```bash
400 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ 454 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
401 "query": { "match_all": {} }, 455 "query": { "match_all": {} },
402 "size": 0, 456 "size": 0,
403 "aggs": { 457 "aggs": {
@@ -410,7 +464,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
410 464
411 #### 2.2 specifications.color 分面聚合 465 #### 2.2 specifications.color 分面聚合
412 ```bash 466 ```bash
413 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ 467 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
414 "query": { "match_all": {} }, 468 "query": { "match_all": {} },
415 "size": 0, 469 "size": 0,
416 "aggs": { 470 "aggs": {
@@ -431,7 +485,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
431 485
432 #### 2.3 specifications.size 分面聚合 486 #### 2.3 specifications.size 分面聚合
433 ```bash 487 ```bash
434 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ 488 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
435 "query": { "match_all": {} }, 489 "query": { "match_all": {} },
436 "size": 0, 490 "size": 0,
437 "aggs": { 491 "aggs": {
@@ -452,7 +506,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
452 506
453 #### 2.4 specifications.material 分面聚合 507 #### 2.4 specifications.material 分面聚合
454 ```bash 508 ```bash
455 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ 509 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
456 "query": { "match_all": {} }, 510 "query": { "match_all": {} },
457 "size": 0, 511 "size": 0,
458 "aggs": { 512 "aggs": {
@@ -473,7 +527,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
473 527
474 #### 2.5 综合分面聚合(category + color + size + material) 528 #### 2.5 综合分面聚合(category + color + size + material)
475 ```bash 529 ```bash
476 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ 530 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
477 "query": { "match_all": {} }, 531 "query": { "match_all": {} },
478 "size": 0, 532 "size": 0,
479 "aggs": { 533 "aggs": {
@@ -515,7 +569,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
515 569
516 #### 3.1 查看 specifications 的 name 字段有哪些值 570 #### 3.1 查看 specifications 的 name 字段有哪些值
517 ```bash 571 ```bash
518 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products/_search?pretty' -H 'Content-Type: application/json' -d '{ 572 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products/_search?pretty' -H 'Content-Type: application/json' -d '{
519 "query": { "term": { "tenant_id": "162" } }, 573 "query": { "term": { "tenant_id": "162" } },
520 "size": 0, 574 "size": 0,
521 "aggs": { 575 "aggs": {
@@ -531,7 +585,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products/_s
531 585
532 #### 3.2 查看某个商品的完整 specifications 数据 586 #### 3.2 查看某个商品的完整 specifications 数据
533 ```bash 587 ```bash
534 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products/_search?pretty' -H 'Content-Type: application/json' -d '{ 588 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products/_search?pretty' -H 'Content-Type: application/json' -d '{
535 "query": { 589 "query": {
536 "bool": { 590 "bool": {
537 "filter": [ 591 "filter": [
@@ -552,7 +606,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products/_s
552 **keyword 精确匹配**(示例词:中文 `法式风格`,英文 `long skirt`) 606 **keyword 精确匹配**(示例词:中文 `法式风格`,英文 `long skirt`)
553 607
554 ```bash 608 ```bash
555 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ 609 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
556 "size": 1, 610 "size": 1,
557 "_source": ["spu_id", "title", "enriched_attributes"], 611 "_source": ["spu_id", "title", "enriched_attributes"],
558 "query": { 612 "query": {
@@ -575,7 +629,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
575 **text 全文匹配**(经 `index_ik` / `english` 分词;可与上式对照) 629 **text 全文匹配**(经 `index_ik` / `english` 分词;可与上式对照)
576 630
577 ```bash 631 ```bash
578 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ 632 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
579 "size": 1, 633 "size": 1,
580 "_source": ["spu_id", "title", "enriched_attributes"], 634 "_source": ["spu_id", "title", "enriched_attributes"],
581 "query": { 635 "query": {
@@ -602,7 +656,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
602 **keyword 精确匹配** 656 **keyword 精确匹配**
603 657
604 ```bash 658 ```bash
605 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ 659 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
606 "size": 1, 660 "size": 1,
607 "_source": ["spu_id", "title", "option1_values"], 661 "_source": ["spu_id", "title", "option1_values"],
608 "query": { 662 "query": {
@@ -620,7 +674,7 @@ curl -u &#39;saas:4hOaLaf41y2VuI8y&#39; -X GET &#39;http://localhost:9200/search_products_te @@ -620,7 +674,7 @@ curl -u &#39;saas:4hOaLaf41y2VuI8y&#39; -X GET &#39;http://localhost:9200/search_products_te
620 **text 全文匹配** 674 **text 全文匹配**
621 675
622 ```bash 676 ```bash
623 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ 677 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
624 "size": 1, 678 "size": 1,
625 "_source": ["spu_id", "title", "option1_values"], 679 "_source": ["spu_id", "title", "option1_values"],
626 "query": { 680 "query": {
@@ -640,7 +694,7 @@ curl -u &#39;saas:4hOaLaf41y2VuI8y&#39; -X GET &#39;http://localhost:9200/search_products_te @@ -640,7 +694,7 @@ curl -u &#39;saas:4hOaLaf41y2VuI8y&#39; -X GET &#39;http://localhost:9200/search_products_te
640 **keyword 精确匹配** 694 **keyword 精确匹配**
641 695
642 ```bash 696 ```bash
643 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ 697 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
644 "size": 1, 698 "size": 1,
645 "_source": ["spu_id", "title", "enriched_tags"], 699 "_source": ["spu_id", "title", "enriched_tags"],
646 "query": { 700 "query": {
@@ -658,7 +712,7 @@ curl -u &#39;saas:4hOaLaf41y2VuI8y&#39; -X GET &#39;http://localhost:9200/search_products_te @@ -658,7 +712,7 @@ curl -u &#39;saas:4hOaLaf41y2VuI8y&#39; -X GET &#39;http://localhost:9200/search_products_te
658 **text 全文匹配** 712 **text 全文匹配**
659 713
660 ```bash 714 ```bash
661 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ 715 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
662 "size": 1, 716 "size": 1,
663 "_source": ["spu_id", "title", "enriched_tags"], 717 "_source": ["spu_id", "title", "enriched_tags"],
664 "query": { 718 "query": {
@@ -678,7 +732,7 @@ curl -u &#39;saas:4hOaLaf41y2VuI8y&#39; -X GET &#39;http://localhost:9200/search_products_te @@ -678,7 +732,7 @@ curl -u &#39;saas:4hOaLaf41y2VuI8y&#39; -X GET &#39;http://localhost:9200/search_products_te
678 > `specifications` 为 **nested**,`value_keyword` 为整词匹配;`value_text.*` 可同时 `term` 子字段或 `match` 主 text。 732 > `specifications` 为 **nested**,`value_keyword` 为整词匹配;`value_text.*` 可同时 `term` 子字段或 `match` 主 text。
679 733
680 ```bash 734 ```bash
681 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ 735 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
682 "size": 1, 736 "size": 1,
683 "_source": ["spu_id", "title", "specifications"], 737 "_source": ["spu_id", "title", "specifications"],
684 "query": { 738 "query": {
@@ -710,7 +764,7 @@ curl -u &#39;saas:4hOaLaf41y2VuI8y&#39; -X GET &#39;http://localhost:9200/search_products_te @@ -710,7 +764,7 @@ curl -u &#39;saas:4hOaLaf41y2VuI8y&#39; -X GET &#39;http://localhost:9200/search_products_te
710 764
711 #### 4.1 统计有 category1_name 的文档数量 765 #### 4.1 统计有 category1_name 的文档数量
712 ```bash 766 ```bash
713 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_count?pretty' -H 'Content-Type: application/json' -d '{ 767 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_count?pretty' -H 'Content-Type: application/json' -d '{
714 "query": { 768 "query": {
715 "bool": { 769 "bool": {
716 "filter": [ 770 "filter": [
@@ -723,7 +777,7 @@ curl -u &#39;saas:4hOaLaf41y2VuI8y&#39; -X GET &#39;http://localhost:9200/search_products_te @@ -723,7 +777,7 @@ curl -u &#39;saas:4hOaLaf41y2VuI8y&#39; -X GET &#39;http://localhost:9200/search_products_te
723 777
724 #### 4.2 统计有 specifications 的文档数量 778 #### 4.2 统计有 specifications 的文档数量
725 ```bash 779 ```bash
726 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_count?pretty' -H 'Content-Type: application/json' -d '{ 780 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_count?pretty' -H 'Content-Type: application/json' -d '{
727 "query": { 781 "query": {
728 "bool": { 782 "bool": {
729 "filter": [ 783 "filter": [
@@ -740,7 +794,7 @@ curl -u &#39;saas:4hOaLaf41y2VuI8y&#39; -X GET &#39;http://localhost:9200/search_products_te @@ -740,7 +794,7 @@ curl -u &#39;saas:4hOaLaf41y2VuI8y&#39; -X GET &#39;http://localhost:9200/search_products_te
740 794
741 #### 5.1 查找没有 category1_name 但有 category 的文档(MySQL 有数据但 ES 没有) 795 #### 5.1 查找没有 category1_name 但有 category 的文档(MySQL 有数据但 ES 没有)
742 ```bash 796 ```bash
743 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ 797 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
744 "query": { 798 "query": {
745 "bool": { 799 "bool": {
746 "filter": [ 800 "filter": [
@@ -758,7 +812,7 @@ curl -u &#39;saas:4hOaLaf41y2VuI8y&#39; -X GET &#39;http://localhost:9200/search_products_te @@ -758,7 +812,7 @@ curl -u &#39;saas:4hOaLaf41y2VuI8y&#39; -X GET &#39;http://localhost:9200/search_products_te
758 812
759 #### 5.2 查找有 option 但没有 specifications 的文档(数据转换问题) 813 #### 5.2 查找有 option 但没有 specifications 的文档(数据转换问题)
760 ```bash 814 ```bash
761 -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ 815 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
762 "query": { 816 "query": {
763 "bool": { 817 "bool": {
764 "filter": [ 818 "filter": [
@@ -814,7 +868,7 @@ GET search_products_tenant_163/_mapping @@ -814,7 +868,7 @@ GET search_products_tenant_163/_mapping
814 GET search_products_tenant_163/_field_caps?fields=* 868 GET search_products_tenant_163/_field_caps?fields=*
815 869
816 ```bash 870 ```bash
817 -curl -u 'saas:4hOaLaf41y2VuI8y' -X POST \ 871 +curl -u "$ES_AUTH" -X POST \
818 'http://localhost:9200/search_products_tenant_163/_count' \ 872 'http://localhost:9200/search_products_tenant_163/_count' \
819 -H 'Content-Type: application/json' \ 873 -H 'Content-Type: application/json' \
820 -d '{ 874 -d '{
@@ -827,7 +881,7 @@ curl -u &#39;saas:4hOaLaf41y2VuI8y&#39; -X POST \ @@ -827,7 +881,7 @@ curl -u &#39;saas:4hOaLaf41y2VuI8y&#39; -X POST \
827 } 881 }
828 }' 882 }'
829 883
830 -curl -u 'saas:4hOaLaf41y2VuI8y' -X POST \ 884 +curl -u "$ES_AUTH" -X POST \
831 'http://localhost:9200/search_products_tenant_163/_count' \ 885 'http://localhost:9200/search_products_tenant_163/_count' \
832 -H 'Content-Type: application/json' \ 886 -H 'Content-Type: application/json' \
833 -d '{ 887 -d '{
docs/性能测试报告.md
@@ -18,13 +18,13 @@
 
 执行方式:
 - 每组压测持续 `20s`
-- 使用统一脚本 `scripts/perf_api_benchmark.py`
+- 使用统一脚本 `benchmarks/perf_api_benchmark.py`
 - 通过 `--scenario` 多值 + `--concurrency-list` 一次性跑完 `场景 x 并发`
 
 ## 3. 压测工具优化说明(复用现有脚本)
 
 为了解决原脚本“一次只能跑一个场景+一个并发”的可用性问题,本次直接扩展现有脚本:
-- `scripts/perf_api_benchmark.py`
+- `benchmarks/perf_api_benchmark.py`
 
 能力:
 - 一条命令执行 `场景列表 x 并发列表` 全矩阵
@@ -33,7 +33,7 @@
 示例:
 
 ```bash
-.venv/bin/python scripts/perf_api_benchmark.py \
+.venv/bin/python benchmarks/perf_api_benchmark.py \
 --scenario backend_search,backend_suggest,embed_text,rerank \
 --concurrency-list 1,5,10,20 \
 --duration 20 \
@@ -106,7 +106,7 @@ curl -sS http://127.0.0.1:6007/health
 
 ```bash
 cd /data/saas-search
-.venv/bin/python scripts/perf_api_benchmark.py \
+.venv/bin/python benchmarks/perf_api_benchmark.py \
 --scenario backend_search,backend_suggest,embed_text,rerank \
 --concurrency-list 1,5,10,20 \
 --duration 20 \
@@ -164,7 +164,7 @@ cd /data/saas-search
 复现命令:
 
 ```bash
-.venv/bin/python scripts/perf_api_benchmark.py \
+.venv/bin/python benchmarks/perf_api_benchmark.py \
 --scenario rerank \
 --duration 20 \
 --concurrency-list 1,5,10,20 \
@@ -237,7 +237,7 @@ cd /data/saas-search
 - 使用项目虚拟环境执行:
 
 ```bash
-.venv/bin/python scripts/perf_api_benchmark.py -h
+.venv/bin/python benchmarks/perf_api_benchmark.py -h
 ```
 
 ### 10.3 某场景成功率下降
@@ -249,7 +249,7 @@ cd /data/saas-search
 
 ## 11. 关联文件
 
-- 压测脚本:`scripts/perf_api_benchmark.py`
+- 压测脚本:`benchmarks/perf_api_benchmark.py`
 - 本次结果:`perf_reports/2026-03-12/perf_matrix_report.json`
 - Search 多租户补测:`perf_reports/2026-03-12/search_tenant_matrix/`
 - Reranker 386 docs 口径补测:`perf_reports/2026-03-12/rerank_realistic/rerank_386docs.json`
@@ -280,7 +280,7 @@ cd /data/saas-search
 cd /data/saas-search
 mkdir -p perf_reports/2026-03-12/search_tenant_matrix
 for t in 0 1 2 3 4; do
-  .venv/bin/python scripts/perf_api_benchmark.py \
+  .venv/bin/python benchmarks/perf_api_benchmark.py \
 --scenario backend_search \
 --concurrency-list 1,5,10,20 \
 --duration 20 \
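The benchmark script documented above runs the full `场景 x 并发` matrix from `--scenario` and `--concurrency-list`. As a hedged sketch of that expansion (the function name `expand_matrix` is illustrative, not the script's actual internals):

```python
from itertools import product


def expand_matrix(scenarios: str, concurrency_list: str) -> list[tuple[str, int]]:
    """Expand comma-separated --scenario and --concurrency-list values
    into the full (scenario, concurrency) run matrix."""
    names = [s.strip() for s in scenarios.split(",") if s.strip()]
    levels = [int(c) for c in concurrency_list.split(",") if c.strip()]
    return list(product(names, levels))


runs = expand_matrix("backend_search,backend_suggest", "1,5,10,20")
print(len(runs))  # 2 scenarios x 4 concurrency levels = 8 runs
```

Each `(scenario, concurrency)` pair then corresponds to one timed run of the configured `--duration`.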
docs/搜索API对接指南-00-总览与快速开始.md
@@ -90,7 +90,7 @@ curl -X POST "http://43.166.252.75:6002/search/" \
 | 查询文档 | POST | `/indexer/documents` | 查询SPU文档数据(不写入ES) |
 | 构建ES文档(正式对接) | POST | `/indexer/build-docs` | 基于上游提供的 MySQL 行数据构建 ES doc,不写入 ES,供 Java 等调用后自行写入 |
 | 构建ES文档(测试用) | POST | `/indexer/build-docs-from-db` | 仅在测试/调试时使用,根据 `tenant_id + spu_ids` 内部查库并构建 ES doc |
-| 内容理解字段生成 | POST | `/indexer/enrich-content` | 根据商品标题批量生成 qanchors、enriched_attributes、tags,供微服务组合方式使用 |
+| 内容理解字段生成 | POST | `/indexer/enrich-content` | 根据商品标题批量生成 qanchors、enriched_attributes、tags,供微服务组合方式使用(独立服务端口 6001) |
 | 索引健康检查 | GET | `/indexer/health` | 检查索引服务状态 |
 | 健康检查 | GET | `/admin/health` | 服务健康检查 |
 | 获取配置 | GET | `/admin/config` | 获取租户配置 |
@@ -104,7 +104,6 @@ curl -X POST "http://43.166.252.75:6002/search/" \
 | 向量服务(图片) | 6008 | `POST /embed/image` | 图片向量化 |
 | 翻译服务 | 6006 | `POST /translate` | 文本翻译(支持 qwen-mt / llm / deepl / 本地模型) |
 | 重排服务 | 6007 | `POST /rerank` | 检索结果重排 |
-| 内容理解(Indexer 内) | 6004 | `POST /indexer/enrich-content` | 根据商品标题生成 qanchors、tags 等,供 indexer 微服务组合方式使用 |
+| 内容理解(独立服务) | 6001 | `POST /indexer/enrich-content` | 根据商品标题生成 qanchors、tags 等,供 indexer 微服务组合方式使用 |
 
 ---
-
docs/搜索API对接指南-05-索引接口(Indexer).md
@@ -13,7 +13,7 @@
 | 查询文档 | POST | `/indexer/documents` | 按 SPU ID 列表查询 ES 文档,不写入 ES |
 | 构建 ES 文档(正式) | POST | `/indexer/build-docs` | 由上游提供 MySQL 行数据,返回 ES-ready 文档,不写 ES |
 | 构建 ES 文档(测试) | POST | `/indexer/build-docs-from-db` | 由本服务查库并构建文档,仅测试/调试用 |
-| 内容理解字段生成 | POST | `/indexer/enrich-content` | 根据商品标题批量生成 qanchors、enriched_attributes、tags(供微服务组合方式使用) |
+| 内容理解字段生成 | POST | `/indexer/enrich-content` | 根据商品标题批量生成 qanchors、enriched_attributes、tags(供微服务组合方式使用;独立服务端口 6001) |
 | 索引健康检查 | GET | `/indexer/health` | 检查索引服务与数据库连接状态 |
 
 #### 5.0 支撑外部 indexer 的三种方式
@@ -23,7 +23,7 @@
 | 方式 | 说明 | 适用场景 |
 |------|------|----------|
 | **1)doc 填充接口** | 调用 `POST /indexer/build-docs` 或 `POST /indexer/build-docs-from-db`,由本服务基于 MySQL 行数据构建完整 ES 文档(含多语言、向量、规格等),**不写入 ES**,由调用方自行写入。 | 希望一站式拿到 ES-ready doc,由己方控制写 ES 的时机与索引名。 |
-| **2)微服务组合** | 单独调用**翻译**、**向量化**、**内容理解字段生成**等接口,由 indexer 程序自己组装 doc 并写入 ES。翻译与向量化为独立微服务(见第 7 节);内容理解为 Indexer 服务内接口 `POST /indexer/enrich-content`。 | 需要灵活编排、或希望将 LLM/向量等耗时步骤与主链路解耦(如异步补齐 qanchors/tags)。 |
+| **2)微服务组合** | 单独调用**翻译**、**向量化**、**内容理解字段生成**等接口,由 indexer 程序自己组装 doc 并写入 ES。翻译与向量化为独立微服务(见第 7 节);内容理解为 FacetAwareMatching 独立服务接口 `POST /indexer/enrich-content`(端口 6001)。 | 需要灵活编排、或希望将 LLM/向量等耗时步骤与主链路解耦(如异步补齐 qanchors/tags)。 |
 | **3)本服务直接写 ES** | 调用全量索引 `POST /indexer/reindex`、增量索引 `POST /indexer/index`(指定 SPU ID 列表),由本服务从 MySQL 拉数并直接写入 ES。 | 自建运维、联调或不需要由 Java 写 ES 的场景。 |
 
 - **方式 1** 与 **方式 2** 下,ES 的写入方均为外部 indexer(或 Java),职责清晰。
@@ -498,7 +498,7 @@ curl -X GET "http://localhost:6004/indexer/health"
 
 #### 请求示例(完整 curl)
 
-> 完整请求体参考 `scripts/test_build_docs_api.py` 中的 `build_sample_request()`。
+> 完整请求体参考 `tests/manual/test_build_docs_api.py` 中的 `build_sample_request()`。
 
 ```bash
 # 单条 SPU 示例(含 spu、skus、options)
@@ -648,13 +648,38 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \
 ### 5.8 内容理解字段生成接口
 
 - **端点**: `POST /indexer/enrich-content`
-- **描述**: 根据商品内容信息批量生成 **qanchors**(锚文本)、**enriched_attributes**(语义属性)、**enriched_tags**(细分标签),供外部 indexer 在「微服务组合」方式下自行拼装 doc 时使用。请求以 `items[]` 传入商品内容字段(必填/可选见下表)。接口只暴露商品内容输入,语言选择、分析维度与最终字段结构统一由 `indexer.product_enrich` 内部决定;当前返回结果与 `search_products` mapping 保持一致。单次请求在线程池中执行,避免阻塞其他接口。
+- **服务**: FacetAwareMatching 独立服务(默认端口 **6001**;由 `/data/FacetAwareMatching/scripts/service_ctl.sh` 管理)
+- **描述**: 根据商品内容信息批量生成 **qanchors**(锚文本)、**enriched_attributes**(通用语义属性)、**enriched_tags**(细分标签)、**enriched_taxonomy_attributes**(taxonomy 结构化属性),供外部 indexer 在「微服务组合」方式下自行拼装 doc 时使用。请求以 `items[]` 传入商品内容字段(必填/可选见下表)。接口只暴露商品内容输入,语言选择、分析维度与最终字段结构统一由 FacetAwareMatching 的 `product_enrich` 内部决定;当前返回结果与 `search_products` mapping 保持一致。单次请求在线程池中执行,避免阻塞其他接口。
+
+当前支持的 `category_taxonomy_profile`:
+- `apparel`
+- `3c`
+- `bags`
+- `pet_supplies`
+- `electronics`
+- `outdoor`
+- `home_appliances`
+- `home_living`
+- `wigs`
+- `beauty`
+- `accessories`
+- `toys`
+- `shoes`
+- `sports`
+- `others`
+
+说明:
+- 所有 profile 的 `enriched_taxonomy_attributes.value` 都统一返回 `zh` + `en`。
+- 外部调用 `/indexer/enrich-content` 时,以请求中的 `category_taxonomy_profile` 为准。
+- 若 indexer 内部仍接入内容理解能力,taxonomy profile 请在调用侧显式传入(建议仍以租户行业配置为准)。
 
 #### 请求参数
 
 ```json
 {
   "tenant_id": "170",
+  "enrichment_scopes": ["generic", "category_taxonomy"],
+  "category_taxonomy_profile": "apparel",
   "items": [
     {
       "spu_id": "223167",
@@ -675,6 +700,8 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \
 | 参数 | 类型 | 必填 | 默认值 | 说明 |
 |------|------|------|--------|------|
 | `tenant_id` | string | Y | - | 租户 ID。目前仅用于记录日志,不产生实际作用 |
+| `enrichment_scopes` | array[string] | N | `["generic", "category_taxonomy"]` | 选择要执行的增强范围。`generic` 生成 `qanchors`/`enriched_tags`/`enriched_attributes`,`category_taxonomy` 生成 `enriched_taxonomy_attributes` |
+| `category_taxonomy_profile` | string | N | `apparel` | 品类 taxonomy profile。支持:`apparel`、`3c`、`bags`、`pet_supplies`、`electronics`、`outdoor`、`home_appliances`、`home_living`、`wigs`、`beauty`、`accessories`、`toys`、`shoes`、`sports`、`others` |
 | `items` | array | Y | - | 待分析列表;**单次最多 50 条** |
 
 `items[]` 字段说明:
@@ -683,21 +710,24 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \
 |------|------|------|------|
 | `spu_id` | string | Y | SPU ID,用于回填结果;目前仅用于记录日志,不产生实际作用 |
 | `title` | string | Y | 商品标题 |
-| `image_url` | string | N | 商品主图 URL;当前会参与内容缓存键,后续可用于图像/多模态内容理解 |
-| `brief` | string | N | 商品简介/短描述;当前会参与内容缓存键 |
-| `description` | string | N | 商品详情/长描述;当前会参与内容缓存键 |
+| `image_url` | string | N | 商品主图 URL;当前仅透传,暂未参与 prompt 与缓存键,后续可用于图像/多模态内容理解 |
+| `brief` | string | N | 商品简介/短描述;当前会参与 prompt 与缓存键 |
+| `description` | string | N | 商品详情/长描述;当前会参与 prompt 与缓存键 |
 
 缓存说明:
 
-- 内容缓存键仅由 `target_lang + items[]` 中会影响内容理解结果的输入文本构成,目前包括:`title`、`brief`、`description`、`image_url` 的规范化内容 hash。
+- 内容缓存按 **增强范围 + taxonomy profile** 拆分;`generic` 与 `category_taxonomy:apparel` 等使用不同缓存命名空间,互不污染、可独立演进。
+- 缓存键由 `analysis_kind + target_lang + prompt/schema 版本指纹 + prompt 输入文本 hash` 构成;对 category taxonomy 来说,profile 会进入 schema 标识与版本指纹。
+- 当前真正参与 prompt 输入的字段是:`title`、`brief`、`description`;这些字段任一变化,都会落到新的缓存 key。
+- `prompt/schema 版本指纹` 会综合 system prompt、shared instruction、localized table headers、result fields、user instruction template 等信息生成;因此只要提示词或输出契约变化,旧缓存会自然失效。
 - `tenant_id`、`spu_id` 只用于请求归属与结果回填,不参与缓存键。
-- 因此,输入内容不变时可跨请求直接命中缓存;任一输入字段变化时,会自然落到新的缓存 key。
+- 因此,输入内容与 prompt 契约都不变时可跨请求直接命中缓存;任一侧变化,都会自然落到新的缓存 key。
 
 语言说明:
 
 - 接口不接受语言控制参数。
 - 返回哪些语言、返回哪些语义维度,统一由 `indexer.product_enrich` 内部逻辑决定。
-- 当前为了与 `search_products` mapping 对齐,返回结果只包含核心索引语言 `zh`、`en`。
+- 当前为了与 `search_products` mapping 对齐,通用增强字段与 taxonomy 字段都统一只返回核心索引语言 `zh`、`en`。
 
 批量请求建议:
 - **全量**:强烈建议尽可能 **20 个 SPU/doc** 攒成一个批次后再请求一次。
@@ -709,6 +739,8 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \
 ```json
 {
   "tenant_id": "170",
+  "enrichment_scopes": ["generic", "category_taxonomy"],
+  "category_taxonomy_profile": "apparel",
   "total": 2,
   "results": [
     {
@@ -725,6 +757,11 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \
       { "name": "enriched_tags", "value": { "zh": "纯棉" } },
       { "name": "usage_scene", "value": { "zh": "日常" } },
       { "name": "enriched_tags", "value": { "en": "cotton" } }
+      ],
+      "enriched_taxonomy_attributes": [
+        { "name": "Product Type", "value": { "zh": ["T恤"], "en": ["t-shirt"] } },
+        { "name": "Target Gender", "value": { "zh": ["男"], "en": ["men"] } },
+        { "name": "Season", "value": { "zh": ["夏季"], "en": ["summer"] } }
       ]
     },
     {
@@ -735,7 +772,8 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \
       "enriched_tags": {
         "en": ["dolls", "toys"]
       },
-      "enriched_attributes": []
+      "enriched_attributes": [],
+      "enriched_taxonomy_attributes": []
     }
   ]
 }
@@ -743,10 +781,13 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \
 
 | 字段 | 类型 | 说明 |
 |------|------|------|
-| `results` | array | 与请求 `items` 一一对应,每项含 `spu_id`、`qanchors`、`enriched_attributes`、`enriched_tags` |
+| `enrichment_scopes` | array | 实际执行的增强范围列表 |
+| `category_taxonomy_profile` | string | 实际使用的品类 taxonomy profile |
+| `results` | array | 与请求 `items` 一一对应,每项含 `spu_id`、`qanchors`、`enriched_attributes`、`enriched_tags`、`enriched_taxonomy_attributes` |
 | `results[].qanchors` | object | 与 ES `qanchors` 字段同结构,按语言键返回短语数组 |
 | `results[].enriched_tags` | object | 与 ES `enriched_tags` 字段同结构,按语言键返回标签数组 |
 | `results[].enriched_attributes` | array | 与 ES `enriched_attributes` nested 字段同结构,每项为 `{ "name", "value": { "zh"?: "...", "en"?: "..." } }` |
+| `results[].enriched_taxonomy_attributes` | array | 与 ES `enriched_taxonomy_attributes` nested 字段同结构。每项通常为 `{ "name", "value": { "zh"?: [...], "en"?: [...] } }` |
 | `results[].error` | string | 若该条处理失败(如 LLM 异常),会在此字段返回错误信息 |
 
 **错误响应**:
@@ -756,10 +797,12 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \
 #### 请求示例
 
 ```bash
-curl -X POST "http://localhost:6004/indexer/enrich-content" \
+curl -X POST "http://localhost:6001/indexer/enrich-content" \
 -H "Content-Type: application/json" \
 -d '{
-  "tenant_id": "170",
+  "tenant_id": "163",
+  "enrichment_scopes": ["generic", "category_taxonomy"],
+  "category_taxonomy_profile": "apparel",
   "items": [
     {
       "spu_id": "223167",
@@ -773,4 +816,3 @@ curl -X POST "http://localhost:6004/indexer/enrich-content" \
 ```
 
 ---
-
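The updated `/indexer/enrich-content` section caps `items[]` at 50 per request and recommends batching roughly 20 SPUs per call for full runs. A minimal client-side batching sketch (the helper name `batch_items` and the constants are illustrative, not part of the service contract):

```python
from typing import Any

MAX_ITEMS_PER_REQUEST = 50  # documented hard cap for items[] per request
RECOMMENDED_BATCH = 20      # suggested batch size for full reindex runs


def batch_items(
    items: list[dict[str, Any]],
    batch_size: int = RECOMMENDED_BATCH,
) -> list[list[dict[str, Any]]]:
    """Split enrich-content items into request-sized batches."""
    if not 0 < batch_size <= MAX_ITEMS_PER_REQUEST:
        raise ValueError(f"batch_size must be in 1..{MAX_ITEMS_PER_REQUEST}")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


items = [{"spu_id": str(i), "title": f"item {i}"} for i in range(45)]
print([len(b) for b in batch_items(items)])  # [20, 20, 5]
```

Each batch would then be posted as one request body together with `tenant_id`, `enrichment_scopes`, and `category_taxonomy_profile`.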
docs/搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md
@@ -444,7 +444,7 @@ curl "http://localhost:6006/health"
 
 - **Base URL**: Indexer 服务地址,如 `http://localhost:6004`
 - **路径**: `POST /indexer/enrich-content`
-- **说明**: 根据商品标题批量生成 `qanchors`、`enriched_attributes`、`tags`,用于拼装 ES 文档。内部使用大模型(需配置 `DASHSCOPE_API_KEY`),支持多语言与 Redis 缓存;单次最多 50 条,建议批量调用以提升效率。
+- **说明**: 根据商品标题批量生成 `qanchors`、`enriched_attributes`、`enriched_tags`、`enriched_taxonomy_attributes`,用于拼装 ES 文档。支持通过 `enrichment_scopes` 选择执行 `generic` / `category_taxonomy`,并通过 `category_taxonomy_profile` 选择对应大类的 taxonomy prompt/profile;默认执行 `generic + category_taxonomy(apparel)`。当前支持的 taxonomy profile 包括 `apparel`、`3c`、`bags`、`pet_supplies`、`electronics`、`outdoor`、`home_appliances`、`home_living`、`wigs`、`beauty`、`accessories`、`toys`、`shoes`、`sports`、`others`。所有 profile 的 taxonomy 输出都统一返回 `zh` + `en`,`category_taxonomy_profile` 只决定字段集合。内部使用大模型(需配置 `DASHSCOPE_API_KEY`),支持多语言与 Redis 缓存;单次最多 50 条,建议批量调用以提升效率。
 
 请求/响应格式、示例及错误码见 [-05-索引接口(Indexer)](./搜索API对接指南-05-索引接口(Indexer).md#58-内容理解字段生成接口)。
 
docs/搜索API对接指南-10-接口级压测脚本.md
@@ -4,7 +4,7 @@
 
 ## 10. 接口级压测脚本
 
-仓库提供统一压测脚本:`scripts/perf_api_benchmark.py`,用于对以下接口做并发压测:
+仓库提供统一压测脚本:`benchmarks/perf_api_benchmark.py`,用于对以下接口做并发压测:
 
 - 后端搜索:`POST /search/`
 - 搜索建议:`GET /search/suggestions`
@@ -18,21 +18,21 @@
 
 ```bash
 # suggest 压测(tenant 162)
-python scripts/perf_api_benchmark.py \
+python benchmarks/perf_api_benchmark.py \
 --scenario backend_suggest \
 --tenant-id 162 \
 --duration 30 \
 --concurrency 50
 
 # search 压测
-python scripts/perf_api_benchmark.py \
+python benchmarks/perf_api_benchmark.py \
 --scenario backend_search \
 --tenant-id 162 \
 --duration 30 \
 --concurrency 20
 
 # 全链路压测(search + suggest + embedding + translate + rerank)
-python scripts/perf_api_benchmark.py \
+python benchmarks/perf_api_benchmark.py \
 --scenario all \
 --tenant-id 162 \
 --duration 60 \
@@ -45,17 +45,16 @@ python scripts/perf_api_benchmark.py \
 可通过 `--cases-file` 覆盖默认请求模板。示例文件:
 
 ```bash
-scripts/perf_cases.json.example
+benchmarks/perf_cases.json.example
 ```
 
 执行示例:
 
 ```bash
-python scripts/perf_api_benchmark.py \
+python benchmarks/perf_api_benchmark.py \
 --scenario all \
 --tenant-id 162 \
-  --cases-file scripts/perf_cases.json.example \
+  --cases-file benchmarks/perf_cases.json.example \
 --duration 60 \
 --concurrency 40
 ```
-
docs/相关性检索优化说明.md
@@ -330,7 +330,7 @@ python -m pytest -q tests/test_rerank_client.py tests/test_es_query_builder.py t
 ./scripts/service_ctl.sh restart backend
 sleep 3
 ./scripts/service_ctl.sh status backend
-./scripts/evaluation/start_eval.sh.sh batch
+./scripts/evaluation/start_eval.sh batch
 ```
 
 评估产物在 `artifacts/search_evaluation/`(如 `search_eval.sqlite3`、`batch_reports/` 下的 JSON/Markdown)。流程与参数说明见 [scripts/evaluation/README.md](../scripts/evaluation/README.md)。
@@ -895,4 +895,3 @@ rerank_score:0.4784
 rerank_score:0.5849
 "zh": "新款女士修身仿旧牛仔短裤 – 休闲性感磨边水洗牛仔短裤,时尚舒",
 "en": "New Women's Slim-fit Vintage Washed Denim Shorts – Casual Sexy Frayed Hem, Fashionable & Comfortable"
-
docs/缓存与Redis使用说明.md
@@ -196,18 +196,25 @@ services:
- Config items:
  - `ANCHOR_CACHE_PREFIX = REDIS_CONFIG.get("anchor_cache_prefix", "product_anchors")`
  - `ANCHOR_CACHE_EXPIRE_DAYS = int(REDIS_CONFIG.get("anchor_cache_expire_days", 30))`
-- Key builder: `_make_anchor_cache_key(title, target_lang, tenant_id)`
+- Key builder: `_make_analysis_cache_key(product, target_lang, analysis_kind)`
- Template:

```text
-{ANCHOR_CACHE_PREFIX}:{tenant_or_global}:{target_lang}:{md5(title)}
+{ANCHOR_CACHE_PREFIX}:{analysis_kind}:{prompt_contract_hash}:{target_lang}:{prompt_input_prefix}{md5(prompt_input)}
```

- Field descriptions:
  - `ANCHOR_CACHE_PREFIX`: defaults to `"product_anchors"`; can be configured indirectly into `REDIS_CONFIG` via `REDIS_ANCHOR_CACHE_PREFIX` in `.env` (if present);
-  - `tenant_or_global`: `tenant_id` with surrounding whitespace stripped, or `"global"` when empty;
+  - `analysis_kind`: the analysis family, currently at least `content` and `taxonomy`, whose caches are isolated from each other;
+  - `prompt_contract_hash`: a short hash derived from the system prompt, shared instruction, localized headers, result fields, user instruction template, schema cache version, etc.; whenever the prompt or the output contract changes, the cache is invalidated automatically;
  - `target_lang`: the output language for content understanding, e.g. `zh`;
-  - `md5(title)`: MD5 of the raw product title (UTF-8).
+  - `prompt_input_prefix + md5(prompt_input)`: a prefix plus the MD5 of the text actually fed into the prompt; the current prompt input is the normalized concatenation of `title`, `brief`, and `description`.
+
+Design principles:
+
+- Only inputs that **actually affect the LLM output** participate in the key;
+- "Result ownership" information such as `tenant_id` or `spu_id` must not pollute the cache;
+- When the prompt or schema changes, the cache switches to new keys naturally, with no manual Redis cleanup required.
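The self-invalidation property can be sketched as follows. This is a minimal illustration, not the repository's actual helper; `make_cache_key` and the contract fields are placeholder names mirroring the template above:

```python
import hashlib
import json

def make_cache_key(prefix, analysis_kind, contract, target_lang, prompt_input):
    # Hash the full prompt/output contract; any change to it yields a new key family.
    contract_hash = hashlib.md5(
        json.dumps(contract, ensure_ascii=False, sort_keys=True).encode("utf-8")
    ).hexdigest()[:12]
    input_hash = hashlib.md5(prompt_input.encode("utf-8")).hexdigest()
    return f"{prefix}:{analysis_kind}:{contract_hash}:{target_lang}:{prompt_input[:4]}{input_hash}"

contract_v1 = {"shared_instruction": "Analyze the product.", "result_fields": ["tags"]}
contract_v2 = {"shared_instruction": "Analyze the product carefully.", "result_fields": ["tags"]}

k1 = make_cache_key("product_anchors", "content", contract_v1, "zh", "title brief")
k2 = make_cache_key("product_anchors", "content", contract_v2, "zh", "title brief")
# Same product text but a changed prompt contract -> different key, so stale entries are never read.
assert k1 != k2
```

Because the contract hash sits inside the key rather than the value, old entries simply age out via TTL instead of being served for a new prompt version.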

### 4.2 Value and types

@@ -229,6 +236,7 @@ services:
```

- The raw value is restored to a `Dict[str, Any]` via `json.loads(raw)`.
+- The value structures of `content` and `taxonomy` differ with their respective schemas, but both go through the shared normalize logic before being written to the cache.

### 4.3 Expiry policy

embeddings/README.md
@@ -98,10 +98,10 @@

### Performance and load testing (reusing repo scripts)

-- API-level benchmarking (same methodology as `perf_reports/2026-03-12/matrix_report/`): `scripts/perf_api_benchmark.py`
+- API-level benchmarking (same methodology as `perf_reports/2026-03-12/matrix_report/`): `benchmarks/perf_api_benchmark.py`
-  - Example: `python scripts/perf_api_benchmark.py --scenario embed_text --duration 30 --concurrency 20`
+  - Example: `python benchmarks/perf_api_benchmark.py --scenario embed_text --duration 30 --concurrency 20`
  - Text/image embedding requests may carry `priority` (same semantics as online admission): `--embed-text-priority 1`, `--embed-image-priority 1`
-  - Custom request templates: `--cases-file scripts/perf_cases.json.example`
+  - Custom request templates: `--cases-file benchmarks/perf_cases.json.example`
  - Historical matrix results and notes: `perf_reports/2026-03-12/matrix_report/summary.md`.

### Starting the service
frontend/static/js/app.js
@@ -316,7 +316,10 @@ async function performSearch(page = 1) {
    document.getElementById('productGrid').innerHTML = '';

    try {
-       const response = await fetch(`${API_BASE_URL}/search/`, {
+       const searchUrl = new URL(`${API_BASE_URL}/search/`, window.location.origin);
+       searchUrl.searchParams.set('tenant_id', tenantId);
+
+       const response = await fetch(searchUrl.toString(), {
            method: 'POST',
            headers: {
                'Content-Type': 'application/json',
@@ -8,7 +8,7 @@

### 1.1 System roles

-- **Java indexer (/home/tw/saas-server)**
+- **Java indexer**
  - Responsible for "**when, and for which SPUs, indexing happens**" (scheduling & triggering).
  - Responsible for **syncing base data such as products/shops/categories** (writes to MySQL).
  - Responsible for **full/incremental index scheduling in a multi-tenant environment**, no longer caring about per-doc field details.
indexer/Untitled 0 → 100644
@@ -0,0 +1 @@
+taxonomy
\ No newline at end of file
indexer/document_transformer.py
@@ -242,6 +242,7 @@ class SPUDocumentTransformer:
    - qanchors.{lang}
    - enriched_tags.{lang}
    - enriched_attributes[].value.{lang}
+   - enriched_taxonomy_attributes[].value.{lang}

    Design goals:
    - Batch LLM calls as much as possible;
@@ -273,7 +274,12 @@ class SPUDocumentTransformer:

        tenant_id = str(docs[0].get("tenant_id") or "").strip() or None
        try:
-           results = build_index_content_fields(items=items, tenant_id=tenant_id)
+           # TODO: read this tenant's real industry from the database and use it instead of the current default apparel profile.
+           results = build_index_content_fields(
+               items=items,
+               tenant_id=tenant_id,
+               category_taxonomy_profile="apparel",
+           )
        except Exception as e:
            logger.warning("LLM batch attribute fill failed: %s", e)
            return
@@ -296,6 +302,8 @@ class SPUDocumentTransformer:
                    doc["enriched_tags"] = enrichment["enriched_tags"]
                if enrichment.get("enriched_attributes"):
                    doc["enriched_attributes"] = enrichment["enriched_attributes"]
+               if enrichment.get("enriched_taxonomy_attributes"):
+                   doc["enriched_taxonomy_attributes"] = enrichment["enriched_taxonomy_attributes"]
            except Exception as e:
                logger.warning("Failed to apply enrichment to doc (spu_id=%s): %s", doc.get("spu_id"), e)

@@ -666,6 +674,7 @@ class SPUDocumentTransformer:

        tenant_id = doc.get("tenant_id")
        try:
+           # TODO: read this tenant's real industry from the database and use it instead of the current default apparel profile.
            results = build_index_content_fields(
                items=[
                    {
@@ -677,6 +686,7 @@ class SPUDocumentTransformer:
                    }
                ],
                tenant_id=str(tenant_id),
+               category_taxonomy_profile="apparel",
            )
        except Exception as e:
            logger.warning("LLM attribute fill failed for SPU %s: %s", spu_id, e)
indexer/product_enrich.py
@@ -14,10 +14,11 @@ import time
import hashlib
import uuid
import threading
+from dataclasses import dataclass, field
from collections import OrderedDict
from datetime import datetime
from concurrent.futures import ThreadPoolExecutor
-from typing import List, Dict, Tuple, Any, Optional
+from typing import List, Dict, Tuple, Any, Optional, FrozenSet

import redis
import requests
@@ -30,6 +31,7 @@ from indexer.product_enrich_prompts import (
    USER_INSTRUCTION_TEMPLATE,
    LANGUAGE_MARKDOWN_TABLE_HEADERS,
    SHARED_ANALYSIS_INSTRUCTION,
+   CATEGORY_TAXONOMY_PROFILES,
)

# Config
@@ -144,10 +146,26 @@
)


-# Multi-value field separators: ASCII comma, full-width comma, the Chinese enumeration comma, plus the historical ; | / and whitespace
+# Multi-value field separators
_MULTI_VALUE_FIELD_SPLIT_RE = re.compile(r"[,、,;|/\n\t]+")
+# Placeholders treated as "no content" in markdown table cells
+_MARKDOWN_EMPTY_CELL_LITERALS: Tuple[str, ...] = ("-", "–", "—", "none", "null", "n/a", "无")
+_MARKDOWN_EMPTY_CELL_TOKENS_CF: FrozenSet[str] = frozenset(
+    lit.casefold() for lit in _MARKDOWN_EMPTY_CELL_LITERALS
+)
+
+def _normalize_markdown_table_cell(raw: Optional[str]) -> str:
+    """Strip the cell; treat placeholder literals as empty strings."""
+    s = str(raw or "").strip()
+    if not s:
+        return ""
+    if s.casefold() in _MARKDOWN_EMPTY_CELL_TOKENS_CF:
+        return ""
+    return s
_CORE_INDEX_LANGUAGES = ("zh", "en")
-_ANALYSIS_ATTRIBUTE_FIELD_MAP = (
+_DEFAULT_ENRICHMENT_SCOPES = ("generic", "category_taxonomy")
+_DEFAULT_CATEGORY_TAXONOMY_PROFILE = "apparel"
+_CONTENT_ANALYSIS_ATTRIBUTE_FIELD_MAP = (
    ("tags", "enriched_tags"),
    ("target_audience", "target_audience"),
    ("usage_scene", "usage_scene"),
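The cell normalization added in this hunk can be exercised standalone. The sketch below reproduces the helper outside the module (names simplified) so its behavior on placeholders is visible:

```python
from typing import FrozenSet, Optional, Tuple

# Placeholders treated as "no content" in markdown table cells (mirrors the diff above).
_EMPTY_CELL_LITERALS: Tuple[str, ...] = ("-", "–", "—", "none", "null", "n/a", "无")
_EMPTY_CELL_TOKENS_CF: FrozenSet[str] = frozenset(s.casefold() for s in _EMPTY_CELL_LITERALS)

def normalize_markdown_table_cell(raw: Optional[str]) -> str:
    """Strip the cell; map placeholder literals to the empty string."""
    s = str(raw or "").strip()
    if not s or s.casefold() in _EMPTY_CELL_TOKENS_CF:
        return ""
    return s

assert normalize_markdown_table_cell("  N/A ") == ""
assert normalize_markdown_table_cell("—") == ""
assert normalize_markdown_table_cell(None) == ""
assert normalize_markdown_table_cell(" cotton ") == "cotton"
```

Using `casefold()` rather than `lower()` makes the placeholder match robust for non-ASCII case mappings.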
@@ -156,7 +174,7 @@ _ANALYSIS_ATTRIBUTE_FIELD_MAP = (
    ("material", "material"),
    ("features", "features"),
)
-_ANALYSIS_RESULT_FIELDS = (
+_CONTENT_ANALYSIS_RESULT_FIELDS = (
    "title",
    "category_path",
    "tags",
@@ -168,7 +186,7 @@ _ANALYSIS_RESULT_FIELDS = (
    "features",
    "anchor_text",
)
-_ANALYSIS_MEANINGFUL_FIELDS = (
+_CONTENT_ANALYSIS_MEANINGFUL_FIELDS = (
    "tags",
    "target_audience",
    "usage_scene",
@@ -178,9 +196,111 @@ _ANALYSIS_MEANINGFUL_FIELDS = (
    "material",
    "features",
    "anchor_text",
)
-_ANALYSIS_FIELD_ALIASES = {
+_CONTENT_ANALYSIS_FIELD_ALIASES = {
    "tags": ("tags", "enriched_tags"),
}
+_CONTENT_ANALYSIS_QUALITY_FIELDS = ("title", "category_path", "anchor_text")
+
+
+@dataclass(frozen=True)
+class AnalysisSchema:
+    name: str
+    shared_instruction: str
+    markdown_table_headers: Dict[str, List[str]]
+    result_fields: Tuple[str, ...]
+    meaningful_fields: Tuple[str, ...]
+    cache_version: str = "v1"
+    field_aliases: Dict[str, Tuple[str, ...]] = field(default_factory=dict)
+    quality_fields: Tuple[str, ...] = ()
+
+    def get_headers(self, target_lang: str) -> Optional[List[str]]:
+        return self.markdown_table_headers.get(target_lang)
+
+
+_ANALYSIS_SCHEMAS: Dict[str, AnalysisSchema] = {
+    "content": AnalysisSchema(
+        name="content",
+        shared_instruction=SHARED_ANALYSIS_INSTRUCTION,
+        markdown_table_headers=LANGUAGE_MARKDOWN_TABLE_HEADERS,
+        result_fields=_CONTENT_ANALYSIS_RESULT_FIELDS,
+        meaningful_fields=_CONTENT_ANALYSIS_MEANINGFUL_FIELDS,
+        cache_version="v2",
+        field_aliases=_CONTENT_ANALYSIS_FIELD_ALIASES,
+        quality_fields=_CONTENT_ANALYSIS_QUALITY_FIELDS,
+    ),
+}
+
+
+def _build_taxonomy_profile_schema(profile: str, config: Dict[str, Any]) -> AnalysisSchema:
+    # Note: the loop variable is named `f` to avoid shadowing `dataclasses.field`.
+    return AnalysisSchema(
+        name=f"taxonomy:{profile}",
+        shared_instruction=config["shared_instruction"],
+        markdown_table_headers=config["markdown_table_headers"],
+        result_fields=tuple(f["key"] for f in config["fields"]),
+        meaningful_fields=tuple(f["key"] for f in config["fields"]),
+        cache_version="v1",
+    )
+
+
+_CATEGORY_TAXONOMY_PROFILE_SCHEMAS: Dict[str, AnalysisSchema] = {
+    profile: _build_taxonomy_profile_schema(profile, config)
+    for profile, config in CATEGORY_TAXONOMY_PROFILES.items()
+}
+
+_CATEGORY_TAXONOMY_PROFILE_ATTRIBUTE_FIELD_MAPS: Dict[str, Tuple[Tuple[str, str], ...]] = {
+    profile: tuple((f["key"], f["label"]) for f in config["fields"])
+    for profile, config in CATEGORY_TAXONOMY_PROFILES.items()
+}
+
+
+def get_supported_category_taxonomy_profiles() -> Tuple[str, ...]:
+    return tuple(_CATEGORY_TAXONOMY_PROFILE_SCHEMAS.keys())
+
+
+def _normalize_category_taxonomy_profile(category_taxonomy_profile: Optional[str] = None) -> str:
+    profile = str(category_taxonomy_profile or _DEFAULT_CATEGORY_TAXONOMY_PROFILE).strip()
+    if profile not in _CATEGORY_TAXONOMY_PROFILE_SCHEMAS:
+        supported = ", ".join(get_supported_category_taxonomy_profiles())
+        raise ValueError(
+            f"Unsupported category_taxonomy_profile: {profile}. Supported profiles: {supported}"
+        )
+    return profile
+
+
+def _get_analysis_schema(
+    analysis_kind: str,
+    *,
+    category_taxonomy_profile: Optional[str] = None,
+) -> AnalysisSchema:
+    if analysis_kind == "content":
+        return _ANALYSIS_SCHEMAS["content"]
+    if analysis_kind == "taxonomy":
+        profile = _normalize_category_taxonomy_profile(category_taxonomy_profile)
+        return _CATEGORY_TAXONOMY_PROFILE_SCHEMAS[profile]
+    raise ValueError(f"Unsupported analysis_kind: {analysis_kind}")
+
+
+def _get_taxonomy_attribute_field_map(
+    category_taxonomy_profile: Optional[str] = None,
+) -> Tuple[Tuple[str, str], ...]:
+    profile = _normalize_category_taxonomy_profile(category_taxonomy_profile)
+    return _CATEGORY_TAXONOMY_PROFILE_ATTRIBUTE_FIELD_MAPS[profile]
+
+
+def _normalize_enrichment_scopes(
+    enrichment_scopes: Optional[List[str]] = None,
+) -> Tuple[str, ...]:
+    requested = _DEFAULT_ENRICHMENT_SCOPES if not enrichment_scopes else tuple(enrichment_scopes)
+    normalized: List[str] = []
+    seen = set()
+    for enrichment_scope in requested:
+        scope = str(enrichment_scope).strip()
+        if scope not in {"generic", "category_taxonomy"}:
+            raise ValueError(f"Unsupported enrichment_scope: {scope}")
+        if scope in seen:
+            continue
+        seen.add(scope)
+        normalized.append(scope)
+    return tuple(normalized)


def split_multi_value_field(text: Optional[str]) -> List[str]:
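The scope-handling contract in this hunk (default to both scopes, reject unknown values, dedupe while preserving order) behaves like this minimal standalone sketch:

```python
from typing import List, Optional, Tuple

_DEFAULT_SCOPES = ("generic", "category_taxonomy")

def normalize_enrichment_scopes(scopes: Optional[List[str]] = None) -> Tuple[str, ...]:
    # Fall back to the defaults, reject unknown scopes, dedupe while keeping first-seen order.
    requested = _DEFAULT_SCOPES if not scopes else tuple(scopes)
    out: List[str] = []
    seen = set()
    for raw in requested:
        scope = str(raw).strip()
        if scope not in {"generic", "category_taxonomy"}:
            raise ValueError(f"Unsupported enrichment_scope: {scope}")
        if scope not in seen:
            seen.add(scope)
            out.append(scope)
    return tuple(out)

assert normalize_enrichment_scopes() == ("generic", "category_taxonomy")
assert normalize_enrichment_scopes(["category_taxonomy", "category_taxonomy"]) == ("category_taxonomy",)
```

Note that an empty list also falls back to the defaults (`not scopes` is true for `[]`), which matches the diff's semantics of "no explicit scopes means run everything".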
@@ -235,12 +355,12 @@ def _get_product_id(product: Dict[str, Any]) -> str:
    return str(product.get("id") or product.get("spu_id") or "").strip()


-def _get_analysis_field_aliases(field_name: str) -> Tuple[str, ...]:
-    return _ANALYSIS_FIELD_ALIASES.get(field_name, (field_name,))
+def _get_analysis_field_aliases(field_name: str, schema: AnalysisSchema) -> Tuple[str, ...]:
+    return schema.field_aliases.get(field_name, (field_name,))


-def _get_analysis_field_value(row: Dict[str, Any], field_name: str) -> Any:
-    for alias in _get_analysis_field_aliases(field_name):
+def _get_analysis_field_value(row: Dict[str, Any], field_name: str, schema: AnalysisSchema) -> Any:
+    for alias in _get_analysis_field_aliases(field_name, schema):
        if alias in row:
            return row.get(alias)
    return None
@@ -261,6 +381,7 @@ def _has_meaningful_value(value: Any) -> bool:
def _make_empty_analysis_result(
    product: Dict[str, Any],
    target_lang: str,
+   schema: AnalysisSchema,
    error: Optional[str] = None,
) -> Dict[str, Any]:
    result = {
@@ -268,7 +389,7 @@ def _make_empty_analysis_result(
        "lang": target_lang,
        "title_input": str(product.get("title") or "").strip(),
    }
-   for field in _ANALYSIS_RESULT_FIELDS:
+   for field in schema.result_fields:
        result[field] = ""
    if error:
        result["error"] = error
@@ -279,42 +400,59 @@ def _normalize_analysis_result(
    result: Dict[str, Any],
    product: Dict[str, Any],
    target_lang: str,
+   schema: AnalysisSchema,
) -> Dict[str, Any]:
-   normalized = _make_empty_analysis_result(product, target_lang)
+   normalized = _make_empty_analysis_result(product, target_lang, schema)
    if not isinstance(result, dict):
        return normalized

    normalized["lang"] = str(result.get("lang") or target_lang).strip() or target_lang
-   normalized["title"] = str(result.get("title") or "").strip()
-   normalized["category_path"] = str(result.get("category_path") or "").strip()
    normalized["title_input"] = str(
        product.get("title") or result.get("title_input") or ""
    ).strip()

-   for field in _ANALYSIS_RESULT_FIELDS:
-       if field in {"title", "category_path"}:
-           continue
-       normalized[field] = str(_get_analysis_field_value(result, field) or "").strip()
+   for field in schema.result_fields:
+       normalized[field] = str(_get_analysis_field_value(result, field, schema) or "").strip()

    if result.get("error"):
        normalized["error"] = str(result.get("error"))
    return normalized


-def _has_meaningful_analysis_content(result: Dict[str, Any]) -> bool:
-    return any(_has_meaningful_value(result.get(field)) for field in _ANALYSIS_MEANINGFUL_FIELDS)
+def _has_meaningful_analysis_content(result: Dict[str, Any], schema: AnalysisSchema) -> bool:
+    return any(_has_meaningful_value(result.get(field)) for field in schema.meaningful_fields)
+
+
+def _append_analysis_attributes(
+    target: List[Dict[str, Any]],
+    row: Dict[str, Any],
+    lang: str,
+    schema: AnalysisSchema,
+    field_map: Tuple[Tuple[str, str], ...],
+) -> None:
+    for source_name, output_name in field_map:
+        raw = _get_analysis_field_value(row, source_name, schema)
+        if not raw:
+            continue
+        _append_named_lang_phrase_map(
+            target,
+            name=output_name,
+            lang=lang,
+            raw_value=raw,
+        )


def _apply_index_content_row(result: Dict[str, Any], row: Dict[str, Any], lang: str) -> None:
    if not row or row.get("error"):
        return

-   anchor_text = str(_get_analysis_field_value(row, "anchor_text") or "").strip()
+   content_schema = _get_analysis_schema("content")
+   anchor_text = str(_get_analysis_field_value(row, "anchor_text", content_schema) or "").strip()
    if anchor_text:
        _append_lang_phrase_map(result["qanchors"], lang=lang, raw_value=anchor_text)

-   for source_name, output_name in _ANALYSIS_ATTRIBUTE_FIELD_MAP:
-       raw = _get_analysis_field_value(row, source_name)
+   for source_name, output_name in _CONTENT_ANALYSIS_ATTRIBUTE_FIELD_MAP:
+       raw = _get_analysis_field_value(row, source_name, content_schema)
        if not raw:
            continue
        _append_named_lang_phrase_map(
@@ -327,6 +465,28 @@ def _apply_index_content_row(result: Dict[str, Any], row: Dict[str, Any], lang:
        _append_lang_phrase_map(result["enriched_tags"], lang=lang, raw_value=raw)


+def _apply_index_taxonomy_row(
+    result: Dict[str, Any],
+    row: Dict[str, Any],
+    lang: str,
+    *,
+    category_taxonomy_profile: Optional[str] = None,
+) -> None:
+    if not row or row.get("error"):
+        return
+
+    _append_analysis_attributes(
+        result["enriched_taxonomy_attributes"],
+        row=row,
+        lang=lang,
+        schema=_get_analysis_schema(
+            "taxonomy",
+            category_taxonomy_profile=category_taxonomy_profile,
+        ),
+        field_map=_get_taxonomy_attribute_field_map(category_taxonomy_profile),
+    )
+
+
def _normalize_index_content_item(item: Dict[str, Any]) -> Dict[str, str]:
    item_id = _get_product_id(item)
    return {
@@ -341,6 +501,8 @@ def _normalize_index_content_item(item: Dict[str, Any]) -> Dict[str, str]:
def build_index_content_fields(
    items: List[Dict[str, Any]],
    tenant_id: Optional[str] = None,
+   enrichment_scopes: Optional[List[str]] = None,
+   category_taxonomy_profile: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """
    High-level entry point: generate content-understanding fields aligned with the ES mapping.
@@ -349,18 +511,23 @@ def build_index_content_fields(
    - `id` or `spu_id`
    - `title`
    - Optional `brief` / `description` / `image_url`
+   - Optional `enrichment_scopes`; by default both `generic` and `category_taxonomy` are executed
+   - Optional `category_taxonomy_profile`, defaulting to `apparel`

    Returned item structure:
    - `id`
    - `qanchors`
    - `enriched_tags`
    - `enriched_attributes`
+   - `enriched_taxonomy_attributes`
    - Optional `error`

    Where:
    - `qanchors.{lang}` is an array of phrases
    - `enriched_tags.{lang}` is an array of tags
    """
+   requested_enrichment_scopes = _normalize_enrichment_scopes(enrichment_scopes)
+   normalized_taxonomy_profile = _normalize_category_taxonomy_profile(category_taxonomy_profile)
    normalized_items = [_normalize_index_content_item(item) for item in items]
    if not normalized_items:
        return []
@@ -371,32 +538,72 @@ def build_index_content_fields(
                "qanchors": {},
                "enriched_tags": {},
                "enriched_attributes": [],
+               "enriched_taxonomy_attributes": [],
            }
            for item in normalized_items
        }

    for lang in _CORE_INDEX_LANGUAGES:
-       try:
-           rows = analyze_products(
-               products=normalized_items,
-               target_lang=lang,
-               batch_size=BATCH_SIZE,
-               tenant_id=tenant_id,
-           )
-       except Exception as e:
-           logger.warning("build_index_content_fields failed for lang=%s: %s", lang, e)
-           for item in normalized_items:
-               results_by_id[item["id"]].setdefault("error", str(e))
-           continue
-
-       for row in rows or []:
-           item_id = str(row.get("id") or "").strip()
-           if not item_id or item_id not in results_by_id:
-               continue
-           if row.get("error"):
-               results_by_id[item_id].setdefault("error", row["error"])
-               continue
-           _apply_index_content_row(results_by_id[item_id], row=row, lang=lang)
+       if "generic" in requested_enrichment_scopes:
+           try:
+               rows = analyze_products(
+                   products=normalized_items,
+                   target_lang=lang,
+                   batch_size=BATCH_SIZE,
+                   tenant_id=tenant_id,
+                   analysis_kind="content",
+                   category_taxonomy_profile=normalized_taxonomy_profile,
+               )
+           except Exception as e:
+               logger.warning("build_index_content_fields content enrichment failed for lang=%s: %s", lang, e)
+               for item in normalized_items:
+                   results_by_id[item["id"]].setdefault("error", str(e))
+               continue
+
+           for row in rows or []:
+               item_id = str(row.get("id") or "").strip()
+               if not item_id or item_id not in results_by_id:
+                   continue
+               if row.get("error"):
+                   results_by_id[item_id].setdefault("error", row["error"])
+                   continue
+               _apply_index_content_row(results_by_id[item_id], row=row, lang=lang)
+
+   if "category_taxonomy" in requested_enrichment_scopes:
+       for lang in _CORE_INDEX_LANGUAGES:
+           try:
+               taxonomy_rows = analyze_products(
+                   products=normalized_items,
+                   target_lang=lang,
+                   batch_size=BATCH_SIZE,
+                   tenant_id=tenant_id,
+                   analysis_kind="taxonomy",
+                   category_taxonomy_profile=normalized_taxonomy_profile,
+               )
+           except Exception as e:
+               logger.warning(
+                   "build_index_content_fields taxonomy enrichment failed for profile=%s lang=%s: %s",
+                   normalized_taxonomy_profile,
+                   lang,
+                   e,
+               )
+               for item in normalized_items:
+                   results_by_id[item["id"]].setdefault("error", str(e))
+               continue
+
+           for row in taxonomy_rows or []:
+               item_id = str(row.get("id") or "").strip()
+               if not item_id or item_id not in results_by_id:
+                   continue
+               if row.get("error"):
+                   results_by_id[item_id].setdefault("error", row["error"])
+                   continue
+               _apply_index_taxonomy_row(
+                   results_by_id[item_id],
+                   row=row,
+                   lang=lang,
+                   category_taxonomy_profile=normalized_taxonomy_profile,
+               )

    return [results_by_id[item["id"]] for item in normalized_items]

@@ -463,52 +670,129 @@ def _build_prompt_input_text(product: Dict[str, Any]) -&gt; str: @@ -463,52 +670,129 @@ def _build_prompt_input_text(product: Dict[str, Any]) -&gt; str:
463 return _truncate_by_words(candidate, PROMPT_INPUT_MAX_WORDS) 670 return _truncate_by_words(candidate, PROMPT_INPUT_MAX_WORDS)
464 671
465 672
466 -def _make_anchor_cache_key( 673 +def _make_analysis_cache_key(
467 product: Dict[str, Any], 674 product: Dict[str, Any],
468 target_lang: str, 675 target_lang: str,
  676 + analysis_kind: str,
  677 + category_taxonomy_profile: Optional[str] = None,
469 ) -> str: 678 ) -> str:
470 - """构造缓存 key,仅由 prompt 实际输入文本内容 + 目标语言决定。""" 679 + """构造缓存 key,仅由分析类型、prompt 实际输入文本内容与目标语言决定。"""
  680 + schema = _get_analysis_schema(
  681 + analysis_kind,
  682 + category_taxonomy_profile=category_taxonomy_profile,
  683 + )
471 prompt_input = _build_prompt_input_text(product) 684 prompt_input = _build_prompt_input_text(product)
472 h = hashlib.md5(prompt_input.encode("utf-8")).hexdigest() 685 h = hashlib.md5(prompt_input.encode("utf-8")).hexdigest()
473 - return f"{ANCHOR_CACHE_PREFIX}:{target_lang}:{prompt_input[:4]}{h}" 686 + prompt_contract = {
  687 + "schema_name": schema.name,
  688 + "cache_version": schema.cache_version,
  689 + "system_message": SYSTEM_MESSAGE,
  690 + "user_instruction_template": USER_INSTRUCTION_TEMPLATE,
  691 + "shared_instruction": schema.shared_instruction,
  692 + "assistant_headers": schema.get_headers(target_lang),
  693 + "result_fields": schema.result_fields,
  694 + "meaningful_fields": schema.meaningful_fields,
  695 + "field_aliases": schema.field_aliases,
  696 + }
  697 + prompt_contract_hash = hashlib.md5(
  698 + json.dumps(prompt_contract, ensure_ascii=False, sort_keys=True).encode("utf-8")
  699 + ).hexdigest()[:12]
  700 + return (
  701 + f"{ANCHOR_CACHE_PREFIX}:{analysis_kind}:{prompt_contract_hash}:"
  702 + f"{target_lang}:{prompt_input[:4]}{h}"
  703 + )
474 704
475 705
476 -def _get_cached_anchor_result( 706 +def _make_anchor_cache_key(
477 product: Dict[str, Any], 707 product: Dict[str, Any],
478 target_lang: str, 708 target_lang: str,
  709 +) -> str:
  710 + return _make_analysis_cache_key(product, target_lang, analysis_kind="content")
  711 +
  712 +
  713 +def _get_cached_analysis_result(
  714 + product: Dict[str, Any],
  715 + target_lang: str,
  716 + analysis_kind: str,
  717 + category_taxonomy_profile: Optional[str] = None,
479 ) -> Optional[Dict[str, Any]]: 718 ) -> Optional[Dict[str, Any]]:
480 if not _anchor_redis: 719 if not _anchor_redis:
481 return None 720 return None
  721 + schema = _get_analysis_schema(
  722 + analysis_kind,
  723 + category_taxonomy_profile=category_taxonomy_profile,
  724 + )
482 try: 725 try:
483 - key = _make_anchor_cache_key(product, target_lang) 726 + key = _make_analysis_cache_key(
  727 + product,
  728 + target_lang,
  729 + analysis_kind,
  730 + category_taxonomy_profile=category_taxonomy_profile,
  731 + )
484 raw = _anchor_redis.get(key) 732 raw = _anchor_redis.get(key)
485 if not raw: 733 if not raw:
486 return None 734 return None
487 - result = _normalize_analysis_result(json.loads(raw), product=product, target_lang=target_lang)  
488 - if not _has_meaningful_analysis_content(result): 735 + result = _normalize_analysis_result(
  736 + json.loads(raw),
  737 + product=product,
  738 + target_lang=target_lang,
  739 + schema=schema,
  740 + )
  741 + if not _has_meaningful_analysis_content(result, schema):
489 return None 742 return None
490 return result 743 return result
491 except Exception as e: 744 except Exception as e:
492 - logger.warning(f"Failed to get anchor cache: {e}") 745 + logger.warning("Failed to get %s analysis cache: %s", analysis_kind, e)
493 return None 746 return None
494 747
495 748
496 -def _set_cached_anchor_result( 749 +def _get_cached_anchor_result(
  750 + product: Dict[str, Any],
  751 + target_lang: str,
  752 +) -> Optional[Dict[str, Any]]:
  753 + return _get_cached_analysis_result(product, target_lang, analysis_kind="content")
  754 +
  755 +
  756 +def _set_cached_analysis_result(
497 product: Dict[str, Any], 757 product: Dict[str, Any],
498 target_lang: str, 758 target_lang: str,
499 result: Dict[str, Any], 759 result: Dict[str, Any],
  760 + analysis_kind: str,
  761 + category_taxonomy_profile: Optional[str] = None,
500 ) -> None: 762 ) -> None:
501 if not _anchor_redis: 763 if not _anchor_redis:
502 return 764 return
  765 + schema = _get_analysis_schema(
  766 + analysis_kind,
  767 + category_taxonomy_profile=category_taxonomy_profile,
  768 + )
503 try: 769 try:
504 - normalized = _normalize_analysis_result(result, product=product, target_lang=target_lang)  
505 - if not _has_meaningful_analysis_content(normalized): 770 + normalized = _normalize_analysis_result(
  771 + result,
  772 + product=product,
  773 + target_lang=target_lang,
  774 + schema=schema,
  775 + )
  776 + if not _has_meaningful_analysis_content(normalized, schema):
506 return 777 return
507 - key = _make_anchor_cache_key(product, target_lang) 778 + key = _make_analysis_cache_key(
  779 + product,
  780 + target_lang,
  781 + analysis_kind,
  782 + category_taxonomy_profile=category_taxonomy_profile,
  783 + )
508 ttl = ANCHOR_CACHE_EXPIRE_DAYS * 24 * 3600 784 ttl = ANCHOR_CACHE_EXPIRE_DAYS * 24 * 3600
509 _anchor_redis.setex(key, ttl, json.dumps(normalized, ensure_ascii=False)) 785 _anchor_redis.setex(key, ttl, json.dumps(normalized, ensure_ascii=False))
510 except Exception as e: 786 except Exception as e:
511 - logger.warning(f"Failed to set anchor cache: {e}") 787 + logger.warning("Failed to set %s analysis cache: %s", analysis_kind, e)
  788 +
  789 +
  790 +def _set_cached_anchor_result(
  791 + product: Dict[str, Any],
  792 + target_lang: str,
  793 + result: Dict[str, Any],
  794 +) -> None:
  795 + _set_cached_analysis_result(product, target_lang, result, analysis_kind="content")
512 796
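The hunk above routes the legacy anchor setter through `_set_cached_analysis_result`, which now passes `analysis_kind` (and an optional `category_taxonomy_profile`) down to the key builder. The internals of `_make_analysis_cache_key` are not shown in this diff, so the following is an illustrative sketch of why both values must participate in the key: without them, a "content" result and a taxonomy result for the same product and language would collide on one Redis key. The function name and hashing scheme here are assumptions, not the codebase's implementation.

```python
import hashlib
import json


def make_analysis_cache_key(product, target_lang, analysis_kind,
                            category_taxonomy_profile=None):
    # Illustrative only: the real _make_analysis_cache_key is not shown in
    # this diff. The property it demonstrates is that analysis_kind and the
    # taxonomy profile must be part of the key, or different analysis kinds
    # for the same product/lang would overwrite each other under setex.
    payload = json.dumps(
        {
            "title": product.get("title", ""),
            "lang": target_lang,
            "kind": analysis_kind,
            "profile": category_taxonomy_profile,
        },
        ensure_ascii=False,
        sort_keys=True,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return f"analysis:{analysis_kind}:{target_lang}:{digest}"


product = {"title": "遥控车"}
content_key = make_analysis_cache_key(product, "zh", "content")
taxonomy_key = make_analysis_cache_key(product, "zh", "taxonomy", "apparel")
assert content_key != taxonomy_key  # kinds never share a cache entry
```

Because the key is deterministic for identical inputs, repeated calls for the same `(product, lang, kind, profile)` tuple still hit the same cached entry.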
513 797
514 def _build_assistant_prefix(headers: List[str]) -> str: 798 def _build_assistant_prefix(headers: List[str]) -> str:
@@ -517,8 +801,8 @@ def _build_assistant_prefix(headers: List[str]) -> str: @@ -517,8 +801,8 @@ def _build_assistant_prefix(headers: List[str]) -> str:
517 return f"{header_line}\n{separator_line}\n" 801 return f"{header_line}\n{separator_line}\n"
518 802
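Only the final `return f"{header_line}\n{separator_line}\n"` of `_build_assistant_prefix` is visible in this hunk, so the row formatting below is a hypothetical reconstruction, not the file's actual body. It shows the Partial Mode idea the function serves: seed the assistant turn with the table header and separator rows so the model can only continue with data rows.

```python
def build_assistant_prefix(headers):
    # Hypothetical reconstruction: the diff shows only the return statement,
    # so the exact cell formatting is assumed. The prefix pins the markdown
    # table header + separator, forcing the completion to start at row 1.
    header_line = "| " + " | ".join(headers) + " |"
    separator_line = "|" + "|".join(["---"] * len(headers)) + "|"
    return f"{header_line}\n{separator_line}\n"


prefix = build_assistant_prefix(["No.", "Product Type", "Color"])
print(prefix)
```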
519 803
520 -def _build_shared_context(products: List[Dict[str, str]]) -> str:  
521 - shared_context = SHARED_ANALYSIS_INSTRUCTION 804 +def _build_shared_context(products: List[Dict[str, str]], schema: AnalysisSchema) -> str:
  805 + shared_context = schema.shared_instruction
522 for idx, product in enumerate(products, 1): 806 for idx, product in enumerate(products, 1):
523 prompt_input = _build_prompt_input_text(product) 807 prompt_input = _build_prompt_input_text(product)
524 shared_context += f"{idx}. {prompt_input}\n" 808 shared_context += f"{idx}. {prompt_input}\n"
@@ -550,16 +834,23 @@ def reset_logged_shared_context_keys() -> None: @@ -550,16 +834,23 @@ def reset_logged_shared_context_keys() -> None:
550 def create_prompt( 834 def create_prompt(
551 products: List[Dict[str, str]], 835 products: List[Dict[str, str]],
552 target_lang: str = "zh", 836 target_lang: str = "zh",
553 -) -> Tuple[str, str, str]: 837 + analysis_kind: str = "content",
  838 + category_taxonomy_profile: Optional[str] = None,
  839 +) -> Tuple[Optional[str], Optional[str], Optional[str]]:
554 """根据目标语言创建共享上下文、本地化输出要求和 Partial Mode assistant 前缀。""" 840 """根据目标语言创建共享上下文、本地化输出要求和 Partial Mode assistant 前缀。"""
555 - markdown_table_headers = LANGUAGE_MARKDOWN_TABLE_HEADERS.get(target_lang) 841 + schema = _get_analysis_schema(
  842 + analysis_kind,
  843 + category_taxonomy_profile=category_taxonomy_profile,
  844 + )
  845 + markdown_table_headers = schema.get_headers(target_lang)
556 if not markdown_table_headers: 846 if not markdown_table_headers:
557 logger.warning( 847 logger.warning(
558 - "Unsupported target_lang for markdown table headers: %s", 848 + "Unsupported target_lang for markdown table headers: kind=%s lang=%s",
  849 + analysis_kind,
559 target_lang, 850 target_lang,
560 ) 851 )
561 return None, None, None 852 return None, None, None
562 - shared_context = _build_shared_context(products) 853 + shared_context = _build_shared_context(products, schema)
563 language_label = SOURCE_LANG_CODE_MAP.get(target_lang, target_lang) 854 language_label = SOURCE_LANG_CODE_MAP.get(target_lang, target_lang)
564 user_prompt = USER_INSTRUCTION_TEMPLATE.format(language=language_label).strip() 855 user_prompt = USER_INSTRUCTION_TEMPLATE.format(language=language_label).strip()
565 assistant_prefix = _build_assistant_prefix(markdown_table_headers) 856 assistant_prefix = _build_assistant_prefix(markdown_table_headers)
@@ -592,6 +883,7 @@ def call_llm( @@ -592,6 +883,7 @@ def call_llm(
592 user_prompt: str, 883 user_prompt: str,
593 assistant_prefix: str, 884 assistant_prefix: str,
594 target_lang: str = "zh", 885 target_lang: str = "zh",
  886 + analysis_kind: str = "content",
595 ) -> Tuple[str, str]: 887 ) -> Tuple[str, str]:
596 """调用大模型 API(带重试机制),使用 Partial Mode 强制 markdown 表格前缀。""" 888 """调用大模型 API(带重试机制),使用 Partial Mode 强制 markdown 表格前缀。"""
597 headers = { 889 headers = {
@@ -631,8 +923,9 @@ def call_llm( @@ -631,8 +923,9 @@ def call_llm(
631 if _mark_shared_context_logged_once(shared_context_key): 923 if _mark_shared_context_logged_once(shared_context_key):
632 logger.info(f"\n{'=' * 80}") 924 logger.info(f"\n{'=' * 80}")
633 logger.info( 925 logger.info(
634 - "LLM Shared Context [model=%s, shared_key=%s, chars=%s] (logged once per process key)", 926 + "LLM Shared Context [model=%s, kind=%s, shared_key=%s, chars=%s] (logged once per process key)",
635 MODEL_NAME, 927 MODEL_NAME,
  928 + analysis_kind,
636 shared_context_key, 929 shared_context_key,
637 len(shared_context), 930 len(shared_context),
638 ) 931 )
@@ -641,8 +934,9 @@ def call_llm( @@ -641,8 +934,9 @@ def call_llm(
641 934
642 verbose_logger.info(f"\n{'=' * 80}") 935 verbose_logger.info(f"\n{'=' * 80}")
643 verbose_logger.info( 936 verbose_logger.info(
644 - "LLM Request [model=%s, lang=%s, shared_key=%s, tail_key=%s]:", 937 + "LLM Request [model=%s, kind=%s, lang=%s, shared_key=%s, tail_key=%s]:",
645 MODEL_NAME, 938 MODEL_NAME,
  939 + analysis_kind,
646 target_lang, 940 target_lang,
647 shared_context_key, 941 shared_context_key,
648 localized_tail_key, 942 localized_tail_key,
@@ -654,7 +948,8 @@ def call_llm( @@ -654,7 +948,8 @@ def call_llm(
654 verbose_logger.info(f"\nAssistant Prefix:\n{assistant_prefix}") 948 verbose_logger.info(f"\nAssistant Prefix:\n{assistant_prefix}")
655 949
656 logger.info( 950 logger.info(
657 - "\nLLM Request Variant [lang=%s, shared_key=%s, tail_key=%s, prompt_chars=%s, prefix_chars=%s]", 951 + "\nLLM Request Variant [kind=%s, lang=%s, shared_key=%s, tail_key=%s, prompt_chars=%s, prefix_chars=%s]",
  952 + analysis_kind,
658 target_lang, 953 target_lang,
659 shared_context_key, 954 shared_context_key,
660 localized_tail_key, 955 localized_tail_key,
@@ -685,8 +980,9 @@ def call_llm( @@ -685,8 +980,9 @@ def call_llm(
685 usage = result.get("usage") or {} 980 usage = result.get("usage") or {}
686 981
687 verbose_logger.info( 982 verbose_logger.info(
688 - "\nLLM Response [model=%s, lang=%s, shared_key=%s, tail_key=%s]:", 983 + "\nLLM Response [model=%s, kind=%s, lang=%s, shared_key=%s, tail_key=%s]:",
689 MODEL_NAME, 984 MODEL_NAME,
  985 + analysis_kind,
690 target_lang, 986 target_lang,
691 shared_context_key, 987 shared_context_key,
692 localized_tail_key, 988 localized_tail_key,
@@ -697,7 +993,8 @@ def call_llm( @@ -697,7 +993,8 @@ def call_llm(
697 full_markdown = _merge_partial_response(assistant_prefix, generated_content) 993 full_markdown = _merge_partial_response(assistant_prefix, generated_content)
698 994
699 logger.info( 995 logger.info(
700 - "\nLLM Response Summary [lang=%s, shared_key=%s, tail_key=%s, generated_chars=%s, completion_tokens=%s, prompt_tokens=%s, total_tokens=%s]", 996 + "\nLLM Response Summary [kind=%s, lang=%s, shared_key=%s, tail_key=%s, generated_chars=%s, completion_tokens=%s, prompt_tokens=%s, total_tokens=%s]",
  997 + analysis_kind,
701 target_lang, 998 target_lang,
702 shared_context_key, 999 shared_context_key,
703 localized_tail_key, 1000 localized_tail_key,
@@ -742,8 +1039,16 @@ def call_llm( @@ -742,8 +1039,16 @@ def call_llm(
742 session.close() 1039 session.close()
743 1040
744 1041
745 -def parse_markdown_table(markdown_content: str) -> List[Dict[str, str]]: 1042 +def parse_markdown_table(
  1043 + markdown_content: str,
  1044 + analysis_kind: str = "content",
  1045 + category_taxonomy_profile: Optional[str] = None,
  1046 +) -> List[Dict[str, str]]:
746 """解析markdown表格内容""" 1047 """解析markdown表格内容"""
  1048 + schema = _get_analysis_schema(
  1049 + analysis_kind,
  1050 + category_taxonomy_profile=category_taxonomy_profile,
  1051 + )
747 lines = markdown_content.strip().split("\n") 1052 lines = markdown_content.strip().split("\n")
748 data = [] 1053 data = []
749 data_started = False 1054 data_started = False
@@ -768,22 +1073,16 @@ def parse_markdown_table(markdown_content: str) -> List[Dict[str, str]]: @@ -768,22 +1073,16 @@ def parse_markdown_table(markdown_content: str) -> List[Dict[str, str]]:
768 1073
769 # 解析数据行 1074 # 解析数据行
770 parts = [p.strip() for p in line.split("|")] 1075 parts = [p.strip() for p in line.split("|")]
771 - parts = [p for p in parts if p] # 移除空字符串 1076 + if parts and parts[0] == "":
  1077 + parts = parts[1:]
  1078 + if parts and parts[-1] == "":
  1079 + parts = parts[:-1]
772 1080
773 if len(parts) >= 2: 1081 if len(parts) >= 2:
774 - row = {  
775 - "seq_no": parts[0],  
776 - "title": parts[1], # 商品标题(按目标语言)  
777 - "category_path": parts[2] if len(parts) > 2 else "", # 品类路径  
778 - "tags": parts[3] if len(parts) > 3 else "", # 细分标签  
779 - "target_audience": parts[4] if len(parts) > 4 else "", # 适用人群  
780 - "usage_scene": parts[5] if len(parts) > 5 else "", # 使用场景  
781 - "season": parts[6] if len(parts) > 6 else "", # 适用季节  
782 - "key_attributes": parts[7] if len(parts) > 7 else "", # 关键属性  
783 - "material": parts[8] if len(parts) > 8 else "", # 材质说明  
784 - "features": parts[9] if len(parts) > 9 else "", # 功能特点  
785 - "anchor_text": parts[10] if len(parts) > 10 else "", # 锚文本  
786 - } 1082 + row = {"seq_no": parts[0]}
  1083 + for field_index, field_name in enumerate(schema.result_fields, start=1):
  1084 + cell = parts[field_index] if len(parts) > field_index else ""
  1085 + row[field_name] = _normalize_markdown_table_cell(cell)
787 data.append(row) 1086 data.append(row)
788 1087
789 return data 1088 return data
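The change from `parts = [p for p in parts if p]` to trimming only the boundary cells matters once a row can legitimately contain blank columns: filtering out every empty string deletes blank interior cells and shifts all later columns left, corrupting the field mapping. A minimal sketch of the two behaviors (the helper name `split_markdown_row` is illustrative, not from the codebase):

```python
def split_markdown_row(line):
    # New behavior: strip only the empties produced by the leading and
    # trailing "|" so blank interior cells keep their column position.
    parts = [p.strip() for p in line.split("|")]
    if parts and parts[0] == "":
        parts = parts[1:]
    if parts and parts[-1] == "":
        parts = parts[:-1]
    return parts


row = "| 1 | Red Dress |  | summer |"
print(split_markdown_row(row))
# Old behavior: dropping every empty cell silently removes the blank
# third column, so "summer" would be read as the third field.
print([p.strip() for p in row.split("|") if p.strip()])
```

With the boundary-only trim, the row parses to `["1", "Red Dress", "", "summer"]`, keeping `summer` aligned with its fourth column.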
@@ -794,31 +1093,49 @@ def _log_parsed_result_quality( @@ -794,31 +1093,49 @@ def _log_parsed_result_quality(
794 parsed_results: List[Dict[str, str]], 1093 parsed_results: List[Dict[str, str]],
795 target_lang: str, 1094 target_lang: str,
796 batch_num: int, 1095 batch_num: int,
  1096 + analysis_kind: str,
  1097 + category_taxonomy_profile: Optional[str] = None,
797 ) -> None: 1098 ) -> None:
  1099 + schema = _get_analysis_schema(
  1100 + analysis_kind,
  1101 + category_taxonomy_profile=category_taxonomy_profile,
  1102 + )
798 expected = len(batch_data) 1103 expected = len(batch_data)
799 actual = len(parsed_results) 1104 actual = len(parsed_results)
800 if actual != expected: 1105 if actual != expected:
801 logger.warning( 1106 logger.warning(
802 - "Parsed row count mismatch for batch=%s lang=%s: expected=%s actual=%s", 1107 + "Parsed row count mismatch for kind=%s batch=%s lang=%s: expected=%s actual=%s",
  1108 + analysis_kind,
803 batch_num, 1109 batch_num,
804 target_lang, 1110 target_lang,
805 expected, 1111 expected,
806 actual, 1112 actual,
807 ) 1113 )
808 1114
809 - missing_anchor = sum(1 for item in parsed_results if not str(item.get("anchor_text") or "").strip())  
810 - missing_category = sum(1 for item in parsed_results if not str(item.get("category_path") or "").strip())  
811 - missing_title = sum(1 for item in parsed_results if not str(item.get("title") or "").strip()) 1115 + if not schema.quality_fields:
  1116 + logger.info(
  1117 + "Parsed Quality Summary [kind=%s, batch=%s, lang=%s]: rows=%s/%s",
  1118 + analysis_kind,
  1119 + batch_num,
  1120 + target_lang,
  1121 + actual,
  1122 + expected,
  1123 + )
  1124 + return
812 1125
  1126 + missing_summary = ", ".join(
  1127 + f"missing_{field}="
  1128 + f"{sum(1 for item in parsed_results if not str(item.get(field) or '').strip())}"
  1129 + for field in schema.quality_fields
  1130 + )
813 logger.info( 1131 logger.info(
814 - "Parsed Quality Summary [batch=%s, lang=%s]: rows=%s/%s, missing_title=%s, missing_category=%s, missing_anchor=%s", 1132 + "Parsed Quality Summary [kind=%s, batch=%s, lang=%s]: rows=%s/%s, %s",
  1133 + analysis_kind,
815 batch_num, 1134 batch_num,
816 target_lang, 1135 target_lang,
817 actual, 1136 actual,
818 expected, 1137 expected,
819 - missing_title,  
820 - missing_category,  
821 - missing_anchor, 1138 + missing_summary,
822 ) 1139 )
823 1140
824 1141
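The rewritten quality log replaces three hardcoded counters (`missing_title`, `missing_category`, `missing_anchor`) with a comprehension driven by `schema.quality_fields`, so each analysis kind reports only its own fields. A self-contained sketch of that summary expression with sample data (the field names are examples, not tied to any one schema):

```python
parsed_results = [
    {"title": "Red Dress", "category_path": "", "anchor_text": "red dress"},
    {"title": "", "category_path": "Apparel > Dresses", "anchor_text": ""},
]
quality_fields = ("title", "category_path", "anchor_text")

# For each quality field, count rows where the value is missing or blank,
# then join the counts into one log-friendly string.
missing_summary = ", ".join(
    f"missing_{field}="
    f"{sum(1 for item in parsed_results if not str(item.get(field) or '').strip())}"
    for field in quality_fields
)
print(missing_summary)
# missing_title=1, missing_category_path=1, missing_anchor_text=1
```

Because the summary is built from the schema tuple, adding or removing a quality field changes the log line with no edits to the logging code.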
@@ -826,29 +1143,44 @@ def process_batch( @@ -826,29 +1143,44 @@ def process_batch(
826 batch_data: List[Dict[str, str]], 1143 batch_data: List[Dict[str, str]],
827 batch_num: int, 1144 batch_num: int,
828 target_lang: str = "zh", 1145 target_lang: str = "zh",
  1146 + analysis_kind: str = "content",
  1147 + category_taxonomy_profile: Optional[str] = None,
829 ) -> List[Dict[str, Any]]: 1148 ) -> List[Dict[str, Any]]:
830 """处理一个批次的数据""" 1149 """处理一个批次的数据"""
  1150 + schema = _get_analysis_schema(
  1151 + analysis_kind,
  1152 + category_taxonomy_profile=category_taxonomy_profile,
  1153 + )
831 logger.info(f"\n{'#' * 80}") 1154 logger.info(f"\n{'#' * 80}")
832 - logger.info(f"Processing Batch {batch_num} ({len(batch_data)} items)") 1155 + logger.info(
  1156 + "Processing Batch %s (%s items, kind=%s)",
  1157 + batch_num,
  1158 + len(batch_data),
  1159 + analysis_kind,
  1160 + )
833 1161
834 # 创建提示词 1162 # 创建提示词
835 shared_context, user_prompt, assistant_prefix = create_prompt( 1163 shared_context, user_prompt, assistant_prefix = create_prompt(
836 batch_data, 1164 batch_data,
837 target_lang=target_lang, 1165 target_lang=target_lang,
  1166 + analysis_kind=analysis_kind,
  1167 + category_taxonomy_profile=category_taxonomy_profile,
838 ) 1168 )
839 1169
840 # 如果提示词创建失败(例如不支持的 target_lang),本次批次整体失败,不再继续调用 LLM 1170 # 如果提示词创建失败(例如不支持的 target_lang),本次批次整体失败,不再继续调用 LLM
841 if shared_context is None or user_prompt is None or assistant_prefix is None: 1171 if shared_context is None or user_prompt is None or assistant_prefix is None:
842 logger.error( 1172 logger.error(
843 - "Failed to create prompt for batch %s, target_lang=%s; " 1173 + "Failed to create prompt for batch %s, kind=%s, target_lang=%s; "
844 "marking entire batch as failed without calling LLM", 1174 "marking entire batch as failed without calling LLM",
845 batch_num, 1175 batch_num,
  1176 + analysis_kind,
846 target_lang, 1177 target_lang,
847 ) 1178 )
848 return [ 1179 return [
849 _make_empty_analysis_result( 1180 _make_empty_analysis_result(
850 item, 1181 item,
851 target_lang, 1182 target_lang,
  1183 + schema,
852 error=f"prompt_creation_failed: unsupported target_lang={target_lang}", 1184 error=f"prompt_creation_failed: unsupported target_lang={target_lang}",
853 ) 1185 )
854 for item in batch_data 1186 for item in batch_data
@@ -861,11 +1193,23 @@ def process_batch( @@ -861,11 +1193,23 @@ def process_batch(
861 user_prompt, 1193 user_prompt,
862 assistant_prefix, 1194 assistant_prefix,
863 target_lang=target_lang, 1195 target_lang=target_lang,
  1196 + analysis_kind=analysis_kind,
864 ) 1197 )
865 1198
866 # 解析结果 1199 # 解析结果
867 - parsed_results = parse_markdown_table(raw_response)  
868 - _log_parsed_result_quality(batch_data, parsed_results, target_lang, batch_num) 1200 + parsed_results = parse_markdown_table(
  1201 + raw_response,
  1202 + analysis_kind=analysis_kind,
  1203 + category_taxonomy_profile=category_taxonomy_profile,
  1204 + )
  1205 + _log_parsed_result_quality(
  1206 + batch_data,
  1207 + parsed_results,
  1208 + target_lang,
  1209 + batch_num,
  1210 + analysis_kind,
  1211 + category_taxonomy_profile,
  1212 + )
869 1213
870 logger.info(f"\nParsed Results ({len(parsed_results)} items):") 1214 logger.info(f"\nParsed Results ({len(parsed_results)} items):")
871 logger.info(json.dumps(parsed_results, ensure_ascii=False, indent=2)) 1215 logger.info(json.dumps(parsed_results, ensure_ascii=False, indent=2))
@@ -879,10 +1223,12 @@ def process_batch( @@ -879,10 +1223,12 @@ def process_batch(
879 parsed_item, 1223 parsed_item,
880 product=source_product, 1224 product=source_product,
881 target_lang=target_lang, 1225 target_lang=target_lang,
  1226 + schema=schema,
882 ) 1227 )
883 results_with_ids.append(result) 1228 results_with_ids.append(result)
884 logger.info( 1229 logger.info(
885 - "Mapped: seq=%s -> original_id=%s", 1230 + "Mapped: kind=%s seq=%s -> original_id=%s",
  1231 + analysis_kind,
886 parsed_item.get("seq_no"), 1232 parsed_item.get("seq_no"),
887 source_product.get("id"), 1233 source_product.get("id"),
888 ) 1234 )
@@ -890,6 +1236,7 @@ def process_batch( @@ -890,6 +1236,7 @@ def process_batch(
890 # 保存批次 JSON 日志到独立文件 1236 # 保存批次 JSON 日志到独立文件
891 batch_log = { 1237 batch_log = {
892 "batch_num": batch_num, 1238 "batch_num": batch_num,
  1239 + "analysis_kind": analysis_kind,
893 "timestamp": datetime.now().isoformat(), 1240 "timestamp": datetime.now().isoformat(),
894 "input_products": batch_data, 1241 "input_products": batch_data,
895 "raw_response": raw_response, 1242 "raw_response": raw_response,
@@ -900,7 +1247,10 @@ def process_batch( @@ -900,7 +1247,10 @@ def process_batch(
900 1247
901 # 并发写 batch json 日志时,保证文件名唯一避免覆盖 1248 # 并发写 batch json 日志时,保证文件名唯一避免覆盖
902 batch_call_id = uuid.uuid4().hex[:12] 1249 batch_call_id = uuid.uuid4().hex[:12]
903 - batch_log_file = LOG_DIR / f"batch_{batch_num:04d}_{timestamp}_{batch_call_id}.json" 1250 + batch_log_file = (
  1251 + LOG_DIR
  1252 + / f"batch_{analysis_kind}_{batch_num:04d}_{timestamp}_{batch_call_id}.json"
  1253 + )
904 with open(batch_log_file, "w", encoding="utf-8") as f: 1254 with open(batch_log_file, "w", encoding="utf-8") as f:
905 json.dump(batch_log, f, ensure_ascii=False, indent=2) 1255 json.dump(batch_log, f, ensure_ascii=False, indent=2)
906 1256
@@ -912,7 +1262,7 @@ def process_batch( @@ -912,7 +1262,7 @@ def process_batch(
912 logger.error(f"Error processing batch {batch_num}: {str(e)}", exc_info=True) 1262 logger.error(f"Error processing batch {batch_num}: {str(e)}", exc_info=True)
913 # 返回空结果,保持ID映射 1263 # 返回空结果,保持ID映射
914 return [ 1264 return [
915 - _make_empty_analysis_result(item, target_lang, error=str(e)) 1265 + _make_empty_analysis_result(item, target_lang, schema, error=str(e))
916 for item in batch_data 1266 for item in batch_data
917 ] 1267 ]
918 1268
@@ -922,6 +1272,8 @@ def analyze_products( @@ -922,6 +1272,8 @@ def analyze_products(
922 target_lang: str = "zh", 1272 target_lang: str = "zh",
923 batch_size: Optional[int] = None, 1273 batch_size: Optional[int] = None,
924 tenant_id: Optional[str] = None, 1274 tenant_id: Optional[str] = None,
  1275 + analysis_kind: str = "content",
  1276 + category_taxonomy_profile: Optional[str] = None,
925 ) -> List[Dict[str, Any]]: 1277 ) -> List[Dict[str, Any]]:
926 """ 1278 """
927 库调用入口:根据输入+语言,返回锚文本及各维度信息。 1279 库调用入口:根据输入+语言,返回锚文本及各维度信息。
@@ -937,6 +1289,10 @@ def analyze_products( @@ -937,6 +1289,10 @@ def analyze_products(
937 if not products: 1289 if not products:
938 return [] 1290 return []
939 1291
  1292 + _get_analysis_schema(
  1293 + analysis_kind,
  1294 + category_taxonomy_profile=category_taxonomy_profile,
  1295 + )
940 results_by_index: List[Optional[Dict[str, Any]]] = [None] * len(products) 1296 results_by_index: List[Optional[Dict[str, Any]]] = [None] * len(products)
941 uncached_items: List[Tuple[int, Dict[str, str]]] = [] 1297 uncached_items: List[Tuple[int, Dict[str, str]]] = []
942 1298
@@ -946,11 +1302,16 @@ def analyze_products( @@ -946,11 +1302,16 @@ def analyze_products(
946 uncached_items.append((idx, product)) 1302 uncached_items.append((idx, product))
947 continue 1303 continue
948 1304
949 - cached = _get_cached_anchor_result(product, target_lang) 1305 + cached = _get_cached_analysis_result(
  1306 + product,
  1307 + target_lang,
  1308 + analysis_kind,
  1309 + category_taxonomy_profile=category_taxonomy_profile,
  1310 + )
950 if cached: 1311 if cached:
951 logger.info( 1312 logger.info(
952 f"[analyze_products] Cache hit for title='{title[:50]}...', " 1313 f"[analyze_products] Cache hit for title='{title[:50]}...', "
953 - f"lang={target_lang}" 1314 + f"kind={analysis_kind}, lang={target_lang}"
954 ) 1315 )
955 results_by_index[idx] = cached 1316 results_by_index[idx] = cached
956 continue 1317 continue
@@ -979,9 +1340,15 @@ def analyze_products( @@ -979,9 +1340,15 @@ def analyze_products(
979 for batch_num, batch_slice, batch in batch_jobs: 1340 for batch_num, batch_slice, batch in batch_jobs:
980 logger.info( 1341 logger.info(
981 f"[analyze_products] Processing batch {batch_num}/{total_batches}, " 1342 f"[analyze_products] Processing batch {batch_num}/{total_batches}, "
982 - f"size={len(batch)}, target_lang={target_lang}" 1343 + f"size={len(batch)}, kind={analysis_kind}, target_lang={target_lang}"
  1344 + )
  1345 + batch_results = process_batch(
  1346 + batch,
  1347 + batch_num=batch_num,
  1348 + target_lang=target_lang,
  1349 + analysis_kind=analysis_kind,
  1350 + category_taxonomy_profile=category_taxonomy_profile,
983 ) 1351 )
984 - batch_results = process_batch(batch, batch_num=batch_num, target_lang=target_lang)  
985 1352
986 for (original_idx, product), item in zip(batch_slice, batch_results): 1353 for (original_idx, product), item in zip(batch_slice, batch_results):
987 results_by_index[original_idx] = item 1354 results_by_index[original_idx] = item
@@ -992,7 +1359,13 @@ def analyze_products( @@ -992,7 +1359,13 @@ def analyze_products(
992 # 不缓存错误结果,避免放大临时故障 1359 # 不缓存错误结果,避免放大临时故障
993 continue 1360 continue
994 try: 1361 try:
995 - _set_cached_anchor_result(product, target_lang, item) 1362 + _set_cached_analysis_result(
  1363 + product,
  1364 + target_lang,
  1365 + item,
  1366 + analysis_kind,
  1367 + category_taxonomy_profile=category_taxonomy_profile,
  1368 + )
996 except Exception: 1369 except Exception:
997 # 已在内部记录 warning 1370 # 已在内部记录 warning
998 pass 1371 pass
@@ -1000,10 +1373,11 @@ def analyze_products( @@ -1000,10 +1373,11 @@ def analyze_products(
1000 max_workers = min(CONTENT_UNDERSTANDING_MAX_WORKERS, len(batch_jobs)) 1373 max_workers = min(CONTENT_UNDERSTANDING_MAX_WORKERS, len(batch_jobs))
1001 logger.info( 1374 logger.info(
1002 "[analyze_products] Using ThreadPoolExecutor for uncached batches: " 1375 "[analyze_products] Using ThreadPoolExecutor for uncached batches: "
1003 - "max_workers=%s, total_batches=%s, bs=%s, target_lang=%s", 1376 + "max_workers=%s, total_batches=%s, bs=%s, kind=%s, target_lang=%s",
1004 max_workers, 1377 max_workers,
1005 total_batches, 1378 total_batches,
1006 bs, 1379 bs,
  1380 + analysis_kind,
1007 target_lang, 1381 target_lang,
1008 ) 1382 )
1009 1383
@@ -1013,7 +1387,12 @@ def analyze_products( @@ -1013,7 +1387,12 @@ def analyze_products(
1013 future_by_batch_num: Dict[int, Any] = {} 1387 future_by_batch_num: Dict[int, Any] = {}
1014 for batch_num, _batch_slice, batch in batch_jobs: 1388 for batch_num, _batch_slice, batch in batch_jobs:
1015 future_by_batch_num[batch_num] = executor.submit( 1389 future_by_batch_num[batch_num] = executor.submit(
1016 - process_batch, batch, batch_num=batch_num, target_lang=target_lang 1390 + process_batch,
  1391 + batch,
  1392 + batch_num=batch_num,
  1393 + target_lang=target_lang,
  1394 + analysis_kind=analysis_kind,
  1395 + category_taxonomy_profile=category_taxonomy_profile,
1017 ) 1396 )
1018 1397
1019 # 按 batch_num 回填,确保输出稳定(results_by_index 是按原始 input index 映射的) 1398 # 按 batch_num 回填,确保输出稳定(results_by_index 是按原始 input index 映射的)
@@ -1028,7 +1407,13 @@ def analyze_products( @@ -1028,7 +1407,13 @@ def analyze_products(
1028 # 不缓存错误结果,避免放大临时故障 1407 # 不缓存错误结果,避免放大临时故障
1029 continue 1408 continue
1030 try: 1409 try:
1031 - _set_cached_anchor_result(product, target_lang, item) 1410 + _set_cached_analysis_result(
  1411 + product,
  1412 + target_lang,
  1413 + item,
  1414 + analysis_kind,
  1415 + category_taxonomy_profile=category_taxonomy_profile,
  1416 + )
1032 except Exception: 1417 except Exception:
1033 # 已在内部记录 warning 1418 # 已在内部记录 warning
1034 pass 1419 pass
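The concurrent path above submits one future per batch keyed by `batch_num` and then backfills results in batch order, so output stays deterministic no matter which future completes first. A condensed sketch of that pattern, with a stand-in `process_batch` (the real one takes more arguments, as the diff shows):

```python
from concurrent.futures import ThreadPoolExecutor


def process_batch(batch, batch_num):
    # Stand-in for the real process_batch: one result string per input item.
    return [f"batch{batch_num}:{item}" for item in batch]


batch_jobs = [(1, ["a", "b"]), (2, ["c"]), (3, ["d", "e"])]
results = []
with ThreadPoolExecutor(max_workers=2) as executor:
    future_by_batch_num = {
        batch_num: executor.submit(process_batch, batch, batch_num)
        for batch_num, batch in batch_jobs
    }
    # Collect in batch_num order so ordering never depends on
    # which worker thread finishes first.
    for batch_num in sorted(future_by_batch_num):
        results.extend(future_by_batch_num[batch_num].result())

print(results)
# ['batch1:a', 'batch1:b', 'batch2:c', 'batch3:d', 'batch3:e']
```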
indexer/product_enrich_prompts.py
1 #!/usr/bin/env python3 1 #!/usr/bin/env python3
2 2
3 -from typing import Any, Dict 3 +from typing import Any, Dict, Tuple
4 4
5 SYSTEM_MESSAGE = ( 5 SYSTEM_MESSAGE = (
6 "You are an e-commerce product annotator. " 6 "You are an e-commerce product annotator. "
@@ -33,6 +33,337 @@ Input product list: @@ -33,6 +33,337 @@ Input product list:
33 USER_INSTRUCTION_TEMPLATE = """Please strictly return a Markdown table following the given columns in the specified language. For any column containing multiple values, separate them with commas. Do not add any other explanation. 33 USER_INSTRUCTION_TEMPLATE = """Please strictly return a Markdown table following the given columns in the specified language. For any column containing multiple values, separate them with commas. Do not add any other explanation.
34 Language: {language}""" 34 Language: {language}"""
35 35
  36 +def _taxonomy_field(
  37 + key: str,
  38 + label: str,
  39 + description: str,
  40 + zh_label: str | None = None,
  41 +) -> Dict[str, str]:
  42 + return {
  43 + "key": key,
  44 + "label": label,
  45 + "description": description,
  46 + "zh_label": zh_label or label,
  47 + }
  48 +
  49 +
  50 +def _build_taxonomy_shared_instruction(profile_label: str, fields: Tuple[Dict[str, str], ...]) -> str:
  51 + lines = [
  52 + f"Analyze each input product text and fill the columns below using a {profile_label} attribute taxonomy.",
  53 + "",
  54 + "Output columns:",
  55 + ]
  56 + for idx, field in enumerate(fields, start=1):
  57 + lines.append(f"{idx}. {field['label']}: {field['description']}")
  58 + lines.extend(
  59 + [
  60 + "",
  61 + "Rules:",
  62 + "- Keep the same row order and row count as input.",
  63 + "- Leave blank if not applicable, unmentioned, or unsupported.",
  64 + "- Use concise, standardized ecommerce wording.",
  65 + "- If multiple values, separate with commas.",
  66 + "",
  67 + "Input product list:",
  68 + ]
  69 + )
  70 + return "\n".join(lines)
  71 +
  72 +
  73 +def _make_taxonomy_profile(
  74 + profile_label: str,
  75 + fields: Tuple[Dict[str, str], ...],
  76 +) -> Dict[str, Any]:
  77 + headers = {
  78 + "en": ["No.", *[field["label"] for field in fields]],
  79 + "zh": ["序号", *[field["zh_label"] for field in fields]],
  80 + }
  81 + return {
  82 + "profile_label": profile_label,
  83 + "fields": fields,
  84 + "shared_instruction": _build_taxonomy_shared_instruction(profile_label, fields),
  85 + "markdown_table_headers": headers,
  86 + }
  87 +
  88 +
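As the helpers above show, the per-language table headers are derived entirely from each field's `label` and `zh_label`, so appending a field to a taxonomy tuple extends both header rows automatically. A condensed, runnable illustration of that header construction (field set trimmed to two entries for brevity):

```python
from typing import Dict, Tuple


def taxonomy_field(key, label, description, zh_label=None) -> Dict[str, str]:
    # Mirrors _taxonomy_field: zh_label falls back to the English label.
    return {
        "key": key,
        "label": label,
        "description": description,
        "zh_label": zh_label or label,
    }


fields: Tuple[Dict[str, str], ...] = (
    taxonomy_field("product_type", "Product Type", "concise category label", "品类"),
    taxonomy_field("color", "Color", "specific color name", "主颜色"),
)

# Same header derivation as _make_taxonomy_profile: a sequence-number
# column followed by one column per taxonomy field.
headers = {
    "en": ["No.", *[f["label"] for f in fields]],
    "zh": ["序号", *[f["zh_label"] for f in fields]],
}
print(headers["en"])  # ['No.', 'Product Type', 'Color']
print(headers["zh"])  # ['序号', '品类', '主颜色']
```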
  89 +APPAREL_TAXONOMY_FIELDS = (
  90 + _taxonomy_field("product_type", "Product Type", "concise ecommerce apparel category label, not a full marketing title", "品类"),
  91 + _taxonomy_field("target_gender", "Target Gender", "intended gender only if clearly implied", "目标性别"),
  92 + _taxonomy_field("age_group", "Age Group", "only if clearly implied, e.g. adults, kids, teens, toddlers, babies", "年龄段"),
  93 + _taxonomy_field("season", "Season", "season(s) or all-season suitability only if supported", "适用季节"),
  94 + _taxonomy_field("fit", "Fit", "body closeness, e.g. slim, regular, relaxed, oversized, fitted", "版型"),
  95 + _taxonomy_field("silhouette", "Silhouette", "overall garment shape, e.g. straight, A-line, boxy, tapered, bodycon, wide-leg", "廓形"),
  96 + _taxonomy_field("neckline", "Neckline", "neckline type when applicable, e.g. crew neck, V-neck, hooded, collared, square neck", "领型"),
  97 + _taxonomy_field("sleeve_length_type", "Sleeve Length Type", "sleeve length only, e.g. sleeveless, short sleeve, long sleeve, three-quarter sleeve", "袖长类型"),
  98 + _taxonomy_field("sleeve_style", "Sleeve Style", "sleeve design only, e.g. puff sleeve, raglan sleeve, batwing sleeve, bell sleeve", "袖型"),
  99 + _taxonomy_field("strap_type", "Strap Type", "strap design when applicable, e.g. spaghetti strap, wide strap, halter strap, adjustable strap", "肩带设计"),
  100 + _taxonomy_field("rise_waistline", "Rise / Waistline", "waist placement when applicable, e.g. high rise, mid rise, low rise, empire waist", "腰型"),
  101 + _taxonomy_field("leg_shape", "Leg Shape", "for bottoms only, e.g. straight leg, wide leg, flare leg, tapered leg, skinny leg", "裤型"),
  102 + _taxonomy_field("skirt_shape", "Skirt Shape", "for skirts only, e.g. A-line, pleated, pencil, mermaid", "裙型"),
  103 + _taxonomy_field("length_type", "Length Type", "design length only, not size, e.g. cropped, regular, longline, mini, midi, maxi, ankle length, full length", "长度类型"),
  104 + _taxonomy_field("closure_type", "Closure Type", "fastening method when applicable, e.g. zipper, button, drawstring, elastic waist, hook-and-loop", "闭合方式"),
  105 + _taxonomy_field("design_details", "Design Details", "construction or visual details, e.g. ruched, ruffled, pleated, cut-out, layered, distressed, split hem", "设计细节"),
  106 + _taxonomy_field("fabric", "Fabric", "fabric type only, e.g. denim, knit, chiffon, jersey, fleece, cotton twill", "面料"),
  107 + _taxonomy_field("material_composition", "Material Composition", "fiber content or blend only if stated, e.g. cotton, polyester, spandex, linen blend, 95% cotton 5% elastane", "成分"),
  108 + _taxonomy_field("fabric_properties", "Fabric Properties", "inherent fabric traits, e.g. stretch, breathable, lightweight, soft-touch, water-resistant", "面料特性"),
  109 + _taxonomy_field("clothing_features", "Clothing Features", "product features, e.g. lined, reversible, hooded, packable, padded, pocketed", "服装特征"),
  110 + _taxonomy_field("functional_benefits", "Functional Benefits", "wearer benefits, e.g. moisture-wicking, thermal insulation, UV protection, easy care, supportive compression", "功能"),
  111 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  112 + _taxonomy_field("color_family", "Color Family", "normalized broad retail color group, e.g. black, white, blue, green, red, pink, beige, brown, gray", "色系"),
  113 + _taxonomy_field("print_pattern", "Print / Pattern", "surface pattern when applicable, e.g. solid, striped, plaid, floral, graphic, animal print", "印花 / 图案"),
  114 + _taxonomy_field("occasion_end_use", "Occasion / End Use", "likely use occasion only if supported, e.g. office, casual wear, streetwear, lounge, workout, outdoor", "适用场景"),
  115 + _taxonomy_field("style_aesthetic", "Style Aesthetic", "overall style only if supported, e.g. minimalist, streetwear, athleisure, smart casual, romantic, playful", "风格"),
  116 +)
  117 +
  118 +THREE_C_TAXONOMY_FIELDS = (
  119 + _taxonomy_field("product_type", "Product Type", "concise 3C accessory or peripheral category label", "品类"),
  120 + _taxonomy_field("compatible_device", "Compatible Device / Model", "supported device family, series, model, or form factor when clearly stated", "适配设备 / 型号"),
  121 + _taxonomy_field("connectivity", "Connectivity", "connection method such as wired, wireless, Bluetooth, Wi-Fi, NFC, or 2.4G", "连接方式"),
  122 + _taxonomy_field("interface_port_type", "Interface / Port Type", "relevant connector or port, e.g. USB-C, Lightning, HDMI, AUX, RJ45", "接口 / 端口类型"),
  123 + _taxonomy_field("power_charging", "Power Source / Charging", "charging or power mode, e.g. battery powered, fast charging, rechargeable, plug-in", "供电 / 充电方式"),
  124 + _taxonomy_field("key_features", "Key Features", "primary hardware features such as noise cancelling, foldable, magnetic, backlit, waterproof", "关键特征"),
  125 + _taxonomy_field("material_finish", "Material / Finish", "main material or exterior finish when supported", "材质 / 表面处理"),
  126 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  127 + _taxonomy_field("pack_size", "Pack Size", "unit count or bundle size when stated", "包装规格"),
  128 + _taxonomy_field("use_case", "Use Case", "intended usage such as travel, office, gaming, car, charging, streaming", "使用场景"),
  129 +)
  130 +
  131 +BAGS_TAXONOMY_FIELDS = (
  132 + _taxonomy_field("product_type", "Product Type", "concise bag category such as backpack, tote bag, crossbody bag, luggage, or wallet", "品类"),
  133 + _taxonomy_field("target_gender", "Target Gender", "intended gender only if clearly implied", "目标性别"),
  134 + _taxonomy_field("carry_style", "Carry Style", "how the bag is worn or carried, e.g. handheld, shoulder, crossbody, backpack", "携带方式"),
  135 + _taxonomy_field("size_capacity", "Size / Capacity", "size tier or capacity when supported, e.g. mini, large capacity, 20L", "尺寸 / 容量"),
  136 + _taxonomy_field("material", "Material", "main bag material such as leather, nylon, canvas, PU, straw", "材质"),
  137 + _taxonomy_field("closure_type", "Closure Type", "bag closure such as zipper, flap, buckle, drawstring, magnetic snap", "闭合方式"),
  138 + _taxonomy_field("structure_compartments", "Structure / Compartments", "organizational structure such as multi-pocket, laptop sleeve, card slots, expandable", "结构 / 分层"),
  139 + _taxonomy_field("strap_handle_type", "Strap / Handle Type", "strap or handle design such as chain strap, top handle, adjustable strap", "肩带 / 提手类型"),
  140 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  141 + _taxonomy_field("occasion_end_use", "Occasion / End Use", "likely use such as commute, travel, evening, school, casual", "适用场景"),
  142 +)
  143 +
  144 +PET_SUPPLIES_TAXONOMY_FIELDS = (
  145 + _taxonomy_field("product_type", "Product Type", "concise pet supplies category label", "品类"),
  146 + _taxonomy_field("pet_type", "Pet Type", "target pet such as dog, cat, bird, fish, hamster", "宠物类型"),
  147 + _taxonomy_field("breed_size", "Breed Size", "pet size or breed size when stated, e.g. small breed, large dogs", "体型 / 品种大小"),
  148 + _taxonomy_field("life_stage", "Life Stage", "pet age stage when supported, e.g. puppy, kitten, adult, senior", "成长阶段"),
  149 + _taxonomy_field("material_ingredients", "Material / Ingredients", "main material or ingredient composition when supported", "材质 / 成分"),
  150 + _taxonomy_field("flavor_scent", "Flavor / Scent", "flavor or scent when applicable", "口味 / 气味"),
  151 + _taxonomy_field("key_features", "Key Features", "primary attributes such as interactive, leak-proof, orthopedic, washable, elevated", "关键特征"),
  152 + _taxonomy_field("functional_benefits", "Functional Benefits", "benefits such as dental care, calming, digestion support, joint support", "功能"),
  153 + _taxonomy_field("size_capacity", "Size / Capacity", "size, count, or net content when stated", "尺寸 / 容量"),
  154 + _taxonomy_field("use_scenario", "Use Scenario", "usage such as feeding, training, grooming, travel, indoor play", "使用场景"),
  155 +)
  156 +
  157 +ELECTRONICS_TAXONOMY_FIELDS = (
  158 + _taxonomy_field("product_type", "Product Type", "concise electronics device or component category label", "品类"),
  159 + _taxonomy_field("device_category", "Device Category / Compatibility", "supported platform, component class, or compatible device family when stated", "设备类别 / 兼容性"),
  160 + _taxonomy_field("power_voltage", "Power / Voltage", "power, voltage, wattage, or battery spec when supported", "功率 / 电压"),
  161 + _taxonomy_field("connectivity", "Connectivity", "connection method such as wired, Bluetooth, Wi-Fi, RF, or smart app control", "连接方式"),
  162 + _taxonomy_field("interface_port_type", "Interface / Port Type", "relevant port or interface such as USB-C, AC plug type, HDMI, SATA", "接口 / 端口类型"),
  163 + _taxonomy_field("capacity_storage", "Capacity / Storage", "capacity or storage spec such as 256GB, 2TB, 5000mAh", "容量 / 存储"),
  164 + _taxonomy_field("key_features", "Key Features", "main product features such as touch control, HD display, noise reduction, smart control", "关键特征"),
  165 + _taxonomy_field("material_finish", "Material / Finish", "main housing material or finish when supported", "材质 / 表面处理"),
  166 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  167 + _taxonomy_field("use_case", "Use Case", "intended use such as home entertainment, office, charging, security, repair", "使用场景"),
  168 +)
  169 +
  170 +OUTDOOR_TAXONOMY_FIELDS = (
  171 + _taxonomy_field("product_type", "Product Type", "concise outdoor gear category label", "品类"),
  172 + _taxonomy_field("activity_type", "Activity Type", "primary outdoor activity such as camping, hiking, fishing, climbing, travel", "活动类型"),
  173 + _taxonomy_field("season_weather", "Season / Weather", "season or weather suitability when supported", "适用季节 / 天气"),
  174 + _taxonomy_field("material", "Material", "main material such as aluminum, ripstop nylon, stainless steel, EVA", "材质"),
  175 + _taxonomy_field("capacity_size", "Capacity / Size", "size, length, or capacity when stated", "容量 / 尺寸"),
  176 + _taxonomy_field("protection_resistance", "Protection / Resistance", "resistance or protection such as waterproof, UV resistant, windproof", "防护 / 耐受性"),
  177 + _taxonomy_field("key_features", "Key Features", "primary gear attributes such as foldable, lightweight, insulated, non-slip", "关键特征"),
  178 + _taxonomy_field("portability_packability", "Portability / Packability", "carry or storage trait such as collapsible, compact, ultralight, packable", "便携 / 收纳性"),
  179 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  180 + _taxonomy_field("use_scenario", "Use Scenario", "likely use setting such as campsite, trail, survival kit, beach, picnic", "使用场景"),
  181 +)
  182 +
  183 +HOME_APPLIANCES_TAXONOMY_FIELDS = (
  184 + _taxonomy_field("product_type", "Product Type", "concise home appliance category label", "品类"),
  185 + _taxonomy_field("appliance_category", "Appliance Category", "functional class such as kitchen appliance, cleaning appliance, personal care appliance", "家电类别"),
  186 + _taxonomy_field("power_voltage", "Power / Voltage", "wattage, voltage, plug type, or power supply when supported", "功率 / 电压"),
  187 + _taxonomy_field("capacity_coverage", "Capacity / Coverage", "capacity or coverage metric such as 1.5L, 20L, 40sqm", "容量 / 覆盖范围"),
  188 + _taxonomy_field("control_method", "Control Method", "operation method such as touch, knob, remote, app control", "控制方式"),
  189 + _taxonomy_field("installation_type", "Installation Type", "setup style such as countertop, handheld, portable, wall-mounted, built-in", "安装方式"),
  190 + _taxonomy_field("key_features", "Key Features", "main product features such as timer, steam, HEPA filter, self-cleaning", "关键特征"),
  191 + _taxonomy_field("material_finish", "Material / Finish", "main material or exterior finish when supported", "材质 / 表面处理"),
  192 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  193 + _taxonomy_field("use_scenario", "Use Scenario", "intended use such as cooking, cleaning, grooming, cooling, air treatment", "使用场景"),
  194 +)
  195 +
  196 +HOME_LIVING_TAXONOMY_FIELDS = (
  197 + _taxonomy_field("product_type", "Product Type", "concise home and living category label", "品类"),
  198 + _taxonomy_field("room_placement", "Room / Placement", "intended room or placement such as bedroom, kitchen, bathroom, desktop", "适用空间 / 摆放位置"),
  199 + _taxonomy_field("material", "Material", "main material such as wood, ceramic, cotton, glass, metal", "材质"),
  200 + _taxonomy_field("style", "Style", "home style such as modern, farmhouse, minimalist, boho, Nordic", "风格"),
  201 + _taxonomy_field("size_dimensions", "Size / Dimensions", "size or dimensions when stated", "尺寸 / 规格"),
  202 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  203 + _taxonomy_field("pattern_finish", "Pattern / Finish", "surface pattern or finish such as solid, marble, matte, ribbed", "图案 / 表面处理"),
  204 + _taxonomy_field("key_features", "Key Features", "main product features such as stackable, washable, blackout, space-saving", "关键特征"),
  205 + _taxonomy_field("assembly_installation", "Assembly / Installation", "assembly or installation trait when supported", "组装 / 安装"),
  206 + _taxonomy_field("use_scenario", "Use Scenario", "intended use such as storage, dining, decor, sleep, organization", "使用场景"),
  207 +)
  208 +
  209 +WIGS_TAXONOMY_FIELDS = (
  210 + _taxonomy_field("product_type", "Product Type", "concise wig or hairpiece category label", "品类"),
  211 + _taxonomy_field("hair_material", "Hair Material", "hair material such as human hair, synthetic fiber, heat-resistant fiber", "发丝材质"),
  212 + _taxonomy_field("hair_texture", "Hair Texture", "texture or curl pattern such as straight, body wave, curly, kinky", "发质纹理"),
  213 + _taxonomy_field("hair_length", "Hair Length", "hair length when stated", "发长"),
  214 + _taxonomy_field("hair_color", "Hair Color", "specific hair color or blend when available", "发色"),
  215 + _taxonomy_field("cap_construction", "Cap Construction", "cap type such as full lace, lace front, glueless, U part", "帽网结构"),
  216 + _taxonomy_field("lace_area_part_type", "Lace Area / Part Type", "lace size or part style such as 13x4 lace, middle part, T part", "蕾丝面积 / 分缝类型"),
  217 + _taxonomy_field("density_volume", "Density / Volume", "hair density or fullness when supported", "密度 / 发量"),
  218 + _taxonomy_field("style_bang_type", "Style / Bang Type", "style cue such as bob, pixie, layered, with bangs", "款式 / 刘海类型"),
  219 + _taxonomy_field("occasion_end_use", "Occasion / End Use", "intended use such as daily wear, cosplay, protective style, party", "适用场景"),
  220 +)
  221 +
  222 +BEAUTY_TAXONOMY_FIELDS = (
  223 + _taxonomy_field("product_type", "Product Type", "concise beauty or cosmetics category label", "品类"),
  224 + _taxonomy_field("target_area", "Target Area", "target area such as face, lips, eyes, nails, hair, body", "适用部位"),
  225 + _taxonomy_field("skin_hair_type", "Skin Type / Hair Type", "suitable skin or hair type when supported", "肤质 / 发质"),
  226 + _taxonomy_field("finish_effect", "Finish / Effect", "cosmetic finish or effect such as matte, dewy, volumizing, brightening", "妆效 / 效果"),
  227 + _taxonomy_field("key_ingredients", "Key Ingredients", "notable ingredients when stated", "关键成分"),
  228 + _taxonomy_field("shade_color", "Shade / Color", "specific shade or color when available", "色号 / 颜色"),
  229 + _taxonomy_field("scent", "Scent", "fragrance or scent only when supported", "香味"),
  230 + _taxonomy_field("formulation", "Formulation", "product form such as cream, serum, powder, gel, stick", "剂型 / 形态"),
  231 + _taxonomy_field("functional_benefits", "Functional Benefits", "benefits such as hydration, anti-aging, long-wear, repair, sun protection", "功能"),
  232 + _taxonomy_field("use_scenario", "Use Scenario", "intended use such as daily routine, salon, travel, evening makeup", "使用场景"),
  233 +)
  234 +
  235 +ACCESSORIES_TAXONOMY_FIELDS = (
  236 + _taxonomy_field("product_type", "Product Type", "concise accessory category label such as necklace, watch, belt, hat, or sunglasses", "品类"),
  237 + _taxonomy_field("target_gender", "Target Gender", "intended gender only if clearly implied", "目标性别"),
  238 + _taxonomy_field("material", "Material", "main material such as alloy, leather, stainless steel, acetate, fabric", "材质"),
  239 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  240 + _taxonomy_field("pattern_finish", "Pattern / Finish", "surface treatment or style finish such as polished, textured, braided, rhinestone", "图案 / 表面处理"),
  241 + _taxonomy_field("closure_fastening", "Closure / Fastening", "fastening method when applicable", "闭合 / 固定方式"),
  242 + _taxonomy_field("size_fit", "Size / Fit", "size or fit information such as adjustable, one size, 42mm", "尺寸 / 适配"),
  243 + _taxonomy_field("style", "Style", "style cue such as minimalist, vintage, statement, sporty", "风格"),
  244 + _taxonomy_field("occasion_end_use", "Occasion / End Use", "likely use such as daily wear, formal, party, travel, sun protection", "适用场景"),
  245 + _taxonomy_field("set_pack_size", "Set / Pack Size", "set count or pack size when stated", "套装 / 规格"),
  246 +)
  247 +
  248 +TOYS_TAXONOMY_FIELDS = (
  249 + _taxonomy_field("product_type", "Product Type", "concise toy category label", "品类"),
  250 + _taxonomy_field("age_group", "Age Group", "intended age group when clearly implied", "年龄段"),
  251 + _taxonomy_field("character_theme", "Character / Theme", "licensed character, theme, or play theme when supported", "角色 / 主题"),
  252 + _taxonomy_field("material", "Material", "main toy material such as plush, plastic, wood, silicone", "材质"),
  253 + _taxonomy_field("power_source", "Power Source", "battery, rechargeable, wind-up, or non-powered when supported", "供电方式"),
  254 + _taxonomy_field("interactive_features", "Interactive Features", "interactive functions such as sound, lights, remote control, motion", "互动功能"),
  255 + _taxonomy_field("educational_play_value", "Educational / Play Value", "play value such as STEM, pretend play, sensory, puzzle solving", "教育 / 可玩性"),
  256 + _taxonomy_field("piece_count_size", "Piece Count / Size", "piece count or size when stated", "件数 / 尺寸"),
  257 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  258 + _taxonomy_field("use_scenario", "Use Scenario", "intended use such as indoor play, bath time, party favor, outdoor play", "使用场景"),
  259 +)
  260 +
  261 +SHOES_TAXONOMY_FIELDS = (
  262 + _taxonomy_field("product_type", "Product Type", "concise footwear category label", "品类"),
  263 + _taxonomy_field("target_gender", "Target Gender", "intended gender only if clearly implied", "目标性别"),
  264 + _taxonomy_field("age_group", "Age Group", "only if clearly implied", "年龄段"),
  265 + _taxonomy_field("closure_type", "Closure Type", "fastening method such as lace-up, slip-on, buckle, hook-and-loop", "闭合方式"),
  266 + _taxonomy_field("toe_shape", "Toe Shape", "toe shape when applicable, e.g. round toe, pointed toe, open toe", "鞋头形状"),
  267 + _taxonomy_field("heel_sole_type", "Heel Height / Sole Type", "heel or sole profile such as flat, block heel, wedge, platform, thick sole", "跟高 / 鞋底类型"),
  268 + _taxonomy_field("upper_material", "Upper Material", "main upper material such as leather, knit, canvas, mesh", "鞋面材质"),
  269 + _taxonomy_field("lining_insole_material", "Lining / Insole Material", "lining or insole material when supported", "里料 / 鞋垫材质"),
  270 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  271 + _taxonomy_field("occasion_end_use", "Occasion / End Use", "likely use such as running, casual, office, hiking, formal", "适用场景"),
  272 +)
  273 +
  274 +SPORTS_TAXONOMY_FIELDS = (
  275 + _taxonomy_field("product_type", "Product Type", "concise sports product category label", "品类"),
  276 + _taxonomy_field("sport_activity", "Sport / Activity", "primary sport or activity such as fitness, yoga, basketball, cycling, swimming", "运动 / 活动"),
  277 + _taxonomy_field("skill_level", "Skill Level", "target user level when supported, e.g. beginner, training, professional", "适用水平"),
  278 + _taxonomy_field("material", "Material", "main material such as EVA, carbon fiber, neoprene, latex", "材质"),
  279 + _taxonomy_field("size_capacity", "Size / Capacity", "size, weight, resistance level, or capacity when stated", "尺寸 / 容量"),
  280 + _taxonomy_field("protection_support", "Protection / Support", "support or protection function such as ankle support, shock absorption, impact protection", "防护 / 支撑"),
  281 + _taxonomy_field("key_features", "Key Features", "main features such as anti-slip, adjustable, foldable, quick-dry", "关键特征"),
  282 + _taxonomy_field("power_source", "Power Source", "battery, electric, or non-powered when applicable", "供电方式"),
  283 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  284 + _taxonomy_field("use_scenario", "Use Scenario", "intended use such as gym, home workout, field training, competition", "使用场景"),
  285 +)
  286 +
  287 +OTHERS_TAXONOMY_FIELDS = (
  288 + _taxonomy_field("product_type", "Product Type", "concise product category label, not a full marketing title", "品类"),
  289 + _taxonomy_field("product_category", "Product Category", "broader retail grouping when the specific product type is narrow", "商品类别"),
  290 + _taxonomy_field("target_user", "Target User", "intended user, audience, or recipient when clearly implied", "适用人群"),
  291 + _taxonomy_field("material_ingredients", "Material / Ingredients", "main material or ingredients when supported", "材质 / 成分"),
  292 + _taxonomy_field("key_features", "Key Features", "primary product attributes or standout features", "关键特征"),
  293 + _taxonomy_field("functional_benefits", "Functional Benefits", "practical benefits or performance advantages when supported", "功能"),
  294 + _taxonomy_field("size_capacity", "Size / Capacity", "size, count, weight, or capacity when stated", "尺寸 / 容量"),
  295 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  296 + _taxonomy_field("style_theme", "Style / Theme", "overall style, design theme, or visual direction when supported", "风格 / 主题"),
  297 + _taxonomy_field("use_scenario", "Use Scenario", "likely use occasion or application setting when supported", "使用场景"),
  298 +)
  299 +
  300 +CATEGORY_TAXONOMY_PROFILES: Dict[str, Dict[str, Any]] = {
  301 + "apparel": _make_taxonomy_profile(
  302 + "apparel",
  303 + APPAREL_TAXONOMY_FIELDS,
  304 + ),
  305 + "3c": _make_taxonomy_profile(
  306 + "3C",
  307 + THREE_C_TAXONOMY_FIELDS,
  308 + ),
  309 + "bags": _make_taxonomy_profile(
  310 + "bags",
  311 + BAGS_TAXONOMY_FIELDS,
  312 + ),
  313 + "pet_supplies": _make_taxonomy_profile(
  314 + "pet supplies",
  315 + PET_SUPPLIES_TAXONOMY_FIELDS,
  316 + ),
  317 + "electronics": _make_taxonomy_profile(
  318 + "electronics",
  319 + ELECTRONICS_TAXONOMY_FIELDS,
  320 + ),
  321 + "outdoor": _make_taxonomy_profile(
  322 + "outdoor products",
  323 + OUTDOOR_TAXONOMY_FIELDS,
  324 + ),
  325 + "home_appliances": _make_taxonomy_profile(
  326 + "home appliances",
  327 + HOME_APPLIANCES_TAXONOMY_FIELDS,
  328 + ),
  329 + "home_living": _make_taxonomy_profile(
  330 + "home and living",
  331 + HOME_LIVING_TAXONOMY_FIELDS,
  332 + ),
  333 + "wigs": _make_taxonomy_profile(
  334 + "wigs",
  335 + WIGS_TAXONOMY_FIELDS,
  336 + ),
  337 + "beauty": _make_taxonomy_profile(
  338 + "beauty and cosmetics",
  339 + BEAUTY_TAXONOMY_FIELDS,
  340 + ),
  341 + "accessories": _make_taxonomy_profile(
  342 + "accessories",
  343 + ACCESSORIES_TAXONOMY_FIELDS,
  344 + ),
  345 + "toys": _make_taxonomy_profile(
  346 + "toys",
  347 + TOYS_TAXONOMY_FIELDS,
  348 + ),
  349 + "shoes": _make_taxonomy_profile(
  350 + "shoes",
  351 + SHOES_TAXONOMY_FIELDS,
  352 + ),
  353 + "sports": _make_taxonomy_profile(
  354 + "sports products",
  355 + SPORTS_TAXONOMY_FIELDS,
  356 + ),
  357 + "others": _make_taxonomy_profile(
  358 + "general merchandise",
  359 + OTHERS_TAXONOMY_FIELDS,
  360 + ),
  361 +}
  362 +
  363 +TAXONOMY_SHARED_ANALYSIS_INSTRUCTION = CATEGORY_TAXONOMY_PROFILES["apparel"]["shared_instruction"]
  364 +TAXONOMY_MARKDOWN_TABLE_HEADERS_EN = CATEGORY_TAXONOMY_PROFILES["apparel"]["markdown_table_headers"]["en"]
  365 +TAXONOMY_LANGUAGE_MARKDOWN_TABLE_HEADERS: Dict[str, Dict[str, Any]] = CATEGORY_TAXONOMY_PROFILES["apparel"]["markdown_table_headers"]
  366 +
36 LANGUAGE_MARKDOWN_TABLE_HEADERS: Dict[str, Dict[str, Any]] = { 367 LANGUAGE_MARKDOWN_TABLE_HEADERS: Dict[str, Dict[str, Any]] = {
37 "en": [ 368 "en": [
38 "No.", 369 "No.",
indexer/product_enrich模块说明.md 0 → 100644
@@ -0,0 +1,173 @@ @@ -0,0 +1,173 @@
  1 +# Content Enrichment Module Overview
  2 +
  3 +This document describes the responsibilities, entry points, and output structure of the product content enrichment module, along with the design constraints of the current taxonomy profiles.
  4 +
  5 +## 1. Module goals
  6 +
  7 +The content enrichment module calls an LLM on product text to generate the following index fields:
  8 +
  9 +- `qanchors`
  10 +- `enriched_tags`
  11 +- `enriched_attributes`
  12 +- `enriched_taxonomy_attributes`
  13 +
  14 +Design principles the module follows:
  15 +
  16 +- Single responsibility: content understanding and structured output only; no CSV reading or writing
  17 +- Output aligned with the ES mapping: the returned structure can be written directly into `search_products`
  18 +- Configuration-driven extension: taxonomy profiles grow through data configuration, not scattered conditional branches
  19 +- Lean code: built for the normal usage path only, avoiding patch logic stacked up to tolerate unreasonable call patterns
  20 +
  21 +## 2. Key files
  22 +
  23 +- [product_enrich.py](/data/saas-search/indexer/product_enrich.py)
  24 + Runtime main logic: batching, caching, prompt assembly, LLM calls, markdown parsing, and output shaping
  25 +- [product_enrich_prompts.py](/data/saas-search/indexer/product_enrich_prompts.py)
  26 + Prompt templates and taxonomy profile configuration
  27 +- [document_transformer.py](/data/saas-search/indexer/document_transformer.py)
  28 + Calls the content enrichment module from the internal index-building pipeline and writes the results back into the ES doc
  29 +- [taxonomy.md](/data/saas-search/indexer/taxonomy.md)
  30 + Taxonomy design notes and field inventory
  31 +
  32 +## 3. Public entry points
  33 +
  34 +### 3.1 Python entry point
  35 +
  36 +Core entry point:
  37 +
  38 +```python
  39 +build_index_content_fields(
  40 +    items,
  41 +    tenant_id=None,
  42 +    enrichment_scopes=None,
  43 +    category_taxonomy_profile=None,
  44 +)
  45 +```
  46 +
  47 +Minimal input requirements:
  48 +
  49 +- `id` or `spu_id`
  50 +- `title`
  51 +
  52 +Optional inputs:
  53 +
  54 +- `brief`
  55 +- `description`
  56 +- `image_url`
  57 +
  58 +Key parameters:
  59 +
  60 +- `enrichment_scopes`
  61 + Accepts `generic` and `category_taxonomy`
  62 +- `category_taxonomy_profile`
  63 + Taxonomy profile; defaults to `apparel`
  64 +
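A hypothetical pre-flight check mirroring the minimal input contract above (`id` or `spu_id`, plus a non-empty `title`); the function name is illustrative and is not part of the module:

```python
def validate_enrich_items(items):
    """Split items into (valid, rejected) by the minimal input contract:
    each item needs an `id` or `spu_id` plus a non-empty `title`."""
    valid, rejected = [], []
    for item in items:
        has_id = bool(item.get("id") or item.get("spu_id"))
        has_title = bool(str(item.get("title") or "").strip())
        (valid if has_id and has_title else rejected).append(item)
    return valid, rejected
```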
  65 +### 3.2 HTTP entry point
  66 +
  67 +API route:
  68 +
  69 +- `POST /indexer/enrich-content`
  70 +
  71 +Related documentation:
  72 +
  73 +- [搜索API对接指南-05-索引接口(Indexer)](/data/saas-search/docs/搜索API对接指南-05-索引接口(Indexer).md)
  74 +- [搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation)](/data/saas-search/docs/搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md)
  75 +
  76 +## 4. Output structure
  77 +
  78 +The returned result is aligned with the ES mapping:
  79 +
  80 +```json
  81 +{
  82 +  "id": "223167",
  83 +  "qanchors": {
  84 +    "zh": ["短袖T恤", "纯棉"],
  85 +    "en": ["t-shirt", "cotton"]
  86 +  },
  87 +  "enriched_tags": {
  88 +    "zh": ["短袖", "纯棉"],
  89 +    "en": ["short sleeve", "cotton"]
  90 +  },
  91 +  "enriched_attributes": [
  92 +    {
  93 +      "name": "enriched_tags",
  94 +      "value": {
  95 +        "zh": ["短袖", "纯棉"],
  96 +        "en": ["short sleeve", "cotton"]
  97 +      }
  98 +    }
  99 +  ],
  100 +  "enriched_taxonomy_attributes": [
  101 +    {
  102 +      "name": "Product Type",
  103 +      "value": {
  104 +        "zh": ["T恤"],
  105 +        "en": ["t-shirt"]
  106 +      }
  107 +    }
  108 +  ]
  109 +}
  110 +```
  111 +
  112 +Notes:
  113 +
  114 +- The `generic` part always outputs the core index languages `zh` and `en`
  115 +- The `taxonomy` part likewise always outputs `zh` and `en`
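A minimal shape check, sketched against the example document above (illustrative only, not the module's actual validation):

```python
def check_enrichment_doc(doc):
    # Sketch: verify each taxonomy attribute carries a non-empty name
    # plus zh/en value lists, matching the ES-aligned structure above.
    for attr in doc.get("enriched_taxonomy_attributes", []):
        assert isinstance(attr.get("name"), str) and attr["name"]
        value = attr.get("value") or {}
        assert isinstance(value.get("zh"), list)
        assert isinstance(value.get("en"), list)
    return True
```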
  116 +
  117 +## 5. Taxonomy profile
  118 +
  119 +Currently supported profiles:
  120 +
  121 +- `apparel`
  122 +- `3c`
  123 +- `bags`
  124 +- `pet_supplies`
  125 +- `electronics`
  126 +- `outdoor`
  127 +- `home_appliances`
  128 +- `home_living`
  129 +- `wigs`
  130 +- `beauty`
  131 +- `accessories`
  132 +- `toys`
  133 +- `shoes`
  134 +- `sports`
  135 +- `others`
  136 +
  137 +Shared constraints:
  138 +
  139 +- Every profile returns `zh` + `en`
  140 +- A profile only determines the taxonomy field set; it no longer determines output languages
  141 +- Every profile configures both Chinese and English field names, keeping the prompt/header structure consistent
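A hypothetical resolver over the slug set above; `resolve_profile` is an illustrative helper, not part of the module, and its fallback mirrors the pipeline's current fixed default of `apparel`:

```python
SUPPORTED_PROFILES = frozenset({
    "apparel", "3c", "bags", "pet_supplies", "electronics",
    "outdoor", "home_appliances", "home_living", "wigs", "beauty",
    "accessories", "toys", "shoes", "sports", "others",
})

def resolve_profile(slug):
    # Normalize and validate a profile slug; fall back to the current
    # default "apparel" when the slug is missing or unknown.
    normalized = (slug or "").strip().lower()
    return normalized if normalized in SUPPORTED_PROFILES else "apparel"
```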
  142 +
  143 +## 6. Current constraint in the internal indexing pipeline
  144 +
  145 +In the internal ES document-building pipeline, `document_transformer` currently passes a fixed taxonomy profile when invoking content enrichment:
  146 +
  147 +```python
  148 +category_taxonomy_profile="apparel"
  149 +```
  150 +
  151 +This is a temporary strategy chosen for being explicit, controllable, and cleaner in code.
  152 +
  153 +A TODO is kept in the code:
  154 +
  155 +- Later, read the tenant's actual industry from the database
  156 +- Then replace the fixed `apparel` with that industry
  157 +
  158 +For now there is no implicit "guess the profile from product category text" logic, which would add redundant code and unnecessary uncertainty.
  159 +
  160 +## 7. Caching and batching
  161 +
  162 +The cache key is derived from all of the following:
  163 +
  164 +- `analysis_kind`
  165 +- `target_lang`
  166 +- the prompt/schema version fingerprint
  167 +- the actual prompt input text
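One way such a composite cache key could be computed (illustrative sketch; the module's real key derivation may differ):

```python
import hashlib

def make_cache_key(analysis_kind, target_lang, version_fingerprint, prompt_text):
    # Join all key components with an unprintable separator so that no
    # pair of inputs can collide via concatenation, then hash the result.
    raw = "\x1f".join([analysis_kind, target_lang, version_fingerprint, prompt_text])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

Because the version fingerprint participates in the key, bumping the prompt or schema version automatically invalidates all older cache entries.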
  168 +
  169 +Batching rules:
  170 +
  171 +- At most 20 items per LLM call
  172 +- Callers may pass larger batches; the module splits them internally
  173 +- Uncached batches can run concurrently
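The batch-splitting rule above can be sketched as (illustrative helper, not the module's actual implementation):

```python
def split_batches(items, max_batch_size=20):
    # Cut an arbitrarily large caller batch into chunks of at most
    # `max_batch_size` items; each chunk maps to one LLM call.
    return [items[i:i + max_batch_size]
            for i in range(0, len(items), max_batch_size)]
```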
indexer/taxonomy.md 0 → 100644
@@ -0,0 +1,196 @@ @@ -0,0 +1,196 @@
  1 +
  2 +# Cross-Border E-commerce Core Categories 大类
  3 +
  4 +## 1. 3C
  5 +Phone accessories, computer peripherals, smart wearables, audio & video, smart home, gaming gear. 手机配件、电脑周边、智能穿戴、影音娱乐、智能家居、游戏设备。
  6 +
  7 +## 2. Bags 包
  8 +Handbags, backpacks, wallets, luggage, crossbody bags, tote bags. 手提包、双肩包、钱包、行李箱、斜挎包、托特包。
  9 +
  10 +## 3. Pet Supplies 宠物用品
  11 +Pet food, pet toys, pet care products, pet grooming, pet clothing, smart pet devices. 宠物食品、宠物玩具、宠物护理用品、宠物美容、宠物服装、智能宠物设备。
  12 +
  13 +## 4. Electronics 电子产品
  14 +Consumer electronics, home appliances, digital devices, cables & chargers, batteries, electronic components. 消费电子产品、家用电器、数码设备、线材充电器、电池、电子元器件。
  15 +
  16 +## 5. Clothing 服装
  17 +Women's wear, men's wear, kid's wear, underwear, outerwear, activewear. 女装、男装、童装、内衣、外套、运动服装。
  18 +
  19 +## 6. Outdoor 户外用品
  20 +Camping gear, hiking equipment, fishing supplies, outdoor clothing, travel accessories, survival tools. 露营装备、徒步用品、渔具、户外服装、旅行配件、求生工具。
  21 +
  22 +## 7. Home Appliances 家电/电器
  23 +Kitchen appliances, cleaning appliances, personal care appliances, heating & cooling, smart home devices. 厨房电器、清洁电器、个护电器、冷暖设备、智能家居设备。
  24 +
  25 +## 8. Home & Living 家居
  26 +Furniture, home textiles, lighting, kitchenware, storage, home decor. 家具、家纺、灯具、厨具、收纳、家居装饰。
  27 +
  28 +## 9. Wigs 假发
  29 +
  30 +## 10. Beauty & Cosmetics 美容美妆
  31 +Skincare, makeup, nail care, beauty tools, hair care, fragrances. 护肤品、彩妆、美甲、美容工具、护发、香水。
  32 +
  33 +## 11. Accessories 配饰
  34 +Jewelry, watches, belts, scarves, hats, sunglasses, hair accessories. 珠宝、手表、腰带、围巾、帽子、太阳镜、发饰。
  35 +
  36 +## 12. Toys 玩具
  37 +Educational toys, plush toys, action figures, puzzles, outdoor toys, DIY toys. 益智玩具、毛绒玩具、可动人偶、拼图、户外玩具、DIY玩具。
  38 +
  39 +## 13. Shoes 鞋子
  40 +Sneakers, boots, sandals, heels, flats, sports shoes. 运动鞋、靴子、凉鞋、高跟鞋、平底鞋、球鞋。
  41 +
  42 +## 14. Sports 运动产品
  43 +Fitness equipment, sports gear, team sports, racquet sports, water sports, cycling. 健身器材、运动装备、团队运动、球拍运动、水上运动、骑行。
  44 +
  45 +## 15. Others 其他
  46 +
  47 +# Taxonomy for each top-level category
  48 +## 1. Clothing & Apparel 服装
  49 +
  50 +### A. Product Classification
  51 +
  52 +| 一级层级 | 中文列名 | English Column Name |
  53 +| ------------------------- | ---- | ------------------- |
  54 +| A. Product Classification | 品类 | Product Type |
  55 +| A. Product Classification | 目标性别 | Target Gender |
  56 +| A. Product Classification | 年龄段 | Age Group |
  57 +| A. Product Classification | 适用季节 | Season |
  58 +
  59 +### B. Garment Design
  60 +
  61 +| 一级层级 | 中文列名 | English Column Name |
  62 +| ----------------- | ---- | ------------------- |
  63 +| B. Garment Design | 版型 | Fit |
  64 +| B. Garment Design | 廓形 | Silhouette |
  65 +| B. Garment Design | 领型 | Neckline |
  66 +| B. Garment Design | 袖型 | Sleeve Style |
  67 +| B. Garment Design | 肩带设计 | Strap Type |
  68 +| B. Garment Design | 腰型 | Rise / Waistline |
  69 +| B. Garment Design | 裤型 | Leg Shape |
  70 +| B. Garment Design | 裙型 | Skirt Shape |
  71 +| B. Garment Design | 长度 | Length Type |
  72 +| B. Garment Design | 闭合方式 | Closure Type |
  73 +| B. Garment Design | 设计细节 | Design Details |
  74 +
  75 +### C. Material & Performance
  76 +
  77 +| 一级层级 | 中文列名 | English Column Name |
  78 +| ------------------------- | ----------- | -------------------- |
  79 +| C. Material & Performance | 面料 | Fabric |
  80 +| C. Material & Performance | 成分 | Material Composition |
  81 +| C. Material & Performance | 面料特性 | Fabric Properties |
  82 +| C. Material & Performance | 服装特征 / 功能细节 | Clothing Features |
  83 +| C. Material & Performance | 功能 | Functional Benefits |
  84 +
  85 +### D. Merchandising Attributes
  86 +
  87 +| 一级层级 | 中文列名 | English Column Name |
  88 +| --------------------------- | ------- | ------------------- |
  89 +| D. Merchandising Attributes | 主颜色 | Color |
  90 +| D. Merchandising Attributes | 色系 | Color Family |
  91 +| D. Merchandising Attributes | 印花 / 图案 | Print / Pattern |
  92 +| D. Merchandising Attributes | 适用场景 | Occasion / End Use |
  93 +| D. Merchandising Attributes | 风格 | Style Aesthetic |
  94 +
  95 +
  96 +
  97 +The following `enriched_taxonomy_attributes` field list is derived from this taxonomy:
  99 +
  100 +```text
  101 +Product Type
  102 +Target Gender
  103 +Age Group
  104 +Season
  105 +Fit
  106 +Silhouette
  107 +Neckline
  108 +Sleeve Length Type
  109 +Sleeve Style
  110 +Strap Type
  111 +Rise / Waistline
  112 +Leg Shape
  113 +Skirt Shape
  114 +Length Type
  115 +Closure Type
  116 +Design Details
  117 +Fabric
  118 +Material Composition
  119 +Fabric Properties
  120 +Clothing Features
  121 +Functional Benefits
  122 +Color
  123 +Color Family
  124 +Print / Pattern
  125 +Occasion / End Use
  126 +Style Aesthetic
  127 +```
  128 +
  129 +Prompt:
  130 +
  131 +```python
  132 +SHARED_ANALYSIS_INSTRUCTION = """
  133 +Analyze each input product text and fill the columns below using an apparel attribute taxonomy.
  134 +
  135 +Output columns:
  136 +1. Product Type: concise ecommerce apparel category label, not a full marketing title
  137 +2. Target Gender: intended gender only if clearly implied
  138 +3. Age Group: only if clearly implied, e.g. adults, kids, teens, toddlers, babies
  139 +4. Season: season(s) or all-season suitability only if supported
  140 +5. Fit: body closeness, e.g. slim, regular, relaxed, oversized, fitted
  141 +6. Silhouette: overall garment shape, e.g. straight, A-line, boxy, tapered, bodycon, wide-leg
  142 +7. Neckline: neckline type when applicable, e.g. crew neck, V-neck, hooded, collared, square neck
  143 +8. Sleeve Length Type: sleeve length only, e.g. sleeveless, short sleeve, long sleeve, three-quarter sleeve
  144 +9. Sleeve Style: sleeve design only, e.g. puff sleeve, raglan sleeve, batwing sleeve, bell sleeve
  145 +10. Strap Type: strap design when applicable, e.g. spaghetti strap, wide strap, halter strap, adjustable strap
  146 +11. Rise / Waistline: waist placement when applicable, e.g. high rise, mid rise, low rise, empire waist
  147 +12. Leg Shape: for bottoms only, e.g. straight leg, wide leg, flare leg, tapered leg, skinny leg
  148 +13. Skirt Shape: for skirts only, e.g. A-line, pleated, pencil, mermaid
  149 +14. Length Type: design length only, not size, e.g. cropped, regular, longline, mini, midi, maxi, ankle length, full length
  150 +15. Closure Type: fastening method when applicable, e.g. zipper, button, drawstring, elastic waist, hook-and-loop
  151 +16. Design Details: construction or visual details, e.g. ruched, ruffled, pleated, cut-out, layered, distressed, split hem
  152 +17. Fabric: fabric type only, e.g. denim, knit, chiffon, jersey, fleece, cotton twill
  153 +18. Material Composition: fiber content or blend only if stated, e.g. cotton, polyester, spandex, linen blend, 95% cotton 5% elastane
  154 +19. Fabric Properties: inherent fabric traits, e.g. stretch, breathable, lightweight, soft-touch, water-resistant
  155 +20. Clothing Features: product features, e.g. lined, reversible, hooded, packable, padded, pocketed
  156 +21. Functional Benefits: wearer benefits, e.g. moisture-wicking, thermal insulation, UV protection, easy care, supportive compression
  157 +22. Color: specific color name when available
  158 +23. Color Family: normalized broad retail color group, e.g. black, white, blue, green, red, pink, beige, brown, gray
  159 +24. Print / Pattern: surface pattern when applicable, e.g. solid, striped, plaid, floral, graphic, animal print
  160 +25. Occasion / End Use: likely use occasion only if supported, e.g. office, casual wear, streetwear, lounge, workout, outdoor
  161 +26. Style Aesthetic: overall style only if supported, e.g. minimalist, streetwear, athleisure, smart casual, romantic, playful
  162 +
  163 +Rules:
  164 +- Keep the same row order and row count as input.
  165 +- Infer only from the provided product text.
  166 +- Leave blank if not applicable or not reasonably supported.
  167 +- Use concise, standardized English ecommerce wording.
  168 +- Do not combine different attribute dimensions in one field.
  169 +- If multiple values are needed, use the delimiter required by the localization setting.
  170 +
  171 +Input product list:
  172 +"""
  173 +```
  174 +
  175 +## 2. Other taxonomy profiles
  176 +
  177 +Notes:
  178 +- All profiles uniformly return `zh` + `en`.
  179 +- The profile slugs in code match the rows below.
  180 +
  181 +| Profile | Core columns (`en`) |
  182 +| --- | --- |
  183 +| `3c` | Product Type, Compatible Device / Model, Connectivity, Interface / Port Type, Power Source / Charging, Key Features, Material / Finish, Color, Pack Size, Use Case |
  184 +| `bags` | Product Type, Target Gender, Carry Style, Size / Capacity, Material, Closure Type, Structure / Compartments, Strap / Handle Type, Color, Occasion / End Use |
  185 +| `pet_supplies` | Product Type, Pet Type, Breed Size, Life Stage, Material / Ingredients, Flavor / Scent, Key Features, Functional Benefits, Size / Capacity, Use Scenario |
  186 +| `electronics` | Product Type, Device Category / Compatibility, Power / Voltage, Connectivity, Interface / Port Type, Capacity / Storage, Key Features, Material / Finish, Color, Use Case |
  187 +| `outdoor` | Product Type, Activity Type, Season / Weather, Material, Capacity / Size, Protection / Resistance, Key Features, Portability / Packability, Color, Use Scenario |
  188 +| `home_appliances` | Product Type, Appliance Category, Power / Voltage, Capacity / Coverage, Control Method, Installation Type, Key Features, Material / Finish, Color, Use Scenario |
  189 +| `home_living` | Product Type, Room / Placement, Material, Style, Size / Dimensions, Color, Pattern / Finish, Key Features, Assembly / Installation, Use Scenario |
  190 +| `wigs` | Product Type, Hair Material, Hair Texture, Hair Length, Hair Color, Cap Construction, Lace Area / Part Type, Density / Volume, Style / Bang Type, Occasion / End Use |
  191 +| `beauty` | Product Type, Target Area, Skin Type / Hair Type, Finish / Effect, Key Ingredients, Shade / Color, Scent, Formulation, Functional Benefits, Use Scenario |
  192 +| `accessories` | Product Type, Target Gender, Material, Color, Pattern / Finish, Closure / Fastening, Size / Fit, Style, Occasion / End Use, Set / Pack Size |
  193 +| `toys` | Product Type, Age Group, Character / Theme, Material, Power Source, Interactive Features, Educational / Play Value, Piece Count / Size, Color, Use Scenario |
  194 +| `shoes` | Product Type, Target Gender, Age Group, Closure Type, Toe Shape, Heel Height / Sole Type, Upper Material, Lining / Insole Material, Color, Occasion / End Use |
  195 +| `sports` | Product Type, Sport / Activity, Skill Level, Material, Size / Capacity, Protection / Support, Key Features, Power Source, Color, Use Scenario |
  196 +| `others` | Product Type, Product Category, Target User, Material / Ingredients, Key Features, Functional Benefits, Size / Capacity, Color, Style / Theme, Use Scenario |
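The profile table above is keyed by the slug used in code. As a quick illustration (hypothetical helper names, not code from this repo), a slug-to-columns map can validate that an enrichment row only uses columns declared for its profile:

```python
# Hypothetical sketch: mirror of one row of the profile table, keyed by slug.
PROFILE_COLUMNS = {
    "toys": [
        "Product Type", "Age Group", "Character / Theme", "Material",
        "Power Source", "Interactive Features", "Educational / Play Value",
        "Piece Count / Size", "Color", "Use Scenario",
    ],
    # ...remaining slugs mirror the rows above
}

def undeclared_columns(profile_slug: str, row: dict) -> list:
    """Return attribute names in `row` that the profile does not declare."""
    allowed = set(PROFILE_COLUMNS[profile_slug])
    return sorted(set(row) - allowed)

row = {"Product Type": "RC Car", "Age Group": "3+", "Battery Life": "2h"}
print(undeclared_columns("toys", row))  # ['Battery Life']
```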
mappings/README.md
@@ -68,6 +68,7 @@ @@ -68,6 +68,7 @@
68 - `option2_values` 68 - `option2_values`
69 - `option3_values` 69 - `option3_values`
70 - `enriched_attributes.value` 70 - `enriched_attributes.value`
  71 +- `enriched_taxonomy_attributes.value`
71 - `specifications.value_text` 72 - `specifications.value_text`
72 73
73 Taking `category_path` and `option*_values` as examples, the core-language ingestion result should include at least: 74 Taking `category_path` and `option*_values` as examples, the core-language ingestion result should include at least:
mappings/generate_search_products_mapping.py
@@ -214,6 +214,11 @@ FIELD_SPECS = [ @@ -214,6 +214,11 @@ FIELD_SPECS = [
214 scalar_field("name", "keyword"), 214 scalar_field("name", "keyword"),
215 text_field("value", "core_language_text_with_keyword"), 215 text_field("value", "core_language_text_with_keyword"),
216 ), 216 ),
  217 + nested_field(
  218 + "enriched_taxonomy_attributes",
  219 + scalar_field("name", "keyword"),
  220 + text_field("value", "core_language_text_with_keyword"),
  221 + ),
217 scalar_field("option1_name", "keyword"), 222 scalar_field("option1_name", "keyword"),
218 scalar_field("option2_name", "keyword"), 223 scalar_field("option2_name", "keyword"),
219 scalar_field("option3_name", "keyword"), 224 scalar_field("option3_name", "keyword"),
mappings/search_products.json
@@ -2116,6 +2116,40 @@ @@ -2116,6 +2116,40 @@
2116 } 2116 }
2117 } 2117 }
2118 }, 2118 },
  2119 + "enriched_taxonomy_attributes": {
  2120 + "type": "nested",
  2121 + "properties": {
  2122 + "name": {
  2123 + "type": "keyword"
  2124 + },
  2125 + "value": {
  2126 + "type": "object",
  2127 + "properties": {
  2128 + "zh": {
  2129 + "type": "text",
  2130 + "analyzer": "index_ik",
  2131 + "search_analyzer": "query_ik",
  2132 + "fields": {
  2133 + "keyword": {
  2134 + "type": "keyword",
  2135 + "normalizer": "lowercase"
  2136 + }
  2137 + }
  2138 + },
  2139 + "en": {
  2140 + "type": "text",
  2141 + "analyzer": "english",
  2142 + "fields": {
  2143 + "keyword": {
  2144 + "type": "keyword",
  2145 + "normalizer": "lowercase"
  2146 + }
  2147 + }
  2148 + }
  2149 + }
  2150 + }
  2151 + }
  2152 + },
2119 "option1_name": { 2153 "option1_name": {
2120 "type": "keyword" 2154 "type": "keyword"
2121 }, 2155 },
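Because `enriched_taxonomy_attributes` is mapped as `nested`, its `name`/`value` pairs must be matched inside a `nested` query rather than via flat field paths. A minimal sketch of such a query body (the helper is illustrative, not code from this repo; the field names follow the mapping above):

```python
def build_taxonomy_attr_query(name: str, text: str, lang: str = "zh") -> dict:
    """Sketch of a nested query body for the mapping above (illustrative helper)."""
    field = f"enriched_taxonomy_attributes.value.{lang}"
    return {
        "query": {
            "nested": {
                "path": "enriched_taxonomy_attributes",
                "query": {
                    "bool": {
                        # exact match on the attribute name keyword
                        "filter": [{"term": {"enriched_taxonomy_attributes.name": name}}],
                        # analyzed match on the localized value
                        "must": [{"match": {field: text}}],
                    }
                },
            }
        }
    }

body = build_taxonomy_attr_query("Color", "红色")
print(body["query"]["nested"]["path"])  # enriched_taxonomy_attributes
```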
perf_reports/20260311/reranker_1000docs/report.md
@@ -34,5 +34,5 @@ Workload profile: @@ -34,5 +34,5 @@ Workload profile:
34 ## Reproduce 34 ## Reproduce
35 35
36 ```bash 36 ```bash
37 -./scripts/benchmark_reranker_1000docs.sh 37 +./benchmarks/reranker/benchmark_reranker_1000docs.sh
38 ``` 38 ```
perf_reports/20260317/translation_local_models/README.md
1 # Local Translation Model Benchmark Report 1 # Local Translation Model Benchmark Report
2 2
3 -Test script: [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py) 3 +Test script: [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py)
4 4
5 Test time: `2026-03-17` 5 Test time: `2026-03-17`
6 6
@@ -67,7 +67,7 @@ To model online search query translation, we reran NLLB with `batch_size=1`. In @@ -67,7 +67,7 @@ To model online search query translation, we reran NLLB with `batch_size=1`. In
67 Command used: 67 Command used:
68 68
69 ```bash 69 ```bash
70 -./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \ 70 +./.venv-translator/bin/python benchmarks/translation/benchmark_translation_local_models.py \
71 --single \ 71 --single \
72 --model nllb-200-distilled-600m \ 72 --model nllb-200-distilled-600m \
73 --source-lang zh \ 73 --source-lang zh \
perf_reports/20260318/nllb_t4_product_names_ct2/README.md
1 # NLLB T4 Product-Name Tuning Summary 1 # NLLB T4 Product-Name Tuning Summary
2 2
3 Test script: 3 Test script:
4 -- [`scripts/benchmark_nllb_t4_tuning.py`](/data/saas-search/scripts/benchmark_nllb_t4_tuning.py) 4 +- [`benchmarks/translation/benchmark_nllb_t4_tuning.py`](/data/saas-search/benchmarks/translation/benchmark_nllb_t4_tuning.py)
5 5
6 This round's reports: 6 This round's reports:
7 - Markdown:[`nllb_t4_tuning_003608.md`](/data/saas-search/perf_reports/20260318/nllb_t4_product_names_ct2/nllb_t4_tuning_003608.md) 7 - Markdown:[`nllb_t4_tuning_003608.md`](/data/saas-search/perf_reports/20260318/nllb_t4_product_names_ct2/nllb_t4_tuning_003608.md)
perf_reports/20260318/translation_local_models/README.md
1 # Local Translation Model Benchmark Report 1 # Local Translation Model Benchmark Report
2 2
3 Test script: 3 Test script:
4 -- [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py) 4 +- [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py)
5 5
6 Full results: 6 Full results:
7 - Markdown:[`translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md) 7 - Markdown:[`translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md)
@@ -39,7 +39,7 @@ @@ -39,7 +39,7 @@
39 39
40 ```bash 40 ```bash
41 cd /data/saas-search 41 cd /data/saas-search
42 -./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \ 42 +./.venv-translator/bin/python benchmarks/translation/benchmark_translation_local_models.py \
43 --suite extended \ 43 --suite extended \
44 --disable-cache \ 44 --disable-cache \
45 --serial-items-per-case 256 \ 45 --serial-items-per-case 256 \
perf_reports/20260318/translation_local_models_ct2/README.md
1 # Local Translation Model Benchmark Report (CTranslate2) 1 # Local Translation Model Benchmark Report (CTranslate2)
2 2
3 Test script: 3 Test script:
4 -- [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py) 4 +- [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py)
5 5
6 This round's CT2 results: 6 This round's CT2 results:
7 - Markdown:[`translation_local_models_ct2_extended_233253.md`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2/translation_local_models_ct2_extended_233253.md) 7 - Markdown:[`translation_local_models_ct2_extended_233253.md`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2/translation_local_models_ct2_extended_233253.md)
@@ -46,7 +46,7 @@ from datetime import datetime @@ -46,7 +46,7 @@ from datetime import datetime
46 from pathlib import Path 46 from pathlib import Path
47 from types import SimpleNamespace 47 from types import SimpleNamespace
48 48
49 -from scripts.benchmark_translation_local_models import ( 49 +from benchmarks.translation.benchmark_translation_local_models import (
50 SCENARIOS, 50 SCENARIOS,
51 benchmark_extended_scenario, 51 benchmark_extended_scenario,
52 build_environment_info, 52 build_environment_info,
perf_reports/20260318/translation_local_models_ct2_focus/README.md
1 # Local Translation Model Focused T4 Tuning 1 # Local Translation Model Focused T4 Tuning
2 2
3 Test script: 3 Test script:
4 -- [`scripts/benchmark_translation_local_models_focus.py`](/data/saas-search/scripts/benchmark_translation_local_models_focus.py) 4 +- [`benchmarks/translation/benchmark_translation_local_models_focus.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models_focus.py)
5 5
6 This round's focused results: 6 This round's focused results:
7 - Markdown:[`translation_local_models_focus_235018.md`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2_focus/translation_local_models_focus_235018.md) 7 - Markdown:[`translation_local_models_focus_235018.md`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2_focus/translation_local_models_focus_235018.md)
perf_reports/README.md
@@ -4,7 +4,7 @@ @@ -4,7 +4,7 @@
4 4
5 | Script | Purpose | 5 | Script | Purpose |
6 |------|------| 6 |------|------|
7 -| `scripts/perf_api_benchmark.py` | HTTP load tests for the search backend, embedding, translation, reranking, etc.; supports `--embed-text-priority` / `--embed-image-priority` and `scripts/perf_cases.json.example` | 7 +| `benchmarks/perf_api_benchmark.py` | HTTP load tests for the search backend, embedding, translation, reranking, etc.; supports `--embed-text-priority` / `--embed-image-priority` and `benchmarks/perf_cases.json.example` |
8 8
9 Historical matrix example (concurrency sweep): 9 Historical matrix example (concurrency sweep):
10 10
@@ -25,10 +25,10 @@ @@ -25,10 +25,10 @@
25 25
26 ```bash 26 ```bash
27 source activate.sh 27 source activate.sh
28 -python scripts/perf_api_benchmark.py --scenario embed_text --duration 8 --concurrency 10 --timeout 30 --output perf_reports/2026-03-20_embed_text_p0.json  
29 -python scripts/perf_api_benchmark.py --scenario embed_text --duration 8 --concurrency 10 --embed-text-priority 1 --output perf_reports/2026-03-20_embed_text_p1.json  
30 -python scripts/perf_api_benchmark.py --scenario embed_image --duration 8 --concurrency 5 --timeout 60 --output perf_reports/2026-03-20_embed_image_p0.json  
31 -python scripts/perf_api_benchmark.py --scenario embed_image --duration 8 --concurrency 5 --embed-image-priority 1 --output perf_reports/2026-03-20_embed_image_p1.json 28 +python benchmarks/perf_api_benchmark.py --scenario embed_text --duration 8 --concurrency 10 --timeout 30 --output perf_reports/2026-03-20_embed_text_p0.json
  29 +python benchmarks/perf_api_benchmark.py --scenario embed_text --duration 8 --concurrency 10 --embed-text-priority 1 --output perf_reports/2026-03-20_embed_text_p1.json
  30 +python benchmarks/perf_api_benchmark.py --scenario embed_image --duration 8 --concurrency 5 --timeout 60 --output perf_reports/2026-03-20_embed_image_p0.json
  31 +python benchmarks/perf_api_benchmark.py --scenario embed_image --duration 8 --concurrency 5 --embed-image-priority 1 --output perf_reports/2026-03-20_embed_image_p1.json
32 ``` 32 ```
33 33
34 Note: this run is an **8-second smoke**; its duration/concurrency are not directly comparable to the `2026-03-12` matrix. It only verifies that the service still returns 200 and payload validation passes under the `priority` parameters. 34 Note: this run is an **8-second smoke**; its duration/concurrency are not directly comparable to the `2026-03-12` matrix. It only verifies that the service still returns 200 and payload validation passes under the `priority` parameters.
perf_reports/reranker_vllm_instruction/2026-03-25/RESULTS.md
@@ -25,7 +25,7 @@ Shared across both backends for this run: @@ -25,7 +25,7 @@ Shared across both backends for this run:
25 25
26 ## Methodology 26 ## Methodology
27 27
28 -- Script: `python scripts/benchmark_reranker_random_titles.py 100,200,400,600,800,1000 --repeat 5` with **`--seed 99`** (see note below), **`--quiet-runs`**, **`--timeout 360`**. 28 +- Script: `python benchmarks/reranker/benchmark_reranker_random_titles.py 100,200,400,600,800,1000 --repeat 5` with **`--seed 99`** (see note below), **`--quiet-runs`**, **`--timeout 360`**.
29 - Titles: default file `/home/ubuntu/rerank_test/titles.1.8w` (one title per line). 29 - Titles: default file `/home/ubuntu/rerank_test/titles.1.8w` (one title per line).
30 - Query: default `健身女生T恤短袖`. 30 - Query: default `健身女生T恤短袖`.
31 - Each scenario: **3 warm-up** requests at `n=400` (not timed), then **5 timed** runs per `n`. 31 - Each scenario: **3 warm-up** requests at `n=400` (not timed), then **5 timed** runs per `n`.
@@ -56,9 +56,9 @@ JSON aggregates (means, stdev, raw `values_ms`): same directory, `qwen3_vllm_{co @@ -56,9 +56,9 @@ JSON aggregates (means, stdev, raw `values_ms`): same directory, `qwen3_vllm_{co
56 ## Tooling added / changed 56 ## Tooling added / changed
57 57
58 - `reranker/server.py`: `/health` includes `instruction_format` when the active backend sets `_instruction_format`. 58 - `reranker/server.py`: `/health` includes `instruction_format` when the active backend sets `_instruction_format`.
59 -- `scripts/benchmark_reranker_random_titles.py`: `--tag`, `--json-summary-out`, `--quiet-runs`.  
60 -- `scripts/patch_rerank_vllm_benchmark_config.py`: surgical YAML patch (preserves newlines).  
61 -- `scripts/run_reranker_vllm_instruction_benchmark.sh`: full matrix driver (continues if a benchmark exits non-zero; uses `--timeout 360`). 59 +- `benchmarks/reranker/benchmark_reranker_random_titles.py`: `--tag`, `--json-summary-out`, `--quiet-runs`.
  60 +- `benchmarks/reranker/patch_rerank_vllm_benchmark_config.py`: surgical YAML patch (preserves newlines).
  61 +- `benchmarks/reranker/run_reranker_vllm_instruction_benchmark.sh`: full matrix driver (continues if a benchmark exits non-zero; uses `--timeout 360`).
62 62
63 --- 63 ---
64 64
@@ -73,7 +73,7 @@ JSON aggregates (means, stdev, raw `values_ms`): same directory, `qwen3_vllm_{co @@ -73,7 +73,7 @@ JSON aggregates (means, stdev, raw `values_ms`): same directory, `qwen3_vllm_{co
73 | Attention | Backend forced / steered attention on T4 (e.g. `TRITON_ATTN` path) | **No** `attention_config` in `LLM(...)`; vLLM **auto** — on this T4 run, logs show **`FLASHINFER`** | 73 | Attention | Backend forced / steered attention on T4 (e.g. `TRITON_ATTN` path) | **No** `attention_config` in `LLM(...)`; vLLM **auto** — on this T4 run, logs show **`FLASHINFER`** |
74 | Config surface | `vllm_attention_backend` / `RERANK_VLLM_ATTENTION_BACKEND`, etc. | **Removed** (fewer YAML/env-var branches; converged logic) | 74 | Config surface | `vllm_attention_backend` / `RERANK_VLLM_ATTENTION_BACKEND`, etc. | **Removed** (fewer YAML/env-var branches; converged logic) |
75 | Code default `instruction_format` | `qwen3_vllm_score` defaults to `standard` | Aligned with `qwen3_vllm` as **`compact`** (YAML can still set `standard`) | 75 | Code default `instruction_format` | `qwen3_vllm_score` defaults to `standard` | Aligned with `qwen3_vllm` as **`compact`** (YAML can still set `standard`) |
76 -| Smoke / startup | — | `scripts/smoke_qwen3_vllm_score_backend.py`; `scripts/start_reranker.sh` puts the **venv `bin` first on `PATH`** (FLASHINFER JIT depends on `ninja` inside the venv) | 76 +| Smoke / startup | — | `benchmarks/reranker/smoke_qwen3_vllm_score_backend.py`; `scripts/start_reranker.sh` puts the **venv `bin` first on `PATH`** (FLASHINFER JIT depends on `ninja` inside the venv) |
77 77
78 Micro-benchmark (same machine, isolated): **~927.5 ms → ~673.1 ms** at **n=400** docs on `LLM.score()` steady state (~**28%**), after removing the forced attention path and letting vLLM pick **FLASHINFER**. 78 Micro-benchmark (same machine, isolated): **~927.5 ms → ~673.1 ms** at **n=400** docs on `LLM.score()` steady state (~**28%**), after removing the forced attention path and letting vLLM pick **FLASHINFER**.
79 79
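The quoted speedup checks out arithmetically:

```python
# Arithmetic check of the reported n=400 steady-state improvement.
before_ms, after_ms = 927.5, 673.1
reduction = (before_ms - after_ms) / before_ms
print(f"{reduction:.1%}")  # 27.4%, consistent with the quoted ~28%
```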
requirements_translator_service.txt
@@ -13,7 +13,8 @@ httpx>=0.24.0 @@ -13,7 +13,8 @@
13 tqdm>=4.65.0 13 tqdm>=4.65.0
14 14
15 torch>=2.0.0 15 torch>=2.0.0
16 -transformers>=4.30.0 16 +# Keep translator conversions on the last verified NLLB-compatible release line.
  17 +transformers>=4.51.0,<4.52.0
17 ctranslate2>=4.7.0 18 ctranslate2>=4.7.0
18 sentencepiece>=0.2.0 19 sentencepiece>=0.2.0
19 sacremoses>=0.1.1 20 sacremoses>=0.1.1
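A minimal sketch of what the new pin accepts, checking a version string against `>=4.51.0,<4.52.0` with a plain tuple comparison (production code would normally use `packaging.specifiers.SpecifierSet` instead):

```python
# Sketch: which transformers versions satisfy the pinned range ">=4.51.0,<4.52.0".
def in_pinned_range(version: str) -> bool:
    parts = tuple(int(p) for p in version.split(".")[:3])
    return (4, 51, 0) <= parts < (4, 52, 0)

print([v for v in ("4.30.0", "4.51.3", "4.52.0") if in_pinned_range(v)])  # ['4.51.3']
```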
reranker/DEPLOYMENT_AND_TUNING.md
@@ -109,7 +109,7 @@ curl -sS http://127.0.0.1:6007/health @@ -109,7 +109,7 @@ curl -sS http://127.0.0.1:6007/health
109 ### 5.1 Using the one-command benchmark script 109 ### 5.1 Using the one-command benchmark script
110 110
111 ```bash 111 ```bash
112 -./scripts/benchmark_reranker_1000docs.sh 112 +./benchmarks/reranker/benchmark_reranker_1000docs.sh
113 ``` 113 ```
114 114
115 Output directory: 115 Output directory:
reranker/GGUF_0_6B_INSTALL_AND_TUNING.md
@@ -144,7 +144,7 @@ qwen3_gguf_06b: @@ -144,7 +144,7 @@ qwen3_gguf_06b:
144 144
145 ```bash 145 ```bash
146 PYTHONPATH=/data/saas-search ./.venv-reranker-gguf/bin/python \ 146 PYTHONPATH=/data/saas-search ./.venv-reranker-gguf/bin/python \
147 - scripts/benchmark_reranker_gguf_local.py --backend-name qwen3_gguf_06b --docs 400 147 + benchmarks/reranker/benchmark_reranker_gguf_local.py --backend-name qwen3_gguf_06b --docs 400
148 ``` 148 ```
149 149
150 Start as a service: 150 Start as a service:
reranker/GGUF_INSTALL_AND_TUNING.md
@@ -117,7 +117,7 @@ HF_HUB_DISABLE_XET=1 @@ -117,7 +117,7 @@ HF_HUB_DISABLE_XET=1
117 117
118 ```bash 118 ```bash
119 PYTHONPATH=/data/saas-search ./.venv-reranker-gguf/bin/python \ 119 PYTHONPATH=/data/saas-search ./.venv-reranker-gguf/bin/python \
120 - scripts/benchmark_reranker_gguf_local.py --docs 64 --repeat 1 120 + benchmarks/reranker/benchmark_reranker_gguf_local.py --docs 64 --repeat 1
121 ``` 121 ```
122 122
123 It instantiates the GGUF backend directly and prints: 123 It instantiates the GGUF backend directly and prints:
@@ -134,7 +134,7 @@ PYTHONPATH=/data/saas-search ./.venv-reranker-gguf/bin/python \ @@ -134,7 +134,7 @@ PYTHONPATH=/data/saas-search ./.venv-reranker-gguf/bin/python \
134 134
135 - Query: `白色oversized T-shirt` 135 - Query: `白色oversized T-shirt`
136 - Docs: `64` product titles 136 - Docs: `64` product titles
137 -- Local script: `scripts/benchmark_reranker_gguf_local.py` 137 +- Local script: `benchmarks/reranker/benchmark_reranker_gguf_local.py`
138 - 1 run per group; the focus is on relative trends 138 - 1 run per group; the focus is on relative trends
139 139
140 Results: 140 Results:
@@ -195,7 +195,7 @@ n_gpu_layers=999 @@ -195,7 +195,7 @@ n_gpu_layers=999
195 195
196 ```bash 196 ```bash
197 RERANK_BASE=http://127.0.0.1:6007 \ 197 RERANK_BASE=http://127.0.0.1:6007 \
198 - ./.venv/bin/python scripts/benchmark_reranker_random_titles.py 64 --repeat 1 --query '白色oversized T-shirt' 198 + ./.venv/bin/python benchmarks/reranker/benchmark_reranker_random_titles.py 64 --repeat 1 --query '白色oversized T-shirt'
199 ``` 199 ```
200 200
201 Output: 201 Output:
@@ -206,7 +206,7 @@ RERANK_BASE=http://127.0.0.1:6007 \ @@ -206,7 +206,7 @@ RERANK_BASE=http://127.0.0.1:6007 \
206 206
207 ```bash 207 ```bash
208 RERANK_BASE=http://127.0.0.1:6007 \ 208 RERANK_BASE=http://127.0.0.1:6007 \
209 - ./.venv/bin/python scripts/benchmark_reranker_random_titles.py 153 --repeat 1 --query '白色oversized T-shirt' 209 + ./.venv/bin/python benchmarks/reranker/benchmark_reranker_random_titles.py 153 --repeat 1 --query '白色oversized T-shirt'
210 ``` 210 ```
211 211
212 Output: 212 Output:
@@ -276,5 +276,5 @@ offload_kqv: true @@ -276,5 +276,5 @@ offload_kqv: true
276 - `config/config.yaml` 276 - `config/config.yaml`
277 - `scripts/setup_reranker_venv.sh` 277 - `scripts/setup_reranker_venv.sh`
278 - `scripts/start_reranker.sh` 278 - `scripts/start_reranker.sh`
279 -- `scripts/benchmark_reranker_gguf_local.py` 279 +- `benchmarks/reranker/benchmark_reranker_gguf_local.py`
280 - `reranker/GGUF_INSTALL_AND_TUNING.md` 280 - `reranker/GGUF_INSTALL_AND_TUNING.md`
reranker/README.md
@@ -46,9 +46,9 @@ Reranker 服务提供统一的 `/rerank` API,支持可插拔后端(BGE、Jin @@ -46,9 +46,9 @@ Reranker 服务提供统一的 `/rerank` API,支持可插拔后端(BGE、Jin
46 - `backends/dashscope_rerank.py`: DashScope cloud rerank backend 46 - `backends/dashscope_rerank.py`: DashScope cloud rerank backend
47 - `scripts/setup_reranker_venv.sh`: creates an isolated venv per backend 47 - `scripts/setup_reranker_venv.sh`: creates an isolated venv per backend
48 - `scripts/start_reranker.sh`: starts the reranker service 48 - `scripts/start_reranker.sh`: starts the reranker service
49 -- `scripts/smoke_qwen3_vllm_score_backend.py`: local smoke for `qwen3_vllm_score`
50 -- `scripts/benchmark_reranker_random_titles.py`: random-title benchmark script
51 -- `scripts/run_reranker_vllm_instruction_benchmark.sh`: historical matrix script 49 +- `benchmarks/reranker/smoke_qwen3_vllm_score_backend.py`: local smoke for `qwen3_vllm_score`
  50 +- `benchmarks/reranker/benchmark_reranker_random_titles.py`: random-title benchmark script
  51 +- `benchmarks/reranker/run_reranker_vllm_instruction_benchmark.sh`: historical matrix script
52 52
53 ## Environment baseline 53 ## Environment baseline
54 54
@@ -118,7 +118,7 @@ nvidia-smi @@ -118,7 +118,7 @@ nvidia-smi
118 ### 4. Smoke 118 ### 4. Smoke
119 119
120 ```bash 120 ```bash
121 -PYTHONPATH=. ./.venv-reranker-score/bin/python scripts/smoke_qwen3_vllm_score_backend.py --gpu-memory-utilization 0.2 121 +PYTHONPATH=. ./.venv-reranker-score/bin/python benchmarks/reranker/smoke_qwen3_vllm_score_backend.py --gpu-memory-utilization 0.2
122 ``` 122 ```
123 123
124 ## `jina_reranker_v3` 124 ## `jina_reranker_v3`
scripts/README.md 0 → 100644
@@ -0,0 +1,59 @@ @@ -0,0 +1,59 @@
  1 +# Scripts
  2 +
  3 +`scripts/` now keeps only the run, ops, environment-setup, and data-processing scripts that are still relevant under the current architecture, split into stable subdirectories by responsibility instead of continuing to pile everything flat in the root.
  4 +
  5 +## Current layout
  6 +
  7 +- Service orchestration
  8 + - `service_ctl.sh`
  9 + - `start_backend.sh`
  10 + - `start_indexer.sh`
  11 + - `start_frontend.sh`
  12 + - `start_eval_web.sh`
  13 + - `start_embedding_service.sh`
  14 + - `start_embedding_text_service.sh`
  15 + - `start_embedding_image_service.sh`
  16 + - `start_reranker.sh`
  17 + - `start_translator.sh`
  18 + - `start_tei_service.sh`
  19 + - `start_cnclip_service.sh`
  20 + - `stop.sh`
  21 + - `stop_tei_service.sh`
  22 + - `stop_cnclip_service.sh`
  23 + - `frontend/`
  24 + - `ops/`
  25 +
  26 +- Environment setup
  27 + - `create_venv.sh`
  28 + - `init_env.sh`
  29 + - `setup_embedding_venv.sh`
  30 + - `setup_reranker_venv.sh`
  31 + - `setup_translator_venv.sh`
  32 + - `setup_cnclip_venv.sh`
  33 +
  34 +- Data and indexing
  35 + - `create_tenant_index.sh`
  36 + - `build_suggestions.sh`
  37 + - `mock_data.sh`
  38 + - `data_import/`
  39 + - `inspect/`
  40 + - `maintenance/`
  41 +
  42 +- Evaluation and specialized tools
  43 + - `evaluation/`
  44 + - `redis/`
  45 + - `debug/`
  46 + - `translation/`
  47 +
  48 +## Migrated
  49 +
  50 +- Benchmark and smoke scripts: moved to `benchmarks/`
  51 +- Manual API trial scripts: moved to `tests/manual/`
  52 +
  53 +## Cleaned up
  54 +
  55 +- Historical backup directory: `indexer__old_2025_11/`
  56 +- Obsolete wrapper script: `start.sh`
  57 +- Conda-era leftover: `install_server_deps.sh`
  58 +
  59 +When adding scripts in the future, put them in a clearly scoped subdirectory first; do not drop benchmarks, manual scripts, or historical backups back into the root of `scripts/`.
scripts/data_import/README.md 0 → 100644
@@ -0,0 +1,13 @@ @@ -0,0 +1,13 @@
  1 +# Data Import Scripts
  2 +
  3 +These scripts convert external product data or CSV/XLSX samples into the Shoplazza import format.
  4 +
  5 +- `amazon_xlsx_to_shoplazza_xlsx.py`
  6 +- `competitor_xlsx_to_shoplazza_xlsx.py`
  7 +- `csv_to_excel.py`
  8 +- `csv_to_excel_multi_variant.py`
  9 +- `shoplazza_excel_template.py`
  10 +- `shoplazza_import_template.py`
  11 +- `tenant3_csv_to_shoplazza_xlsx.sh`
  12 +
  13 +These are offline data-conversion tools, not operational entry points for online services.
scripts/amazon_xlsx_to_shoplazza_xlsx.py renamed to scripts/data_import/amazon_xlsx_to_shoplazza_xlsx.py
@@ -35,9 +35,10 @@ from pathlib import Path @@ -35,9 +35,10 @@ from pathlib import Path
35 35
36 from openpyxl import load_workbook 36 from openpyxl import load_workbook
37 37
38 -# Allow running as `python scripts/xxx.py` without installing as a package  
39 -sys.path.insert(0, str(Path(__file__).resolve().parent))  
40 -from shoplazza_excel_template import create_excel_from_template_fast 38 +REPO_ROOT = Path(__file__).resolve().parents[2]
  39 +sys.path.insert(0, str(REPO_ROOT))
  40 +
  41 +from scripts.data_import.shoplazza_excel_template import create_excel_from_template_fast
41 42
42 43
43 PREFERRED_OPTION_KEYS = [ 44 PREFERRED_OPTION_KEYS = [
@@ -612,4 +613,3 @@ def main(): @@ -612,4 +613,3 @@ def main():
612 if __name__ == "__main__": 613 if __name__ == "__main__":
613 main() 614 main()
614 615
615 -  
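The relocated data-import scripts all switch to the same bootstrap pattern: instead of putting their own directory on `sys.path`, they put the repo root there and use absolute `scripts.data_import.` imports. The idea, generalized as an illustrative helper (not repo code):

```python
from pathlib import Path

def repo_root_for(script_path: str, levels_up: int = 2) -> str:
    """Walk `levels_up` directories above the script,
    e.g. <root>/scripts/data_import/x.py -> <root>."""
    return str(Path(script_path).resolve().parents[levels_up])

root = repo_root_for("/data/saas-search/scripts/data_import/csv_to_excel.py")
print(root)  # /data/saas-search
# Inserting `root` at sys.path[0] makes `from scripts.data_import...` imports
# resolve regardless of the current working directory.
```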
scripts/competitor_xlsx_to_shoplazza_xlsx.py renamed to scripts/data_import/competitor_xlsx_to_shoplazza_xlsx.py
@@ -6,7 +6,7 @@ The input `data/mai_jia_jing_ling/products_data/*.xlsx` files are Amazon-format @@ -6,7 +6,7 @@ The input `data/mai_jia_jing_ling/products_data/*.xlsx` files are Amazon-format
6 (Parent/Child ASIN), not “competitor data”. 6 (Parent/Child ASIN), not “competitor data”.
7 7
8 Please use: 8 Please use:
9 - - `scripts/amazon_xlsx_to_shoplazza_xlsx.py` 9 + - `scripts/data_import/amazon_xlsx_to_shoplazza_xlsx.py`
10 10
11 This wrapper simply forwards all CLI args to the correctly named script, so you 11 This wrapper simply forwards all CLI args to the correctly named script, so you
12 automatically get the latest performance improvements (fast read/write). 12 automatically get the latest performance improvements (fast read/write).
@@ -15,13 +15,12 @@ automatically get the latest performance improvements (fast read/write). @@ -15,13 +15,12 @@ automatically get the latest performance improvements (fast read/write).
15 import sys 15 import sys
16 from pathlib import Path 16 from pathlib import Path
17 17
18 -# Allow running as `python scripts/xxx.py` without installing as a package  
19 -sys.path.insert(0, str(Path(__file__).resolve().parent)) 18 +REPO_ROOT = Path(__file__).resolve().parents[2]
  19 +sys.path.insert(0, str(REPO_ROOT))
20 20
21 -from amazon_xlsx_to_shoplazza_xlsx import main as amazon_main 21 +from scripts.data_import.amazon_xlsx_to_shoplazza_xlsx import main as amazon_main
22 22
23 23
24 if __name__ == "__main__": 24 if __name__ == "__main__":
25 amazon_main() 25 amazon_main()
26 26
27 -  
scripts/csv_to_excel.py renamed to scripts/data_import/csv_to_excel.py
@@ -22,12 +22,12 @@ from openpyxl import load_workbook @@ -22,12 +22,12 @@ from openpyxl import load_workbook
22 from openpyxl.styles import Font, Alignment 22 from openpyxl.styles import Font, Alignment
23 from openpyxl.utils import get_column_letter 23 from openpyxl.utils import get_column_letter
24 24
25 -# Shared helpers (keeps template writing consistent across scripts)  
26 -from scripts.shoplazza_import_template import create_excel_from_template as _create_excel_from_template_shared  
27 -from scripts.shoplazza_import_template import generate_handle as _generate_handle_shared 25 +REPO_ROOT = Path(__file__).resolve().parents[2]
  26 +sys.path.insert(0, str(REPO_ROOT))
28 27
29 -# Add parent directory to path  
30 -sys.path.insert(0, str(Path(__file__).parent.parent)) 28 +# Shared helpers (keeps template writing consistent across scripts)
  29 +from scripts.data_import.shoplazza_import_template import create_excel_from_template as _create_excel_from_template_shared
  30 +from scripts.data_import.shoplazza_import_template import generate_handle as _generate_handle_shared
31 31
32 32
33 def clean_value(value): 33 def clean_value(value):
@@ -299,4 +299,3 @@ def main(): @@ -299,4 +299,3 @@ def main():
299 299
300 if __name__ == '__main__': 300 if __name__ == '__main__':
301 main() 301 main()
302 -  
scripts/csv_to_excel_multi_variant.py renamed to scripts/data_import/csv_to_excel_multi_variant.py
@@ -22,12 +22,12 @@ import itertools @@ -22,12 +22,12 @@ import itertools
22 from openpyxl import load_workbook 22 from openpyxl import load_workbook
23 from openpyxl.styles import Alignment 23 from openpyxl.styles import Alignment
24 24
25 -# Shared helpers (keeps template writing consistent across scripts)  
26 -from scripts.shoplazza_import_template import create_excel_from_template as _create_excel_from_template_shared  
27 -from scripts.shoplazza_import_template import generate_handle as _generate_handle_shared 25 +REPO_ROOT = Path(__file__).resolve().parents[2]
  26 +sys.path.insert(0, str(REPO_ROOT))
28 27
29 -# Add parent directory to path  
30 -sys.path.insert(0, str(Path(__file__).parent.parent)) 28 +# Shared helpers (keeps template writing consistent across scripts)
  29 +from scripts.data_import.shoplazza_import_template import create_excel_from_template as _create_excel_from_template_shared
  30 +from scripts.data_import.shoplazza_import_template import generate_handle as _generate_handle_shared
31 31
32 # Color definitions 32 # Color definitions
33 COLORS = [ 33 COLORS = [
@@ -562,4 +562,3 @@ def main(): @@ -562,4 +562,3 @@ def main():
562 562
563 if __name__ == '__main__': 563 if __name__ == '__main__':
564 main() 564 main()
565 -  
scripts/shoplazza_excel_template.py renamed to scripts/data_import/shoplazza_excel_template.py
scripts/shoplazza_import_template.py renamed to scripts/data_import/shoplazza_import_template.py
scripts/tenant3__csv_to_shoplazza_xlsx.sh renamed to scripts/data_import/tenant3_csv_to_shoplazza_xlsx.sh
@@ -5,16 +5,16 @@ cd "$(dirname "$0")/.." @@ -5,16 +5,16 @@
5 source ./activate.sh 5 source ./activate.sh
6 6
7 # # Basic usage (generate all data) 7 # # Basic usage (generate all data)
8 -# python scripts/csv_to_excel.py 8 +# python scripts/data_import/csv_to_excel.py
9 9
10 # # Specify the output file 10 # # Specify the output file
11 -# python scripts/csv_to_excel.py --output tenant3_imports.xlsx 11 +# python scripts/data_import/csv_to_excel.py --output tenant3_imports.xlsx
12 12
13 # # Limit the number of rows processed (for testing) 13 # # Limit the number of rows processed (for testing)
14 -# python scripts/csv_to_excel.py --limit 100 14 +# python scripts/data_import/csv_to_excel.py --limit 100
15 15
16 # Specify the CSV file and template file 16 # Specify the CSV file and template file
17 -python scripts/csv_to_excel.py \ 17 +python scripts/data_import/csv_to_excel.py \
18 --csv-file data/customer1/goods_with_pic.5years_congku.csv.shuf.1w \ 18 --csv-file data/customer1/goods_with_pic.5years_congku.csv.shuf.1w \
19 --template docs/商品导入模板.xlsx \ 19 --template docs/商品导入模板.xlsx \
20 - --output tenant3_imports.xlsx  
21 \ No newline at end of file 20 \ No newline at end of file
  21 + --output tenant3_imports.xlsx
scripts/trace_indexer_calls.sh renamed to scripts/debug/trace_indexer_calls.sh
1 #!/bin/bash 1 #!/bin/bash
2 # 2 #
3 # Script for tracing who is calling the indexer service 3 # Script for tracing who is calling the indexer service
4 -# Usage: ./scripts/trace_indexer_calls.sh 4 +# Usage: ./scripts/debug/trace_indexer_calls.sh
5 # 5 #
6 6
7 set -euo pipefail 7 set -euo pipefail
scripts/download_translation_models.py 100755 → 100644
1 #!/usr/bin/env python3 1 #!/usr/bin/env python3
2 -"""Download local translation models declared in services.translation.capabilities.""" 2 +"""Backward-compatible entrypoint for translation model downloads."""
3 3
4 from __future__ import annotations 4 from __future__ import annotations
5 5
6 -import argparse  
7 -import os 6 +import runpy
8 from pathlib import Path 7 from pathlib import Path
9 -import shutil  
10 -import subprocess  
11 -import sys  
12 -from typing import Iterable  
13 -  
14 -from huggingface_hub import snapshot_download  
15 -  
16 -PROJECT_ROOT = Path(__file__).resolve().parent.parent  
17 -if str(PROJECT_ROOT) not in sys.path:  
18 - sys.path.insert(0, str(PROJECT_ROOT))  
19 -os.environ.setdefault("HF_HUB_DISABLE_XET", "1")  
20 -  
21 -from config.services_config import get_translation_config  
22 -  
23 -  
24 -LOCAL_BACKENDS = {"local_nllb", "local_marian"}  
25 -  
26 -  
27 -def iter_local_capabilities(selected: set[str] | None = None) -> Iterable[tuple[str, dict]]:  
28 - cfg = get_translation_config()  
29 - capabilities = cfg.get("capabilities", {}) if isinstance(cfg, dict) else {}  
30 - for name, capability in capabilities.items():  
31 - backend = str(capability.get("backend") or "").strip().lower()  
32 - if backend not in LOCAL_BACKENDS:  
33 - continue  
34 - if selected and name not in selected:  
35 - continue  
36 - yield name, capability  
37 -  
38 -  
39 -def _compute_ct2_output_dir(capability: dict) -> Path:  
40 - custom = str(capability.get("ct2_model_dir") or "").strip()  
41 - if custom:  
42 - return Path(custom).expanduser()  
43 - model_dir = Path(str(capability.get("model_dir") or "")).expanduser()  
44 - compute_type = str(capability.get("ct2_compute_type") or capability.get("torch_dtype") or "default").strip().lower()  
45 - normalized = compute_type.replace("_", "-")  
46 - return model_dir / f"ctranslate2-{normalized}"  
47 -  
48 -  
49 -def _resolve_converter_binary() -> str:  
50 - candidate = shutil.which("ct2-transformers-converter")  
51 - if candidate:  
52 - return candidate  
53 - venv_candidate = Path(sys.executable).absolute().parent / "ct2-transformers-converter"  
54 - if venv_candidate.exists():  
55 - return str(venv_candidate)  
56 - raise RuntimeError(  
57 - "ct2-transformers-converter was not found. "  
58 - "Install ctranslate2 in the active Python environment first."  
59 - )  
60 -  
61 -  
62 -def convert_to_ctranslate2(name: str, capability: dict) -> None:  
63 - model_id = str(capability.get("model_id") or "").strip()  
64 - model_dir = Path(str(capability.get("model_dir") or "")).expanduser()  
65 - model_source = str(model_dir if model_dir.exists() else model_id)  
66 - output_dir = _compute_ct2_output_dir(capability)  
67 - if (output_dir / "model.bin").exists():  
68 - print(f"[skip-convert] {name} -> {output_dir}")  
69 - return  
70 - quantization = str(  
71 - capability.get("ct2_conversion_quantization")  
72 - or capability.get("ct2_compute_type")  
73 - or capability.get("torch_dtype")  
74 - or "default"  
75 - ).strip()  
76 - output_dir.parent.mkdir(parents=True, exist_ok=True)  
77 - print(f"[convert] {name} -> {output_dir} ({quantization})")  
78 - subprocess.run(  
79 - [  
80 - _resolve_converter_binary(),  
81 - "--model",  
82 - model_source,  
83 - "--output_dir",  
84 - str(output_dir),  
85 - "--quantization",  
86 - quantization,  
87 - ],  
88 - check=True,  
89 - )  
90 - print(f"[converted] {name}")  
91 -  
92 -  
93 -def main() -> None:  
94 - parser = argparse.ArgumentParser(description="Download local translation models")  
95 - parser.add_argument("--all-local", action="store_true", help="Download all configured local translation models")  
96 - parser.add_argument("--models", nargs="*", default=[], help="Specific capability names to download")  
97 - parser.add_argument(  
98 - "--convert-ctranslate2",  
99 - action="store_true",  
100 - help="Also convert the downloaded Hugging Face models into CTranslate2 format",  
101 - )  
102 - args = parser.parse_args()  
103 -  
104 - selected = {item.strip().lower() for item in args.models if item.strip()} or None  
105 - if not args.all_local and not selected:  
106 - parser.error("pass --all-local or --models <name> ...")  
107 -  
108 - for name, capability in iter_local_capabilities(selected):  
109 - model_id = str(capability.get("model_id") or "").strip()  
110 - model_dir = Path(str(capability.get("model_dir") or "")).expanduser()  
111 - if not model_id or not model_dir:  
112 - raise ValueError(f"Capability '{name}' must define model_id and model_dir")  
113 - model_dir.parent.mkdir(parents=True, exist_ok=True)  
114 - print(f"[download] {name} -> {model_dir} ({model_id})")  
115 - snapshot_download(  
116 - repo_id=model_id,  
117 - local_dir=str(model_dir),  
118 - )  
119 - print(f"[done] {name}")  
120 - if args.convert_ctranslate2:  
121 - convert_to_ctranslate2(name, capability)  
122 8
123 9
124 if __name__ == "__main__": 10 if __name__ == "__main__":
125 - main() 11 + target = Path(__file__).resolve().parent / "translation" / "download_translation_models.py"
  12 + runpy.run_path(str(target), run_name="__main__")
scripts/evaluation/README.md
@@ -127,8 +127,8 @@ This framework now follows graded ranking evaluation closer to e-commerce best p @@ -127,8 +127,8 @@ This framework now follows graded ranking evaluation closer to e-commerce best p
127 - **Composite tuning score: `Primary_Metric_Score`** 127 - **Composite tuning score: `Primary_Metric_Score`**
128 For experiment ranking we compute the mean of the primary scorecard after normalizing `Avg_Grade@10` by the max grade (`3`). 128 For experiment ranking we compute the mean of the primary scorecard after normalizing `Avg_Grade@10` by the max grade (`3`).
129 - **Gain scheme** 129 - **Gain scheme**
130 - `Fully Relevant=7`, `Mostly Relevant=3`, `Weakly Relevant=1`, `Irrelevant=0`  
131 - The gains come from rel grades `3/2/1/0` with `gain = 2^rel - 1`, a standard `NDCG` setup. 130 + `Fully Relevant=3`, `Mostly Relevant=2`, `Weakly Relevant=1`, `Irrelevant=0`
  131 + We keep the rel grades `3/2/1/0`, but the current implementation uses the grade values directly as gains, so the gap between `Fully Relevant` and `Mostly Relevant` is less aggressive than under the standard `2^rel - 1` scheme.
132 - **Why this is better** 132 - **Why this is better**
133 `NDCG` differentiates “exact”, “strong substitute”, and “weak substitute”, so swapping an `Fully Relevant` with a `Weakly Relevant` item is penalized more than swapping `Mostly Relevant` with `Weakly Relevant`. 133 `NDCG` differentiates “exact”, “strong substitute”, and “weak substitute”, so swapping an `Fully Relevant` with a `Weakly Relevant` item is penalized more than swapping `Mostly Relevant` with `Weakly Relevant`.
134 134
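The effect of the linear-gain change described above can be sanity-checked with a small standalone sketch. The grade values (`3/2/1/0`) and label names come from this diff; the `GRADES`, `dcg`, and `ndcg` helpers below are hypothetical illustrations, not the framework's own code:

```python
import math

# Grade values from the evaluation constants: under the linear scheme the
# gain IS the grade; under the standard NDCG scheme gain = 2^rel - 1.
GRADES = {"Fully Relevant": 3, "Mostly Relevant": 2,
          "Weakly Relevant": 1, "Irrelevant": 0}

def dcg(gains):
    # Discounted cumulative gain with the usual log2(rank + 1) discount.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(labels, gain_fn):
    gains = [gain_fn(GRADES[label]) for label in labels]
    ideal = sorted(gains, reverse=True)
    return dcg(gains) / dcg(ideal) if dcg(ideal) else 0.0

# One swap: the Fully Relevant item is ranked below a Mostly Relevant one.
ranking = ["Mostly Relevant", "Fully Relevant", "Weakly Relevant", "Irrelevant"]
linear = ndcg(ranking, lambda rel: rel)           # gains 3/2/1/0
exponential = ndcg(ranking, lambda rel: 2**rel - 1)  # gains 7/3/1/0
```

The same swap costs more under the exponential scheme than under the linear one, which is exactly the "less aggressive exact/high gap" trade-off the README describes.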
@@ -174,6 +174,22 @@ Features: query list from `queries.txt`, single-query and batch evaluation, batc @@ -174,6 +174,22 @@ Features: query list from `queries.txt`, single-query and batch evaluation, batc
174 174
175 Each run stores aggregate and per-query metrics, label distribution, timestamp, metric context (including gain scheme and primary metric), and an `/admin/config` snapshot, as Markdown and JSON under `batch_reports/`. 175 Each run stores aggregate and per-query metrics, label distribution, timestamp, metric context (including gain scheme and primary metric), and an `/admin/config` snapshot, as Markdown and JSON under `batch_reports/`.
176 176
  177 +To make later case analysis reproducible without digging through backend logs, each per-query record in the batch JSON now also includes:
  178 +
  179 +- `request_id` — the exact `X-Request-ID` sent by the evaluator for that live search call
  180 +- `top_label_sequence_top10` / `top_label_sequence_top20` — compact label sequence strings such as `1:L3 | 2:L1 | 3:L2`
  181 +- `top_results` — a lightweight snapshot of up to the top 20 results with `rank`, `spu_id`, `label`, title fields, and `relevance_score`
  182 +
  183 +The Markdown report now surfaces the same case context in a lighter human-readable form:
  184 +
  185 +- request id
  186 +- top-10 / top-20 label sequence
  187 +- top 5 result snapshot for quick scanning
  188 +
  189 +This means a bad case can usually be reconstructed directly from the batch artifact itself, without replaying logs or joining SQLite tables by hand.
  190 +
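Reconstructing a bad case from the artifact alone can be sketched as a few lines over the batch JSON. The field names (`per_query`, `request_id`, `top_label_sequence_top10`, `metrics`) follow this change; the exact metric key (`"NDCG@10"`) and the `worst_queries` helper are assumptions for illustration:

```python
import json

def worst_queries(batch_payload, metric_key="NDCG@10", n=3):
    """Rank per-query records by a chosen metric, worst first."""
    rows = batch_payload.get("per_query", [])
    ranked = sorted(rows, key=lambda r: (r.get("metrics") or {}).get(metric_key, 0.0))
    return [
        {
            "query": r.get("query"),
            "score": (r.get("metrics") or {}).get(metric_key),
            "request_id": r.get("request_id"),
            "labels": r.get("top_label_sequence_top10"),
        }
        for r in ranked[:n]
    ]

# In practice this payload would be json.load()-ed from batch_reports/.
payload = {
    "per_query": [
        {"query": "遥控车", "metrics": {"NDCG@10": 0.91}, "request_id": "a1b2c3d4",
         "top_label_sequence_top10": "1:L3 | 2:L2 | 3:L1"},
        {"query": "喷雾玩具", "metrics": {"NDCG@10": 0.42}, "request_id": "e5f6a7b8",
         "top_label_sequence_top10": "1:L1 | 2:L0 | 3:L0"},
    ]
}
print(json.dumps(worst_queries(payload, n=1), ensure_ascii=False, indent=2))
```

The `request_id` from each row can then be grepped against backend logs directly, without joining SQLite tables by hand.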
  191 +The web history endpoint intentionally returns a compact summary only (aggregate metrics plus query count), so adding richer per-query snapshots to the batch payload does not bloat the history list UI.
  192 +
177 ## Ranking debug and LTR prep 193 ## Ranking debug and LTR prep
178 194
179 `debug_info` now exposes two extra layers that are useful for tuning and future learning-to-rank work: 195 `debug_info` now exposes two extra layers that are useful for tuning and future learning-to-rank work:
scripts/evaluation/eval_framework/__init__.py
@@ -14,10 +14,10 @@ from .constants import ( # noqa: E402 @@ -14,10 +14,10 @@ from .constants import ( # noqa: E402
14 DEFAULT_ARTIFACT_ROOT, 14 DEFAULT_ARTIFACT_ROOT,
15 DEFAULT_QUERY_FILE, 15 DEFAULT_QUERY_FILE,
16 PROJECT_ROOT, 16 PROJECT_ROOT,
17 - RELEVANCE_EXACT,  
18 - RELEVANCE_HIGH,  
19 - RELEVANCE_IRRELEVANT,  
20 - RELEVANCE_LOW, 17 + RELEVANCE_LV0,
  18 + RELEVANCE_LV1,
  19 + RELEVANCE_LV2,
  20 + RELEVANCE_LV3,
21 RELEVANCE_NON_IRRELEVANT, 21 RELEVANCE_NON_IRRELEVANT,
22 VALID_LABELS, 22 VALID_LABELS,
23 ) 23 )
@@ -39,10 +39,10 @@ __all__ = [ @@ -39,10 +39,10 @@ __all__ = [
39 "EvalStore", 39 "EvalStore",
40 "PROJECT_ROOT", 40 "PROJECT_ROOT",
41 "QueryBuildResult", 41 "QueryBuildResult",
42 - "RELEVANCE_EXACT",  
43 - "RELEVANCE_HIGH",  
44 - "RELEVANCE_IRRELEVANT",  
45 - "RELEVANCE_LOW", 42 + "RELEVANCE_LV0",
  43 + "RELEVANCE_LV1",
  44 + "RELEVANCE_LV2",
  45 + "RELEVANCE_LV3",
46 "RELEVANCE_NON_IRRELEVANT", 46 "RELEVANCE_NON_IRRELEVANT",
47 "SearchEvaluationFramework", 47 "SearchEvaluationFramework",
48 "VALID_LABELS", 48 "VALID_LABELS",
scripts/evaluation/eval_framework/clients.py
@@ -157,6 +157,7 @@ class SearchServiceClient: @@ -157,6 +157,7 @@ class SearchServiceClient:
157 return self._request_json("GET", path, timeout=timeout) 157 return self._request_json("GET", path, timeout=timeout)
158 158
159 def search(self, query: str, size: int, from_: int = 0, language: str = "en", *, debug: bool = False) -> Dict[str, Any]: 159 def search(self, query: str, size: int, from_: int = 0, language: str = "en", *, debug: bool = False) -> Dict[str, Any]:
  160 + request_id = uuid.uuid4().hex[:8]
160 payload: Dict[str, Any] = { 161 payload: Dict[str, Any] = {
161 "query": query, 162 "query": query,
162 "size": size, 163 "size": size,
@@ -165,13 +166,19 @@ class SearchServiceClient: @@ -165,13 +166,19 @@ class SearchServiceClient:
165 } 166 }
166 if debug: 167 if debug:
167 payload["debug"] = True 168 payload["debug"] = True
168 - return self._request_json( 169 + response = self._request_json(
169 "POST", 170 "POST",
170 "/search/", 171 "/search/",
171 timeout=120, 172 timeout=120,
172 - headers={"Content-Type": "application/json", "X-Tenant-ID": self.tenant_id}, 173 + headers={
  174 + "Content-Type": "application/json",
  175 + "X-Tenant-ID": self.tenant_id,
  176 + "X-Request-ID": request_id,
  177 + },
173 json_payload=payload, 178 json_payload=payload,
174 ) 179 )
  180 + response["_eval_request_id"] = request_id
  181 + return response
175 182
176 183
177 class RerankServiceClient: 184 class RerankServiceClient:
scripts/evaluation/eval_framework/constants.py
@@ -7,24 +7,24 @@ _SCRIPTS_EVAL_DIR = _PKG_DIR.parent @@ -7,24 +7,24 @@ _SCRIPTS_EVAL_DIR = _PKG_DIR.parent
7 PROJECT_ROOT = _SCRIPTS_EVAL_DIR.parents[1] 7 PROJECT_ROOT = _SCRIPTS_EVAL_DIR.parents[1]
8 8
9 # Canonical English labels (must match LLM prompt output in prompts._CLASSIFY_TEMPLATE_EN) 9 # Canonical English labels (must match LLM prompt output in prompts._CLASSIFY_TEMPLATE_EN)
10 -RELEVANCE_EXACT = "Fully Relevant"  
11 -RELEVANCE_HIGH = "Mostly Relevant"  
12 -RELEVANCE_LOW = "Weakly Relevant"  
13 -RELEVANCE_IRRELEVANT = "Irrelevant" 10 +RELEVANCE_LV3 = "Fully Relevant"
  11 +RELEVANCE_LV2 = "Mostly Relevant"
  12 +RELEVANCE_LV1 = "Weakly Relevant"
  13 +RELEVANCE_LV0 = "Irrelevant"
14 14
15 -VALID_LABELS = frozenset({RELEVANCE_EXACT, RELEVANCE_HIGH, RELEVANCE_LOW, RELEVANCE_IRRELEVANT}) 15 +VALID_LABELS = frozenset({RELEVANCE_LV3, RELEVANCE_LV2, RELEVANCE_LV1, RELEVANCE_LV0})
16 16
17 # Useful label sets for binary diagnostic slices layered on top of graded ranking metrics. 17 # Useful label sets for binary diagnostic slices layered on top of graded ranking metrics.
18 -RELEVANCE_NON_IRRELEVANT = frozenset({RELEVANCE_EXACT, RELEVANCE_HIGH, RELEVANCE_LOW})  
19 -RELEVANCE_STRONG = frozenset({RELEVANCE_EXACT, RELEVANCE_HIGH}) 18 +RELEVANCE_NON_IRRELEVANT = frozenset({RELEVANCE_LV3, RELEVANCE_LV2, RELEVANCE_LV1})
  19 +RELEVANCE_STRONG = frozenset({RELEVANCE_LV3, RELEVANCE_LV2})
20 20
21 # Graded relevance for ranking evaluation. 21 # Graded relevance for ranking evaluation.
22 # We use rel grades 3/2/1/0 and gain = 2^rel - 1, which is standard for NDCG-style metrics. 22 # We use rel grades 3/2/1/0 and gain = 2^rel - 1, which is standard for NDCG-style metrics.
23 RELEVANCE_GRADE_MAP = { 23 RELEVANCE_GRADE_MAP = {
24 - RELEVANCE_EXACT: 3,  
25 - RELEVANCE_HIGH: 2,  
26 - RELEVANCE_LOW: 1,  
27 - RELEVANCE_IRRELEVANT: 0, 24 + RELEVANCE_LV3: 3,
  25 + RELEVANCE_LV2: 2,
  26 + RELEVANCE_LV1: 1,
  27 + RELEVANCE_LV0: 0,
28 } 28 }
29 # Standard gain formula: 2^rel - 1 29 # Standard gain formula: 2^rel - 1
30 # But because annotation quality is not especially precise, we deliberately reduce the separation between exact and high 30 # But because annotation quality is not especially precise, we deliberately reduce the separation between exact and high
@@ -35,11 +35,12 @@ RELEVANCE_GAIN_MAP = { @@ -35,11 +35,12 @@ RELEVANCE_GAIN_MAP = {
35 } 35 }
36 36
37 # P(stop | relevance) for ERR (Expected Reciprocal Rank); cascade model (Chapelle et al., 2009). 37 # P(stop | relevance) for ERR (Expected Reciprocal Rank); cascade model (Chapelle et al., 2009).
  38 +# p(t) = (2^t - 1) / 2^{max_grade}; with max_grade = 3 this gives 0.875 / 0.375 / 0.125 / 0.0
38 STOP_PROB_MAP = { 39 STOP_PROB_MAP = {
39 - RELEVANCE_EXACT: 0.99,  
40 - RELEVANCE_HIGH: 0.8,  
41 - RELEVANCE_LOW: 0.1,  
42 - RELEVANCE_IRRELEVANT: 0.0, 40 + RELEVANCE_LV3: 0.875,
  41 + RELEVANCE_LV2: 0.375,
  42 + RELEVANCE_LV1: 0.125,
  43 + RELEVANCE_LV0: 0.0,
43 } 44 }
44 45
45 DEFAULT_ARTIFACT_ROOT = PROJECT_ROOT / "artifacts" / "search_evaluation" 46 DEFAULT_ARTIFACT_ROOT = PROJECT_ROOT / "artifacts" / "search_evaluation"
@@ -78,7 +79,7 @@ DEFAULT_REBUILD_MAX_LLM_BATCHES = 40 @@ -78,7 +79,7 @@ DEFAULT_REBUILD_MAX_LLM_BATCHES = 40
78 # A batch is "bad" when **both** hold (strict inequalities; see ``framework._annotate_rebuild_batches``): 79 # A batch is "bad" when **both** hold (strict inequalities; see ``framework._annotate_rebuild_batches``):
79 # - irrelevant_ratio > DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO (default 93.9%), 80 # - irrelevant_ratio > DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO (default 93.9%),
80 # - (Irrelevant + Weakly Relevant) / n > DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO (default 95.9%). 81 # - (Irrelevant + Weakly Relevant) / n > DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO (default 95.9%).
81 -# ``irrelevant_ratio`` = Irrelevant count / n; weak relevance is ``RELEVANCE_LOW`` ("Weakly Relevant"). 82 +# ``irrelevant_ratio`` = Irrelevant count / n; weak relevance is ``RELEVANCE_LV1`` ("Weakly Relevant").
82 # Increment streak on consecutive bad batches; reset on any non-bad batch. Stop when streak 83 # Increment streak on consecutive bad batches; reset on any non-bad batch. Stop when streak
83 # reaches ``DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK`` (default 3). 84 # reaches ``DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK`` (default 3).
84 DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO = 0.799 85 DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO = 0.799
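The new `STOP_PROB_MAP` values plug into ERR's cascade model. A minimal sketch, using the stop probabilities from this diff (the `expected_reciprocal_rank` helper is hypothetical, not the framework's implementation):

```python
# P(stop | relevance) from constants.py: p(t) = (2^t - 1) / 2^max_grade, max_grade = 3.
STOP_PROB = {"Fully Relevant": 0.875, "Mostly Relevant": 0.375,
             "Weakly Relevant": 0.125, "Irrelevant": 0.0}

def expected_reciprocal_rank(labels):
    """ERR under the cascade model (Chapelle et al., 2009)."""
    err = 0.0
    p_continue = 1.0  # probability the user scans down to this rank
    for rank, label in enumerate(labels, start=1):
        p_stop = STOP_PROB.get(label, 0.0)
        err += p_continue * p_stop / rank
        p_continue *= 1.0 - p_stop
    return err

err_good = expected_reciprocal_rank(["Fully Relevant", "Mostly Relevant", "Weakly Relevant"])
err_bad = expected_reciprocal_rank(["Weakly Relevant", "Irrelevant", "Fully Relevant"])
# A Fully Relevant hit at rank 1 dominates the score.
```

Compared with the previous hand-picked probabilities (0.99 / 0.8 / 0.1 / 0.0), the derived values lower the stop probability for `Mostly Relevant` substantially, so ERR now leans harder on getting a fully relevant item to rank 1.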
scripts/evaluation/eval_framework/framework.py
@@ -25,14 +25,14 @@ from .constants import ( @@ -25,14 +25,14 @@ from .constants import (
25 DEFAULT_RERANK_HIGH_SKIP_COUNT, 25 DEFAULT_RERANK_HIGH_SKIP_COUNT,
26 DEFAULT_RERANK_HIGH_THRESHOLD, 26 DEFAULT_RERANK_HIGH_THRESHOLD,
27 DEFAULT_SEARCH_RECALL_TOP_K, 27 DEFAULT_SEARCH_RECALL_TOP_K,
28 - RELEVANCE_EXACT,  
29 RELEVANCE_GAIN_MAP, 28 RELEVANCE_GAIN_MAP,
30 - RELEVANCE_HIGH,  
31 - STOP_PROB_MAP,  
32 - RELEVANCE_IRRELEVANT,  
33 - RELEVANCE_LOW, 29 + RELEVANCE_LV0,
  30 + RELEVANCE_LV1,
  31 + RELEVANCE_LV2,
  32 + RELEVANCE_LV3,
34 RELEVANCE_NON_IRRELEVANT, 33 RELEVANCE_NON_IRRELEVANT,
35 VALID_LABELS, 34 VALID_LABELS,
  35 + STOP_PROB_MAP,
36 ) 36 )
37 from .metrics import ( 37 from .metrics import (
38 PRIMARY_METRIC_GRADE_NORMALIZER, 38 PRIMARY_METRIC_GRADE_NORMALIZER,
@@ -96,6 +96,16 @@ def _zh_titles_from_debug_per_result(debug_info: Any) -&gt; Dict[str, str]: @@ -96,6 +96,16 @@ def _zh_titles_from_debug_per_result(debug_info: Any) -&gt; Dict[str, str]:
96 return out 96 return out
97 97
98 98
  99 +def _encode_label_sequence(items: Sequence[Dict[str, Any]], limit: int) -> str:
  100 + parts: List[str] = []
  101 + for item in items[:limit]:
  102 + rank = int(item.get("rank") or 0)
  103 + label = str(item.get("label") or "")
  104 + grade = RELEVANCE_GAIN_MAP.get(label)
  105 + parts.append(f"{rank}:L{grade}" if grade is not None else f"{rank}:?")
  106 + return " | ".join(parts)
  107 +
  108 +
99 class SearchEvaluationFramework: 109 class SearchEvaluationFramework:
100 def __init__( 110 def __init__(
101 self, 111 self,
@@ -168,7 +178,7 @@ class SearchEvaluationFramework: @@ -168,7 +178,7 @@ class SearchEvaluationFramework:
168 ) -> Dict[str, Any]: 178 ) -> Dict[str, Any]:
169 live = self.evaluate_live_query(query=query, top_k=top_k, auto_annotate=auto_annotate, language=language) 179 live = self.evaluate_live_query(query=query, top_k=top_k, auto_annotate=auto_annotate, language=language)
170 labels = [ 180 labels = [
171 - item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT 181 + item["label"] if item["label"] in VALID_LABELS else RELEVANCE_LV0
172 for item in live["results"] 182 for item in live["results"]
173 ] 183 ]
174 return { 184 return {
@@ -432,7 +442,7 @@ class SearchEvaluationFramework: @@ -432,7 +442,7 @@ class SearchEvaluationFramework:
432 442
433 - ``#(Irrelevant)/n > irrelevant_stop_ratio`` (default 0.939), and 443 - ``#(Irrelevant)/n > irrelevant_stop_ratio`` (default 0.939), and
434 - ``( #(Irrelevant) + #(Weakly Relevant) ) / n > irrelevant_low_combined_stop_ratio`` 444 - ``( #(Irrelevant) + #(Weakly Relevant) ) / n > irrelevant_low_combined_stop_ratio``
435 - (default 0.959; weak relevance = ``RELEVANCE_LOW``). 445 + (default 0.959; weak relevance = ``RELEVANCE_LV1``).
436 446
437 Maintain a streak of consecutive *bad* batches; any non-bad batch resets the streak to 0. 447 Maintain a streak of consecutive *bad* batches; any non-bad batch resets the streak to 0.
438 Stop labeling when ``streak >= stop_streak`` (default 3) or when ``max_batches`` is reached 448 Stop labeling when ``streak >= stop_streak`` (default 3) or when ``max_batches`` is reached
@@ -474,9 +484,9 @@ class SearchEvaluationFramework: @@ -474,9 +484,9 @@ class SearchEvaluationFramework:
474 time.sleep(0.1) 484 time.sleep(0.1)
475 485
476 n = len(batch_docs) 486 n = len(batch_docs)
477 - exact_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_EXACT)  
478 - irrel_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_IRRELEVANT)  
479 - low_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_LOW) 487 + exact_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_LV3)
  488 + irrel_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_LV0)
  489 + low_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_LV1)
480 exact_ratio = exact_n / n if n else 0.0 490 exact_ratio = exact_n / n if n else 0.0
481 irrelevant_ratio = irrel_n / n if n else 0.0 491 irrelevant_ratio = irrel_n / n if n else 0.0
482 low_ratio = low_n / n if n else 0.0 492 low_ratio = low_n / n if n else 0.0
@@ -633,7 +643,7 @@ class SearchEvaluationFramework: @@ -633,7 +643,7 @@ class SearchEvaluationFramework:
633 ) 643 )
634 644
635 top100_labels = [ 645 top100_labels = [
636 - item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT 646 + item["label"] if item["label"] in VALID_LABELS else RELEVANCE_LV0
637 for item in search_labeled_results[:100] 647 for item in search_labeled_results[:100]
638 ] 648 ]
639 metrics = compute_query_metrics(top100_labels, ideal_labels=list(labels.values())) 649 metrics = compute_query_metrics(top100_labels, ideal_labels=list(labels.values()))
@@ -843,7 +853,7 @@ class SearchEvaluationFramework: @@ -843,7 +853,7 @@ class SearchEvaluationFramework:
843 ) 853 )
844 854
845 top100_labels = [ 855 top100_labels = [
846 - item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT 856 + item["label"] if item["label"] in VALID_LABELS else RELEVANCE_LV0
847 for item in search_labeled_results[:100] 857 for item in search_labeled_results[:100]
848 ] 858 ]
849 metrics = compute_query_metrics(top100_labels, ideal_labels=list(labels.values())) 859 metrics = compute_query_metrics(top100_labels, ideal_labels=list(labels.values()))
@@ -920,16 +930,17 @@ class SearchEvaluationFramework: @@ -920,16 +930,17 @@ class SearchEvaluationFramework:
920 "title_zh": title_zh if title_zh and title_zh != primary_title else "", 930 "title_zh": title_zh if title_zh and title_zh != primary_title else "",
921 "image_url": doc.get("image_url"), 931 "image_url": doc.get("image_url"),
922 "label": label, 932 "label": label,
  933 + "relevance_score": doc.get("relevance_score"),
923 "option_values": list(compact_option_values(doc.get("skus") or [])), 934 "option_values": list(compact_option_values(doc.get("skus") or [])),
924 "product": compact_product_payload(doc), 935 "product": compact_product_payload(doc),
925 } 936 }
926 ) 937 )
927 metric_labels = [ 938 metric_labels = [
928 - item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT 939 + item["label"] if item["label"] in VALID_LABELS else RELEVANCE_LV0
929 for item in labeled 940 for item in labeled
930 ] 941 ]
931 ideal_labels = [ 942 ideal_labels = [
932 - label if label in VALID_LABELS else RELEVANCE_IRRELEVANT 943 + label if label in VALID_LABELS else RELEVANCE_LV0
933 for label in labels.values() 944 for label in labels.values()
934 ] 945 ]
935 label_stats = self.store.get_query_label_stats(self.tenant_id, query) 946 label_stats = self.store.get_query_label_stats(self.tenant_id, query)
@@ -960,10 +971,10 @@ class SearchEvaluationFramework: @@ -960,10 +971,10 @@ class SearchEvaluationFramework:
960 } 971 }
961 ) 972 )
962 label_order = { 973 label_order = {
963 - RELEVANCE_EXACT: 0,  
964 - RELEVANCE_HIGH: 1,  
965 - RELEVANCE_LOW: 2,  
966 - RELEVANCE_IRRELEVANT: 3, 974 + RELEVANCE_LV3: 0,
  975 + RELEVANCE_LV2: 1,
  976 + RELEVANCE_LV1: 2,
  977 + RELEVANCE_LV0: 3,
967 } 978 }
968 missing_relevant.sort( 979 missing_relevant.sort(
969 key=lambda item: ( 980 key=lambda item: (
@@ -989,6 +1000,7 @@ class SearchEvaluationFramework: @@ -989,6 +1000,7 @@ class SearchEvaluationFramework:
989 "top_k": top_k, 1000 "top_k": top_k,
990 "metrics": compute_query_metrics(metric_labels, ideal_labels=ideal_labels), 1001 "metrics": compute_query_metrics(metric_labels, ideal_labels=ideal_labels),
991 "metric_context": _metric_context_payload(), 1002 "metric_context": _metric_context_payload(),
  1003 + "request_id": str(search_payload.get("_eval_request_id") or ""),
992 "results": labeled, 1004 "results": labeled,
993 "missing_relevant": missing_relevant, 1005 "missing_relevant": missing_relevant,
994 "label_stats": { 1006 "label_stats": {
@@ -996,9 +1008,9 @@ class SearchEvaluationFramework: @@ -996,9 +1008,9 @@ class SearchEvaluationFramework:
996 "unlabeled_hits_treated_irrelevant": unlabeled_hits, 1008 "unlabeled_hits_treated_irrelevant": unlabeled_hits,
997 "recalled_hits": len(labeled), 1009 "recalled_hits": len(labeled),
998 "missing_relevant_count": len(missing_relevant), 1010 "missing_relevant_count": len(missing_relevant),
999 - "missing_exact_count": sum(1 for item in missing_relevant if item["label"] == RELEVANCE_EXACT),  
1000 - "missing_high_count": sum(1 for item in missing_relevant if item["label"] == RELEVANCE_HIGH),  
1001 - "missing_low_count": sum(1 for item in missing_relevant if item["label"] == RELEVANCE_LOW), 1011 + "missing_exact_count": sum(1 for item in missing_relevant if item["label"] == RELEVANCE_LV3),
  1012 + "missing_high_count": sum(1 for item in missing_relevant if item["label"] == RELEVANCE_LV2),
  1013 + "missing_low_count": sum(1 for item in missing_relevant if item["label"] == RELEVANCE_LV1),
1002 }, 1014 },
1003 "tips": tips, 1015 "tips": tips,
1004 "total": int(search_payload.get("total") or 0), 1016 "total": int(search_payload.get("total") or 0),
@@ -1014,6 +1026,7 @@ class SearchEvaluationFramework: @@ -1014,6 +1026,7 @@ class SearchEvaluationFramework:
1014 force_refresh_labels: bool = False, 1026 force_refresh_labels: bool = False,
1015 ) -> Dict[str, Any]: 1027 ) -> Dict[str, Any]:
1016 per_query = [] 1028 per_query = []
  1029 + case_snapshot_top_n = min(max(int(top_k), 1), 20)
1017 total_q = len(queries) 1030 total_q = len(queries)
1018 _log.info("[batch-eval] starting %s queries top_k=%s auto_annotate=%s", total_q, top_k, auto_annotate) 1031 _log.info("[batch-eval] starting %s queries top_k=%s auto_annotate=%s", total_q, top_k, auto_annotate)
1019 for q_index, query in enumerate(queries, start=1): 1032 for q_index, query in enumerate(queries, start=1):
@@ -1025,7 +1038,7 @@ class SearchEvaluationFramework: @@ -1025,7 +1038,7 @@ class SearchEvaluationFramework:
1025 force_refresh_labels=force_refresh_labels, 1038 force_refresh_labels=force_refresh_labels,
1026 ) 1039 )
1027 labels = [ 1040 labels = [
1028 - item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT 1041 + item["label"] if item["label"] in VALID_LABELS else RELEVANCE_LV0
1029 for item in live["results"] 1042 for item in live["results"]
1030 ] 1043 ]
1031 per_query.append( 1044 per_query.append(
@@ -1036,6 +1049,21 @@ class SearchEvaluationFramework: @@ -1036,6 +1049,21 @@ class SearchEvaluationFramework:
1036 "metrics": live["metrics"], 1049 "metrics": live["metrics"],
1037 "distribution": label_distribution(labels), 1050 "distribution": label_distribution(labels),
1038 "total": live["total"], 1051 "total": live["total"],
  1052 + "request_id": live.get("request_id") or "",
  1053 + "case_snapshot_top_n": case_snapshot_top_n,
  1054 + "top_label_sequence_top10": _encode_label_sequence(live["results"], 10),
  1055 + "top_label_sequence_top20": _encode_label_sequence(live["results"], case_snapshot_top_n),
  1056 + "top_results": [
  1057 + {
  1058 + "rank": int(item.get("rank") or 0),
  1059 + "spu_id": str(item.get("spu_id") or ""),
  1060 + "label": item.get("label"),
  1061 + "title": item.get("title"),
  1062 + "title_zh": item.get("title_zh"),
  1063 + "relevance_score": item.get("relevance_score"),
  1064 + }
  1065 + for item in live["results"][:case_snapshot_top_n]
  1066 + ],
1039 } 1067 }
1040 ) 1068 )
1041 m = live["metrics"] 1069 m = live["metrics"]
@@ -1055,10 +1083,10 @@ class SearchEvaluationFramework: @@ -1055,10 +1083,10 @@ class SearchEvaluationFramework:
1055 ) 1083 )
1056 aggregate = aggregate_metrics([item["metrics"] for item in per_query]) 1084 aggregate = aggregate_metrics([item["metrics"] for item in per_query])
1057 aggregate_distribution = { 1085 aggregate_distribution = {
1058 - RELEVANCE_EXACT: sum(item["distribution"][RELEVANCE_EXACT] for item in per_query),  
1059 - RELEVANCE_HIGH: sum(item["distribution"][RELEVANCE_HIGH] for item in per_query),  
1060 - RELEVANCE_LOW: sum(item["distribution"][RELEVANCE_LOW] for item in per_query),  
1061 - RELEVANCE_IRRELEVANT: sum(item["distribution"][RELEVANCE_IRRELEVANT] for item in per_query), 1086 + RELEVANCE_LV3: sum(item["distribution"][RELEVANCE_LV3] for item in per_query),
  1087 + RELEVANCE_LV2: sum(item["distribution"][RELEVANCE_LV2] for item in per_query),
  1088 + RELEVANCE_LV1: sum(item["distribution"][RELEVANCE_LV1] for item in per_query),
  1089 + RELEVANCE_LV0: sum(item["distribution"][RELEVANCE_LV0] for item in per_query),
1062 } 1090 }
1063 batch_id = f"batch_{utc_timestamp()}_{sha1_text(self.tenant_id + '|' + '|'.join(queries))[:10]}" 1091 batch_id = f"batch_{utc_timestamp()}_{sha1_text(self.tenant_id + '|' + '|'.join(queries))[:10]}"
1064 report_dir = ensure_dir(self.artifact_root / "batch_reports") 1092 report_dir = ensure_dir(self.artifact_root / "batch_reports")
scripts/evaluation/eval_framework/metrics.py
@@ -6,12 +6,12 @@ import math @@ -6,12 +6,12 @@ import math
6 from typing import Dict, Iterable, Sequence 6 from typing import Dict, Iterable, Sequence
7 7
8 from .constants import ( 8 from .constants import (
9 - RELEVANCE_EXACT,  
10 RELEVANCE_GAIN_MAP, 9 RELEVANCE_GAIN_MAP,
11 RELEVANCE_GRADE_MAP, 10 RELEVANCE_GRADE_MAP,
12 - RELEVANCE_HIGH,  
13 - RELEVANCE_IRRELEVANT,  
14 - RELEVANCE_LOW, 11 + RELEVANCE_LV0,
  12 + RELEVANCE_LV1,
  13 + RELEVANCE_LV2,
  14 + RELEVANCE_LV3,
15 RELEVANCE_NON_IRRELEVANT, 15 RELEVANCE_NON_IRRELEVANT,
16 RELEVANCE_STRONG, 16 RELEVANCE_STRONG,
17 STOP_PROB_MAP, 17 STOP_PROB_MAP,
@@ -33,7 +33,7 @@ PRIMARY_METRIC_GRADE_NORMALIZER = float(max(RELEVANCE_GRADE_MAP.values()) or 1.0 @@ -33,7 +33,7 @@ PRIMARY_METRIC_GRADE_NORMALIZER = float(max(RELEVANCE_GRADE_MAP.values()) or 1.0
33 def _normalize_label(label: str) -> str: 33 def _normalize_label(label: str) -> str:
34 if label in RELEVANCE_GRADE_MAP: 34 if label in RELEVANCE_GRADE_MAP:
35 return label 35 return label
36 - return RELEVANCE_IRRELEVANT 36 + return RELEVANCE_LV0
37 37
38 38
39 def _gains_for_labels(labels: Sequence[str]) -> list[float]: 39 def _gains_for_labels(labels: Sequence[str]) -> list[float]:
@@ -135,7 +135,7 @@ def compute_query_metrics( @@ -135,7 +135,7 @@ def compute_query_metrics(
135 ideal = list(ideal_labels) if ideal_labels is not None else list(labels) 135 ideal = list(ideal_labels) if ideal_labels is not None else list(labels)
136 metrics: Dict[str, float] = {} 136 metrics: Dict[str, float] = {}
137 137
138 - exact_hits = _binary_hits(labels, [RELEVANCE_EXACT]) 138 + exact_hits = _binary_hits(labels, [RELEVANCE_LV3])
139 strong_hits = _binary_hits(labels, RELEVANCE_STRONG) 139 strong_hits = _binary_hits(labels, RELEVANCE_STRONG)
140 useful_hits = _binary_hits(labels, RELEVANCE_NON_IRRELEVANT) 140 useful_hits = _binary_hits(labels, RELEVANCE_NON_IRRELEVANT)
141 141
@@ -183,8 +183,8 @@ def aggregate_metrics(metric_items: Sequence[Dict[str, float]]) -&gt; Dict[str, flo @@ -183,8 +183,8 @@ def aggregate_metrics(metric_items: Sequence[Dict[str, float]]) -&gt; Dict[str, flo
183 183
184 def label_distribution(labels: Sequence[str]) -> Dict[str, int]: 184 def label_distribution(labels: Sequence[str]) -> Dict[str, int]:
185 return { 185 return {
186 - RELEVANCE_EXACT: sum(1 for label in labels if label == RELEVANCE_EXACT),  
187 - RELEVANCE_HIGH: sum(1 for label in labels if label == RELEVANCE_HIGH),  
188 - RELEVANCE_LOW: sum(1 for label in labels if label == RELEVANCE_LOW),  
189 - RELEVANCE_IRRELEVANT: sum(1 for label in labels if label == RELEVANCE_IRRELEVANT), 186 + RELEVANCE_LV3: sum(1 for label in labels if label == RELEVANCE_LV3),
  187 + RELEVANCE_LV2: sum(1 for label in labels if label == RELEVANCE_LV2),
  188 + RELEVANCE_LV1: sum(1 for label in labels if label == RELEVANCE_LV1),
  189 + RELEVANCE_LV0: sum(1 for label in labels if label == RELEVANCE_LV0),
190 } 190 }
scripts/evaluation/eval_framework/reports.py
@@ -4,7 +4,7 @@ from __future__ import annotations @@ -4,7 +4,7 @@ from __future__ import annotations
4 4
5 from typing import Any, Dict 5 from typing import Any, Dict
6 6
7 -from .constants import RELEVANCE_EXACT, RELEVANCE_HIGH, RELEVANCE_IRRELEVANT, RELEVANCE_LOW 7 +from .constants import RELEVANCE_GAIN_MAP, RELEVANCE_LV0, RELEVANCE_LV1, RELEVANCE_LV2, RELEVANCE_LV3
8 from .metrics import PRIMARY_METRIC_KEYS 8 from .metrics import PRIMARY_METRIC_KEYS
9 9
10 10
@@ -25,6 +25,38 @@ def _append_metric_block(lines: list[str], metrics: Dict[str, Any]) -&gt; None: @@ -25,6 +25,38 @@ def _append_metric_block(lines: list[str], metrics: Dict[str, Any]) -&gt; None:
25 lines.append(f"- {key}: {value}") 25 lines.append(f"- {key}: {value}")
26 26
27 27
  28 +def _label_level_code(label: str) -> str:
  29 + grade = RELEVANCE_GAIN_MAP.get(label)
  30 + return f"L{grade}" if grade is not None else "?"
  31 +
  32 +
  33 +def _append_case_snapshot(lines: list[str], item: Dict[str, Any]) -> None:
  34 + request_id = str(item.get("request_id") or "").strip()
  35 + if request_id:
  36 + lines.append(f"- Request ID: `{request_id}`")
  37 + seq10 = str(item.get("top_label_sequence_top10") or "").strip()
  38 + if seq10:
  39 + lines.append(f"- Top-10 Labels: `{seq10}`")
  40 + seq20 = str(item.get("top_label_sequence_top20") or "").strip()
  41 + if seq20 and seq20 != seq10:
  42 + lines.append(f"- Top-20 Labels: `{seq20}`")
  43 + top_results = item.get("top_results") or []
  44 + if not top_results:
  45 + return
  46 + lines.append("- Case Snapshot:")
  47 + for result in top_results[:5]:
  48 + rank = int(result.get("rank") or 0)
  49 + label = _label_level_code(str(result.get("label") or ""))
  50 + spu_id = str(result.get("spu_id") or "")
  51 + title = str(result.get("title") or "")
  52 + title_zh = str(result.get("title_zh") or "")
  53 + relevance_score = result.get("relevance_score")
  54 + score_suffix = f" (rel={relevance_score})" if relevance_score not in (None, "") else ""
  55 + lines.append(f" - #{rank} [{label}] spu={spu_id} {title}{score_suffix}")
  56 + if title_zh:
  57 + lines.append(f" zh: {title_zh}")
  58 +
  59 +
28 def render_batch_report_markdown(payload: Dict[str, Any]) -> str: 60 def render_batch_report_markdown(payload: Dict[str, Any]) -> str:
29 lines = [ 61 lines = [
30 "# Search Batch Evaluation", 62 "# Search Batch Evaluation",
@@ -56,10 +88,10 @@ def render_batch_report_markdown(payload: Dict[str, Any]) -&gt; str: @@ -56,10 +88,10 @@ def render_batch_report_markdown(payload: Dict[str, Any]) -&gt; str:
56 "", 88 "",
57 "## Label Distribution", 89 "## Label Distribution",
58 "", 90 "",
59 - f"- Fully Relevant: {distribution.get(RELEVANCE_EXACT, 0)}",  
60 - f"- Mostly Relevant: {distribution.get(RELEVANCE_HIGH, 0)}",  
61 - f"- Weakly Relevant: {distribution.get(RELEVANCE_LOW, 0)}",  
62 - f"- Irrelevant: {distribution.get(RELEVANCE_IRRELEVANT, 0)}", 91 + f"- Fully Relevant: {distribution.get(RELEVANCE_LV3, 0)}",
  92 + f"- Mostly Relevant: {distribution.get(RELEVANCE_LV2, 0)}",
  93 + f"- Weakly Relevant: {distribution.get(RELEVANCE_LV1, 0)}",
  94 + f"- Irrelevant: {distribution.get(RELEVANCE_LV0, 0)}",
63 ] 95 ]
64 ) 96 )
65 lines.extend(["", "## Per Query", ""]) 97 lines.extend(["", "## Per Query", ""])
@@ -68,9 +100,10 @@ def render_batch_report_markdown(payload: Dict[str, Any]) -> str: @@ -68,9 +100,10 @@ def render_batch_report_markdown(payload: Dict[str, Any]) -> str:
68 lines.append("") 100 lines.append("")
69 _append_metric_block(lines, item.get("metrics") or {}) 101 _append_metric_block(lines, item.get("metrics") or {})
70 distribution = item.get("distribution") or {} 102 distribution = item.get("distribution") or {}
71 - lines.append(f"- Fully Relevant: {distribution.get(RELEVANCE_EXACT, 0)}")  
72 - lines.append(f"- Mostly Relevant: {distribution.get(RELEVANCE_HIGH, 0)}")  
73 - lines.append(f"- Weakly Relevant: {distribution.get(RELEVANCE_LOW, 0)}")  
74 - lines.append(f"- Irrelevant: {distribution.get(RELEVANCE_IRRELEVANT, 0)}") 103 + lines.append(f"- Fully Relevant: {distribution.get(RELEVANCE_LV3, 0)}")
  104 + lines.append(f"- Mostly Relevant: {distribution.get(RELEVANCE_LV2, 0)}")
  105 + lines.append(f"- Weakly Relevant: {distribution.get(RELEVANCE_LV1, 0)}")
  106 + lines.append(f"- Irrelevant: {distribution.get(RELEVANCE_LV0, 0)}")
  107 + _append_case_snapshot(lines, item)
75 lines.append("") 108 lines.append("")
76 return "\n".join(lines) 109 return "\n".join(lines)
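Since the diff only shows the new helper out of context, here is a minimal, self-contained sketch of how `_append_case_snapshot` is expected to render a per-query snapshot. The `_label_level_code` stand-in and the item shape are assumptions for illustration, not the repo's actual implementation:

```python
from typing import Any, Dict, List

def _label_level_code(label: str) -> str:
    # Hypothetical stand-in for the real helper: map verbose labels to LV codes.
    return {"fully": "LV3", "mostly": "LV2", "weakly": "LV1", "irrelevant": "LV0"}.get(label, label)

def append_case_snapshot(lines: List[str], item: Dict[str, Any]) -> None:
    # Mirrors the hunk: emit the request id, then up to five ranked results.
    request_id = str(item.get("request_id") or "").strip()
    if request_id:
        lines.append(f"- Request ID: `{request_id}`")
    for result in (item.get("top_results") or [])[:5]:
        rank = int(result.get("rank") or 0)
        label = _label_level_code(str(result.get("label") or ""))
        lines.append(f"  - #{rank} [{label}] spu={result.get('spu_id', '')} {result.get('title', '')}")

lines: List[str] = []
append_case_snapshot(lines, {
    "request_id": "req-1",
    "top_results": [{"rank": 1, "label": "fully", "spu_id": "42", "title": "RC car"}],
})
# lines → ["- Request ID: `req-1`", "  - #1 [LV3] spu=42 RC car"]
```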
scripts/evaluation/eval_framework/static/eval_web.js
@@ -190,7 +190,7 @@ async function loadQueries() { @@ -190,7 +190,7 @@ async function loadQueries() {
190 190
191 function historySummaryHtml(meta) { 191 function historySummaryHtml(meta) {
192 const m = meta && meta.aggregate_metrics; 192 const m = meta && meta.aggregate_metrics;
193 - const nq = (meta && meta.queries && meta.queries.length) || (meta && meta.per_query && meta.per_query.length) || null; 193 + const nq = (meta && meta.query_count) || (meta && meta.queries && meta.queries.length) || (meta && meta.per_query && meta.per_query.length) || null;
194 const parts = []; 194 const parts = [];
195 if (nq != null) parts.push(`<span>Queries</span> ${nq}`); 195 if (nq != null) parts.push(`<span>Queries</span> ${nq}`);
196 if (m && m["Primary_Metric_Score"] != null) parts.push(`<span>Primary</span> ${fmtNumber(m["Primary_Metric_Score"])}`); 196 if (m && m["Primary_Metric_Score"] != null) parts.push(`<span>Primary</span> ${fmtNumber(m["Primary_Metric_Score"])}`);
scripts/evaluation/eval_framework/store.py
@@ -23,6 +23,18 @@ class QueryBuildResult: @@ -23,6 +23,18 @@ class QueryBuildResult:
23 output_json_path: Path 23 output_json_path: Path
24 24
25 25
  26 +def _compact_batch_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
  27 +    return {
  28 +        "batch_id": metadata.get("batch_id"),
  29 +        "created_at": metadata.get("created_at"),
  30 +        "tenant_id": metadata.get("tenant_id"),
  31 +        "top_k": metadata.get("top_k"),
  32 +        "query_count": len(metadata.get("queries") or []),
  33 +        "aggregate_metrics": dict(metadata.get("aggregate_metrics") or {}),
  34 +        "metric_context": dict(metadata.get("metric_context") or {}),
  35 +    }
  36 +
  37 +
26 class EvalStore: 38 class EvalStore:
27 def __init__(self, db_path: Path): 39 def __init__(self, db_path: Path):
28 self.db_path = db_path 40 self.db_path = db_path
@@ -339,6 +351,7 @@ class EvalStore: @@ -339,6 +351,7 @@ class EvalStore:
339 ).fetchall() 351 ).fetchall()
340 items: List[Dict[str, Any]] = [] 352 items: List[Dict[str, Any]] = []
341 for row in rows: 353 for row in rows:
  354 +            metadata = json.loads(row["metadata_json"])
342 items.append( 355 items.append(
343 { 356 {
344 "batch_id": row["batch_id"], 357 "batch_id": row["batch_id"],
@@ -346,7 +359,7 @@ class EvalStore: @@ -346,7 +359,7 @@ class EvalStore:
346 "output_json_path": row["output_json_path"], 359 "output_json_path": row["output_json_path"],
347 "report_markdown_path": row["report_markdown_path"], 360 "report_markdown_path": row["report_markdown_path"],
348 "config_snapshot_path": row["config_snapshot_path"], 361 "config_snapshot_path": row["config_snapshot_path"],
349 - "metadata": json.loads(row["metadata_json"]), 362 + "metadata": _compact_batch_metadata(metadata),
350 "created_at": row["created_at"], 363 "created_at": row["created_at"],
351 } 364 }
352 ) 365 )
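The effect of the compaction is that history listings no longer carry the heavy per-query payload. A reduced sketch, assuming `metadata` is the full batch record shown in the hunk (field subset chosen for brevity):

```python
from typing import Any, Dict

def compact_batch_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    # Keep only summary fields for history listings; replace the per-query
    # list with its length so the UI can still show a query count.
    return {
        "batch_id": metadata.get("batch_id"),
        "query_count": len(metadata.get("queries") or []),
        "aggregate_metrics": dict(metadata.get("aggregate_metrics") or {}),
    }

full = {
    "batch_id": "b1",
    "queries": [{"q": "遥控车"}, {"q": "玩具车"}],
    "aggregate_metrics": {"NDCG@10": 0.7},
}
compact = compact_batch_metadata(full)
```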
scripts/evaluation/offline_ltr_fit.py
@@ -23,11 +23,11 @@ if str(PROJECT_ROOT) not in sys.path: @@ -23,11 +23,11 @@ if str(PROJECT_ROOT) not in sys.path:
23 23
24 from scripts.evaluation.eval_framework.constants import ( 24 from scripts.evaluation.eval_framework.constants import (
25 DEFAULT_ARTIFACT_ROOT, 25 DEFAULT_ARTIFACT_ROOT,
26 - RELEVANCE_EXACT,  
27 RELEVANCE_GRADE_MAP, 26 RELEVANCE_GRADE_MAP,
28 - RELEVANCE_HIGH,  
29 - RELEVANCE_IRRELEVANT,  
30 - RELEVANCE_LOW, 27 + RELEVANCE_LV0,
  28 + RELEVANCE_LV1,
  29 + RELEVANCE_LV2,
  30 + RELEVANCE_LV3,
31 ) 31 )
32 from scripts.evaluation.eval_framework.metrics import aggregate_metrics, compute_query_metrics 32 from scripts.evaluation.eval_framework.metrics import aggregate_metrics, compute_query_metrics
33 from scripts.evaluation.eval_framework.store import EvalStore 33 from scripts.evaluation.eval_framework.store import EvalStore
@@ -35,10 +35,10 @@ from scripts.evaluation.eval_framework.utils import ensure_dir, utc_timestamp @@ -35,10 +35,10 @@ from scripts.evaluation.eval_framework.utils import ensure_dir, utc_timestamp
35 35
36 36
37 LABELS_BY_GRADE = { 37 LABELS_BY_GRADE = {
38 - 3: RELEVANCE_EXACT,  
39 - 2: RELEVANCE_HIGH,  
40 - 1: RELEVANCE_LOW,  
41 - 0: RELEVANCE_IRRELEVANT, 38 + 3: RELEVANCE_LV3,
  39 + 2: RELEVANCE_LV2,
  40 + 1: RELEVANCE_LV1,
  41 + 0: RELEVANCE_LV0,
42 } 42 }
43 43
44 44
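The rename keeps the numeric grades stable while switching label constants to the LV naming. A small sketch of the grade/label round trip (the string values here are placeholders; the real values live in `constants.py`):

```python
# Hypothetical constant values standing in for the renamed LV0..LV3 labels.
RELEVANCE_LV3, RELEVANCE_LV2, RELEVANCE_LV1, RELEVANCE_LV0 = "lv3", "lv2", "lv1", "lv0"

LABELS_BY_GRADE = {
    3: RELEVANCE_LV3,
    2: RELEVANCE_LV2,
    1: RELEVANCE_LV1,
    0: RELEVANCE_LV0,
}
# Inverse map, analogous to what RELEVANCE_GRADE_MAP presumably provides.
GRADE_BY_LABEL = {label: grade for grade, label in LABELS_BY_GRADE.items()}
```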
scripts/frontend/frontend_server.py 0 → 100755
@@ -0,0 +1,278 @@ @@ -0,0 +1,278 @@
  1 +#!/usr/bin/env python3
  2 +"""
  3 +Simple HTTP server for saas-search frontend.
  4 +"""
  5 +
  6 +import http.server
  7 +import socketserver
  8 +import os
  9 +import sys
  10 +import logging
  11 +import time
  12 +import urllib.request
  13 +import urllib.error
  14 +from collections import defaultdict, deque
  15 +from pathlib import Path
  16 +from dotenv import load_dotenv
  17 +
  18 +# Load .env file
  19 +project_root = Path(__file__).resolve().parents[2]
  20 +load_dotenv(project_root / '.env')
  21 +
  22 +# Get API_BASE_URL from the environment (not injected by default, so a stale .env cannot override the same-origin policy).
  23 +# window.API_BASE_URL is injected only when FRONTEND_INJECT_API_BASE_URL=1 is set explicitly.
  24 +API_BASE_URL = os.getenv('API_BASE_URL') or None
  25 +INJECT_API_BASE_URL = os.getenv('FRONTEND_INJECT_API_BASE_URL', '0') == '1'
  26 +# Backend proxy target for same-origin API forwarding
  27 +BACKEND_PROXY_URL = os.getenv('BACKEND_PROXY_URL', 'http://127.0.0.1:6002').rstrip('/')
  28 +
  29 +# Change to frontend directory
  30 +frontend_dir = os.path.join(project_root, 'frontend')
  31 +os.chdir(frontend_dir)
  32 +
  33 +# FRONTEND_PORT is the canonical config; keep PORT as a secondary fallback.
  34 +PORT = int(os.getenv('FRONTEND_PORT', os.getenv('PORT', 6003)))
  35 +
  36 +# Configure logging to suppress scanner noise
  37 +logging.basicConfig(level=logging.ERROR, format='%(asctime)s - %(levelname)s - %(message)s')
  38 +
  39 +class RateLimitingMixin:
  40 +    """Mixin for rate limiting requests by IP address."""
  41 +    request_counts = defaultdict(deque)
  42 +    rate_limit = 100  # requests per minute
  43 +    window = 60  # seconds
  44 +
  45 +    @classmethod
  46 +    def is_rate_limited(cls, ip):
  47 +        now = time.time()
  48 +
  49 +        # Clean old requests
  50 +        while cls.request_counts[ip] and cls.request_counts[ip][0] < now - cls.window:
  51 +            cls.request_counts[ip].popleft()
  52 +
  53 +        # Check rate limit
  54 +        if len(cls.request_counts[ip]) > cls.rate_limit:
  55 +            return True
  56 +
  57 +        cls.request_counts[ip].append(now)
  58 +        return False
  59 +
  60 +class MyHTTPRequestHandler(http.server.SimpleHTTPRequestHandler, RateLimitingMixin):
  61 +    """Custom request handler with CORS support and robust error handling."""
  62 +
  63 +    _ALLOWED_CORS_HEADERS = "Content-Type, X-Tenant-ID, X-Request-ID, Referer"
  64 +
  65 +    def _is_proxy_path(self, path: str) -> bool:
  66 +        """Return True for API paths that should be forwarded to backend service."""
  67 +        return path.startswith('/search/') or path.startswith('/admin/') or path.startswith('/indexer/')
  68 +
  69 +    def _proxy_to_backend(self):
  70 +        """Proxy current request to backend service on the GPU server."""
  71 +        target_url = f"{BACKEND_PROXY_URL}{self.path}"
  72 +        method = self.command.upper()
  73 +
  74 +        try:
  75 +            content_length = int(self.headers.get('Content-Length', '0'))
  76 +        except ValueError:
  77 +            content_length = 0
  78 +        body = self.rfile.read(content_length) if content_length > 0 else None
  79 +
  80 +        forward_headers = {}
  81 +        for key, value in self.headers.items():
  82 +            lk = key.lower()
  83 +            if lk in ('host', 'content-length', 'connection'):
  84 +                continue
  85 +            forward_headers[key] = value
  86 +
  87 +        req = urllib.request.Request(
  88 +            target_url,
  89 +            data=body,
  90 +            headers=forward_headers,
  91 +            method=method,
  92 +        )
  93 +
  94 +        try:
  95 +            with urllib.request.urlopen(req, timeout=30) as resp:
  96 +                resp_body = resp.read()
  97 +                self.send_response(resp.getcode())
  98 +                for header, value in resp.getheaders():
  99 +                    lh = header.lower()
  100 +                    if lh in ('transfer-encoding', 'connection', 'content-length'):
  101 +                        continue
  102 +                    self.send_header(header, value)
  103 +                self.end_headers()
  104 +                self.wfile.write(resp_body)
  105 +        except urllib.error.HTTPError as e:
  106 +            err_body = e.read() if hasattr(e, 'read') else b''
  107 +            self.send_response(e.code)
  108 +            if e.headers:
  109 +                for header, value in e.headers.items():
  110 +                    lh = header.lower()
  111 +                    if lh in ('transfer-encoding', 'connection', 'content-length'):
  112 +                        continue
  113 +                    self.send_header(header, value)
  114 +            self.end_headers()
  115 +            if err_body:
  116 +                self.wfile.write(err_body)
  117 +        except Exception as e:
  118 +            logging.error(f"Backend proxy error for {method} {self.path}: {e}")
  119 +            self.send_response(502)
  120 +            self.send_header('Content-Type', 'application/json; charset=utf-8')
  121 +            self.end_headers()
  122 +            self.wfile.write(b'{"error":"Bad Gateway: backend proxy failed"}')
  123 +
  124 +    def do_GET(self):
  125 +        """Handle GET requests with API config injection."""
  126 +        path = self.path.split('?')[0]
  127 +
  128 +        # Proxy API paths to backend first
  129 +        if self._is_proxy_path(path):
  130 +            self._proxy_to_backend()
  131 +            return
  132 +
  133 +        # Route / to index.html
  134 +        if path == '/' or path == '':
  135 +            self.path = '/index.html' + (self.path.split('?', 1)[1] if '?' in self.path else '')
  136 +
  137 +        # Inject API config for HTML files
  138 +        if self.path.endswith('.html'):
  139 +            self._serve_html_with_config()
  140 +        else:
  141 +            super().do_GET()
  142 +
  143 +    def _serve_html_with_config(self):
  144 +        """Serve HTML with optional API_BASE_URL injected."""
  145 +        try:
  146 +            file_path = self.path.lstrip('/')
  147 +            if not os.path.exists(file_path):
  148 +                self.send_error(404)
  149 +                return
  150 +
  151 +            with open(file_path, 'r', encoding='utf-8') as f:
  152 +                html = f.read()
  153 +
  154 +            # API_BASE_URL is not injected by default, so a legacy .env (e.g. http://xx:6002) cannot override same-origin calls.
  155 +            # Inject only when FRONTEND_INJECT_API_BASE_URL=1 and API_BASE_URL is set.
  156 +            if INJECT_API_BASE_URL and API_BASE_URL:
  157 +                config_script = f'<script>window.API_BASE_URL="{API_BASE_URL}";</script>\n    '
  158 +                html = html.replace('<script src="/static/js/app.js', config_script + '<script src="/static/js/app.js', 1)
  159 +
  160 +            self.send_response(200)
  161 +            self.send_header('Content-Type', 'text/html; charset=utf-8')
  162 +            self.end_headers()
  163 +            self.wfile.write(html.encode('utf-8'))
  164 +        except Exception as e:
  165 +            logging.error(f"Error serving HTML: {e}")
  166 +            self.send_error(500)
  167 +
  168 +    def do_POST(self):
  169 +        """Handle POST requests. Proxy API requests to backend."""
  170 +        path = self.path.split('?')[0]
  171 +        if self._is_proxy_path(path):
  172 +            self._proxy_to_backend()
  173 +            return
  174 +        self.send_error(405, "Method Not Allowed")
  175 +
  176 +    def setup(self):
  177 +        """Setup with error handling."""
  178 +        try:
  179 +            super().setup()
  180 +        except Exception:
  181 +            pass  # Silently handle setup errors from scanners
  182 +
  183 +    def handle_one_request(self):
  184 +        """Handle single request with error catching."""
  185 +        try:
  186 +            # Check rate limiting
  187 +            client_ip = self.client_address[0]
  188 +            if self.is_rate_limited(client_ip):
  189 +                logging.warning(f"Rate limiting IP: {client_ip}")
  190 +                self.send_error(429, "Too Many Requests")
  191 +                return
  192 +
  193 +            super().handle_one_request()
  194 +        except (ConnectionResetError, BrokenPipeError):
  195 +            # Client disconnected prematurely - common with scanners
  196 +            pass
  197 +        except UnicodeDecodeError:
  198 +            # Binary data received - not HTTP
  199 +            pass
  200 +        except Exception as e:
  201 +            # Log unexpected errors but don't crash
  202 +            logging.debug(f"Request handling error: {e}")
  203 +
  204 +    def log_message(self, format, *args):
  205 +        """Suppress logging for malformed requests from scanners."""
  206 +        message = format % args
  207 +        # Filter out scanner noise
  208 +        noise_patterns = [
  209 +            "code 400",
  210 +            "Bad request",
  211 +            "Bad request version",
  212 +            "Bad HTTP/0.9 request type",
  213 +            "Bad request syntax"
  214 +        ]
  215 +        if any(pattern in message for pattern in noise_patterns):
  216 +            return
  217 +        # Only log legitimate requests
  218 +        if message and not message.startswith(" ") and len(message) > 10:
  219 +            super().log_message(format, *args)
  220 +
  221 +    def end_headers(self):
  222 +        # Add CORS headers
  223 +        self.send_header('Access-Control-Allow-Origin', '*')
  224 +        self.send_header('Access-Control-Allow-Methods', 'GET, POST, OPTIONS')
  225 +        self.send_header('Access-Control-Allow-Headers', self._ALLOWED_CORS_HEADERS)
  226 +        # Add security headers
  227 +        self.send_header('X-Content-Type-Options', 'nosniff')
  228 +        self.send_header('X-Frame-Options', 'DENY')
  229 +        self.send_header('X-XSS-Protection', '1; mode=block')
  230 +        super().end_headers()
  231 +
  232 +    def do_OPTIONS(self):
  233 +        """Handle OPTIONS requests."""
  234 +        try:
  235 +            path = self.path.split('?')[0]
  236 +            if self._is_proxy_path(path):
  237 +                self.send_response(204)
  238 +                self.end_headers()
  239 +                return
  240 +            self.send_response(200)
  241 +            self.end_headers()
  242 +        except Exception:
  243 +            pass
  244 +
  245 +class ThreadedTCPServer(socketserver.ThreadingMixIn, socketserver.TCPServer):
  246 +    """Threaded TCP server with better error handling."""
  247 +    allow_reuse_address = True
  248 +    daemon_threads = True
  249 +
  250 +if __name__ == '__main__':
  251 +    # Check if port is already in use
  252 +    import socket
  253 +    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  254 +    try:
  255 +        sock.bind(("", PORT))
  256 +        sock.close()
  257 +    except OSError:
  258 +        print(f"ERROR: Port {PORT} is already in use.")
  259 +        print(f"Please stop the existing server or use a different port.")
  260 +        print(f"To stop existing server: kill $(lsof -t -i:{PORT})")
  261 +        sys.exit(1)
  262 +
  263 +    # Create threaded server for better concurrency
  264 +    with ThreadedTCPServer(("", PORT), MyHTTPRequestHandler) as httpd:
  265 +        print(f"Frontend server started at http://localhost:{PORT}")
  266 +        print(f"Serving files from: {os.getcwd()}")
  267 +        print("\nPress Ctrl+C to stop the server")
  268 +
  269 +        try:
  270 +            httpd.serve_forever()
  271 +        except KeyboardInterrupt:
  272 +            print("\nShutting down server...")
  273 +            httpd.shutdown()
  274 +            print("Server stopped")
  275 +            sys.exit(0)
  276 +        except Exception as e:
  277 +            print(f"Server error: {e}")
  278 +            sys.exit(1)
scripts/frontend_server.py 100755 → 100644
1 #!/usr/bin/env python3 1 #!/usr/bin/env python3
2 -"""  
3 -Simple HTTP server for saas-search frontend.  
4 -""" 2 +"""Backward-compatible frontend server entrypoint."""
5 3
6 -import http.server  
7 -import socketserver  
8 -import os  
9 -import sys  
10 -import logging  
11 -import time  
12 -import urllib.request  
13 -import urllib.error  
14 -from collections import defaultdict, deque  
15 -from pathlib import Path  
16 -from dotenv import load_dotenv  
17 -  
18 -# Load .env file  
19 -project_root = Path(__file__).parent.parent  
20 -load_dotenv(project_root / '.env')  
21 -  
22 -# Get API_BASE_URL from the environment (not injected by default, so a stale .env cannot override the same-origin policy).  
23 -# window.API_BASE_URL is injected only when FRONTEND_INJECT_API_BASE_URL=1 is set explicitly.  
24 -API_BASE_URL = os.getenv('API_BASE_URL') or None  
25 -INJECT_API_BASE_URL = os.getenv('FRONTEND_INJECT_API_BASE_URL', '0') == '1'  
26 -# Backend proxy target for same-origin API forwarding  
27 -BACKEND_PROXY_URL = os.getenv('BACKEND_PROXY_URL', 'http://127.0.0.1:6002').rstrip('/')  
28 -  
29 -# Change to frontend directory  
30 -frontend_dir = os.path.join(os.path.dirname(__file__), '../frontend')  
31 -os.chdir(frontend_dir)  
32 -  
33 -# FRONTEND_PORT is the canonical config; keep PORT as a secondary fallback.  
34 -PORT = int(os.getenv('FRONTEND_PORT', os.getenv('PORT', 6003)))  
35 -  
36 -# Configure logging to suppress scanner noise  
37 -logging.basicConfig(level=logging.ERROR, format='%(asctime)s - %(levelname)s - %(message)s')  
38 -  
39 -class RateLimitingMixin:  
40 - """Mixin for rate limiting requests by IP address."""  
41 - request_counts = defaultdict(deque)  
42 - rate_limit = 100 # requests per minute  
43 - window = 60 # seconds  
44 -  
45 - @classmethod  
46 - def is_rate_limited(cls, ip):  
47 - now = time.time()  
48 -  
49 - # Clean old requests  
50 - while cls.request_counts[ip] and cls.request_counts[ip][0] < now - cls.window:  
51 - cls.request_counts[ip].popleft()  
52 -  
53 - # Check rate limit  
54 - if len(cls.request_counts[ip]) > cls.rate_limit:  
55 - return True  
56 -  
57 - cls.request_counts[ip].append(now)  
58 - return False  
59 -  
60 -class MyHTTPRequestHandler(http.server.SimpleHTTPRequestHandler, RateLimitingMixin):  
61 - """Custom request handler with CORS support and robust error handling."""  
62 -  
63 - def _is_proxy_path(self, path: str) -> bool:  
64 - """Return True for API paths that should be forwarded to backend service."""  
65 - return path.startswith('/search/') or path.startswith('/admin/') or path.startswith('/indexer/')  
66 -  
67 - def _proxy_to_backend(self):  
68 - """Proxy current request to backend service on the GPU server."""  
69 - target_url = f"{BACKEND_PROXY_URL}{self.path}"  
70 - method = self.command.upper()  
71 -  
72 - try:  
73 - content_length = int(self.headers.get('Content-Length', '0'))  
74 - except ValueError:  
75 - content_length = 0  
76 - body = self.rfile.read(content_length) if content_length > 0 else None 4 +from __future__ import annotations
77 5
78 - forward_headers = {}  
79 - for key, value in self.headers.items():  
80 - lk = key.lower()  
81 - if lk in ('host', 'content-length', 'connection'):  
82 - continue  
83 - forward_headers[key] = value  
84 -  
85 - req = urllib.request.Request(  
86 - target_url,  
87 - data=body,  
88 - headers=forward_headers,  
89 - method=method,  
90 - )  
91 -  
92 - try:  
93 - with urllib.request.urlopen(req, timeout=30) as resp:  
94 - resp_body = resp.read()  
95 - self.send_response(resp.getcode())  
96 - for header, value in resp.getheaders():  
97 - lh = header.lower()  
98 - if lh in ('transfer-encoding', 'connection', 'content-length'):  
99 - continue  
100 - self.send_header(header, value)  
101 - self.end_headers()  
102 - self.wfile.write(resp_body)  
103 - except urllib.error.HTTPError as e:  
104 - err_body = e.read() if hasattr(e, 'read') else b''  
105 - self.send_response(e.code)  
106 - if e.headers:  
107 - for header, value in e.headers.items():  
108 - lh = header.lower()  
109 - if lh in ('transfer-encoding', 'connection', 'content-length'):  
110 - continue  
111 - self.send_header(header, value)  
112 - self.end_headers()  
113 - if err_body:  
114 - self.wfile.write(err_body)  
115 - except Exception as e:  
116 - logging.error(f"Backend proxy error for {method} {self.path}: {e}")  
117 - self.send_response(502)  
118 - self.send_header('Content-Type', 'application/json; charset=utf-8')  
119 - self.end_headers()  
120 - self.wfile.write(b'{"error":"Bad Gateway: backend proxy failed"}')  
121 -  
122 - def do_GET(self):  
123 - """Handle GET requests with API config injection."""  
124 - path = self.path.split('?')[0]  
125 -  
126 - # Proxy API paths to backend first  
127 - if self._is_proxy_path(path):  
128 - self._proxy_to_backend()  
129 - return  
130 -  
131 - # Route / to index.html  
132 - if path == '/' or path == '':  
133 - self.path = '/index.html' + (self.path.split('?', 1)[1] if '?' in self.path else '')  
134 -  
135 - # Inject API config for HTML files  
136 - if self.path.endswith('.html'):  
137 - self._serve_html_with_config()  
138 - else:  
139 - super().do_GET()  
140 -  
141 - def _serve_html_with_config(self):  
142 - """Serve HTML with optional API_BASE_URL injected."""  
143 - try:  
144 - file_path = self.path.lstrip('/')  
145 - if not os.path.exists(file_path):  
146 - self.send_error(404)  
147 - return  
148 -  
149 - with open(file_path, 'r', encoding='utf-8') as f:  
150 - html = f.read()  
151 -  
152 - # API_BASE_URL is not injected by default, so a legacy .env (e.g. http://xx:6002) cannot override same-origin calls.  
153 - # Inject only when FRONTEND_INJECT_API_BASE_URL=1 and API_BASE_URL is set.  
154 - if INJECT_API_BASE_URL and API_BASE_URL:  
155 - config_script = f'<script>window.API_BASE_URL="{API_BASE_URL}";</script>\n '  
156 - html = html.replace('<script src="/static/js/app.js', config_script + '<script src="/static/js/app.js', 1)  
157 -  
158 - self.send_response(200)  
159 - self.send_header('Content-Type', 'text/html; charset=utf-8')  
160 - self.end_headers()  
161 - self.wfile.write(html.encode('utf-8'))  
162 - except Exception as e:  
163 - logging.error(f"Error serving HTML: {e}")  
164 - self.send_error(500)  
165 -  
166 - def do_POST(self):  
167 - """Handle POST requests. Proxy API requests to backend."""  
168 - path = self.path.split('?')[0]  
169 - if self._is_proxy_path(path):  
170 - self._proxy_to_backend()  
171 - return  
172 - self.send_error(405, "Method Not Allowed")  
173 -  
174 - def setup(self):  
175 - """Setup with error handling."""  
176 - try:  
177 - super().setup()  
178 - except Exception:  
179 - pass # Silently handle setup errors from scanners  
180 -  
181 - def handle_one_request(self):  
182 - """Handle single request with error catching."""  
183 - try:  
184 - # Check rate limiting  
185 - client_ip = self.client_address[0]  
186 - if self.is_rate_limited(client_ip):  
187 - logging.warning(f"Rate limiting IP: {client_ip}")  
188 - self.send_error(429, "Too Many Requests")  
189 - return  
190 -  
191 - super().handle_one_request()  
192 - except (ConnectionResetError, BrokenPipeError):  
193 - # Client disconnected prematurely - common with scanners  
194 - pass  
195 - except UnicodeDecodeError:  
196 - # Binary data received - not HTTP  
197 - pass  
198 - except Exception as e:  
199 - # Log unexpected errors but don't crash  
200 - logging.debug(f"Request handling error: {e}")  
201 -  
202 - def log_message(self, format, *args):  
203 - """Suppress logging for malformed requests from scanners."""  
204 - message = format % args  
205 - # Filter out scanner noise  
206 - noise_patterns = [  
207 - "code 400",  
208 - "Bad request",  
209 - "Bad request version",  
210 - "Bad HTTP/0.9 request type",  
211 - "Bad request syntax"  
212 - ]  
213 - if any(pattern in message for pattern in noise_patterns):  
214 - return  
215 - # Only log legitimate requests  
216 - if message and not message.startswith(" ") and len(message) > 10:  
217 - super().log_message(format, *args)  
218 -  
219 - def end_headers(self):  
220 - # Add CORS headers  
221 - self.send_header('Access-Control-Allow-Origin', '*')  
222 - self.send_header('Access-Control-Allow-Methods', 'GET, POST, OPTIONS')  
223 - self.send_header('Access-Control-Allow-Headers', 'Content-Type')  
224 - # Add security headers  
225 - self.send_header('X-Content-Type-Options', 'nosniff')  
226 - self.send_header('X-Frame-Options', 'DENY')  
227 - self.send_header('X-XSS-Protection', '1; mode=block')  
228 - super().end_headers()  
229 -  
230 - def do_OPTIONS(self):  
231 - """Handle OPTIONS requests."""  
232 - try:  
233 - path = self.path.split('?')[0]  
234 - if self._is_proxy_path(path):  
235 - self.send_response(204)  
236 - self.end_headers()  
237 - return  
238 - self.send_response(200)  
239 - self.end_headers()  
240 - except Exception:  
241 - pass  
242 -  
243 -class ThreadedTCPServer(socketserver.ThreadingMixIn, socketserver.TCPServer):  
244 - """Threaded TCP server with better error handling."""  
245 - allow_reuse_address = True  
246 - daemon_threads = True 6 +import runpy
  7 +from pathlib import Path
247 8
248 -if __name__ == '__main__':  
249 - # Check if port is already in use  
250 - import socket  
251 - sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  
252 - try:  
253 - sock.bind(("", PORT))  
254 - sock.close()  
255 - except OSError:  
256 - print(f"ERROR: Port {PORT} is already in use.")  
257 - print(f"Please stop the existing server or use a different port.")  
258 - print(f"To stop existing server: kill $(lsof -t -i:{PORT})")  
259 - sys.exit(1)  
260 -  
261 - # Create threaded server for better concurrency  
262 - with ThreadedTCPServer(("", PORT), MyHTTPRequestHandler) as httpd:  
263 - print(f"Frontend server started at http://localhost:{PORT}")  
264 - print(f"Serving files from: {os.getcwd()}")  
265 - print("\nPress Ctrl+C to stop the server")  
266 9
267 - try:  
268 - httpd.serve_forever()  
269 - except KeyboardInterrupt:  
270 - print("\nShutting down server...")  
271 - httpd.shutdown()  
272 - print("Server stopped")  
273 - sys.exit(0)  
274 - except Exception as e:  
275 - print(f"Server error: {e}")  
276 - sys.exit(1) 10 +if __name__ == "__main__":
  11 +    target = Path(__file__).resolve().parent / "frontend" / "frontend_server.py"
  12 +    runpy.run_path(str(target), run_name="__main__")
scripts/inspect/README.md 0 → 100644
@@ -0,0 +1,10 @@ @@ -0,0 +1,10 @@
  1 +# Inspect Scripts
  2 +
  3 +These scripts are used for one-off diagnostics, index inspection, and data verification:
  4 +
  5 +- `check_data_source.py`
  6 +- `check_es_data.py`
  7 +- `check_index_mapping.py`
  8 +- `compare_index_mappings.py`
  9 +
  10 +They depend on a real DB / ES environment and are not part of CI tests or benchmarks.
scripts/check_data_source.py renamed to scripts/inspect/check_data_source.py
@@ -14,8 +14,8 @@ import argparse @@ -14,8 +14,8 @@ import argparse
14 from pathlib import Path 14 from pathlib import Path
15 from sqlalchemy import create_engine, text 15 from sqlalchemy import create_engine, text
16 16
17 -# Add parent directory to path  
18 -sys.path.insert(0, str(Path(__file__).parent.parent)) 17 +# Add repo root to path
  18 +sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
19 19
20 from utils.db_connector import create_db_connection 20 from utils.db_connector import create_db_connection
21 21
@@ -298,4 +298,3 @@ def main(): @@ -298,4 +298,3 @@ def main():
298 298
299 if __name__ == '__main__': 299 if __name__ == '__main__':
300 sys.exit(main()) 300 sys.exit(main())
301 -  
scripts/check_es_data.py renamed to scripts/inspect/check_es_data.py
@@ -8,7 +8,7 @@ import os @@ -8,7 +8,7 @@ import os
8 import argparse 8 import argparse
9 from pathlib import Path 9 from pathlib import Path
10 10
11 -sys.path.insert(0, str(Path(__file__).parent.parent)) 11 +sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
12 12
13 from utils.es_client import ESClient 13 from utils.es_client import ESClient
14 14
@@ -265,4 +265,3 @@ def main(): @@ -265,4 +265,3 @@ def main():
265 265
266 if __name__ == '__main__': 266 if __name__ == '__main__':
267 sys.exit(main()) 267 sys.exit(main())
268 -  
scripts/check_index_mapping.py renamed to scripts/inspect/check_index_mapping.py
@@ -8,7 +8,7 @@ import sys @@ -8,7 +8,7 @@ import sys
8 import json 8 import json
9 from pathlib import Path 9 from pathlib import Path
10 10
11 -sys.path.insert(0, str(Path(__file__).parent.parent)) 11 +sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
12 12
13 from utils.es_client import get_es_client_from_env 13 from utils.es_client import get_es_client_from_env
14 from indexer.mapping_generator import get_tenant_index_name 14 from indexer.mapping_generator import get_tenant_index_name
scripts/compare_index_mappings.py renamed to scripts/inspect/compare_index_mappings.py
@@ -9,7 +9,7 @@ import json @@ -9,7 +9,7 @@ import json
9 from pathlib import Path 9 from pathlib import Path
10 from typing import Dict, Any 10 from typing import Dict, Any
11 11
12 -sys.path.insert(0, str(Path(__file__).parent.parent)) 12 +sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
13 13
14 from utils.es_client import get_es_client_from_env 14 from utils.es_client import get_es_client_from_env
15 15
@@ -186,4 +186,3 @@ def main(): @@ -186,4 +186,3 @@ def main():
186 186
187 if __name__ == '__main__': 187 if __name__ == '__main__':
188 sys.exit(main()) 188 sys.exit(main())
189 -  
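The recurring `parents[2]` change in these renames follows from the scripts moving one directory deeper: with `scripts/<subdir>/<script>.py`, the repo root is now two levels above the script's directory. A quick illustration (the `/repo` path is hypothetical):

```python
from pathlib import Path

# After the move into scripts/inspect/, parents[0] is the script's directory,
# parents[1] is scripts/, and parents[2] is the repo root.
p = Path("/repo/scripts/inspect/check_es_data.py")
repo_root = p.parents[2]
old_style = p.parent.parent  # would now point at scripts/, not the repo root
```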
scripts/temp_embed_tenant_image_urls.py renamed to scripts/maintenance/embed_tenant_image_urls.py
@@ -5,7 +5,7 @@ @@ -5,7 +5,7 @@
5 5
6 Usage: 6 Usage:
7 source activate.sh # loads .env, providing ES_HOST / ES_USERNAME / ES_PASSWORD 7 source activate.sh # loads .env, providing ES_HOST / ES_USERNAME / ES_PASSWORD
8 - python scripts/temp_embed_tenant_image_urls.py 8 + python scripts/maintenance/embed_tenant_image_urls.py
9 9
10 If not sourced, the script will also try to load .env from the project root. 10 If not sourced, the script will also try to load .env from the project root.
11 """ 11 """
@@ -30,7 +30,7 @@ from elasticsearch.helpers import scan @@ -30,7 +30,7 @@ from elasticsearch.helpers import scan
30 try: 30 try:
31 from dotenv import load_dotenv 31 from dotenv import load_dotenv
32 32
33 - _ROOT = Path(__file__).resolve().parents[1] 33 + _ROOT = Path(__file__).resolve().parents[2]
34 load_dotenv(_ROOT / ".env") 34 load_dotenv(_ROOT / ".env")
35 except ImportError: 35 except ImportError:
36 pass 36 pass
scripts/ops/README.md 0 → 100644
@@ -0,0 +1,8 @@ @@ -0,0 +1,8 @@
  1 +# Ops Scripts
  2 +
  3 +These are helper scripts used during service orchestration:
  4 +
  5 +- `daily_log_router.sh`: rotates logs by day
  6 +- `wechat_alert.py`: sends monitoring alerts
  7 +
  8 +If other startup scripts reference these files, they should use the fixed paths here rather than copying out new duplicates of the same tools.
scripts/daily_log_router.sh renamed to scripts/ops/daily_log_router.sh
@@ -3,7 +3,7 @@ @@ -3,7 +3,7 @@
3 # Route incoming log stream into per-day files. 3 # Route incoming log stream into per-day files.
4 # 4 #
5 # Usage: 5 # Usage:
6 -# command 2>&1 | ./scripts/daily_log_router.sh <service> <log_dir> [retention_days] 6 +# command 2>&1 | ./scripts/ops/daily_log_router.sh <service> <log_dir> [retention_days]
7 # 7 #
8 8
9 set -euo pipefail 9 set -euo pipefail
scripts/wechat_alert.py renamed to scripts/ops/wechat_alert.py
@@ -6,7 +6,7 @@ This module is intentionally small and focused so that Bash-based monitors @@ -6,7 +6,7 @@ This module is intentionally small and focused so that Bash-based monitors
6 can invoke it without pulling in the full application stack. 6 can invoke it without pulling in the full application stack.
7 7
8 Usage example: 8 Usage example:
9 - python scripts/wechat_alert.py --service backend --level error --message "backend restarted" 9 + python scripts/ops/wechat_alert.py --service backend --level error --message "backend restarted"
10 """ 10 """
11 11
12 import argparse 12 import argparse
@@ -101,4 +101,3 @@ def main(argv: list[str] | None = None) -> int: @@ -101,4 +101,3 @@ def main(argv: list[str] | None = None) -> int:
101 101
102 if __name__ == "__main__": 102 if __name__ == "__main__":
103 raise SystemExit(main()) 103 raise SystemExit(main())
104 -  
scripts/monitor_eviction.py renamed to scripts/redis/monitor_eviction.py
@@ -12,7 +12,7 @@ from pathlib import Path @@ -12,7 +12,7 @@ from pathlib import Path
12 from datetime import datetime 12 from datetime import datetime
13 13
14 # Add project root to sys.path 14 # Add project root to sys.path
15 -project_root = Path(__file__).parent.parent 15 +project_root = Path(__file__).resolve().parents[2]
16 sys.path.insert(0, str(project_root)) 16 sys.path.insert(0, str(project_root))
17 17
18 from config.env_config import REDIS_CONFIG 18 from config.env_config import REDIS_CONFIG
scripts/service_ctl.sh
@@ -20,6 +20,7 @@ CORE_SERVICES=("backend" "indexer" "frontend" "eval-web") @@ -20,6 +20,7 @@ CORE_SERVICES=("backend" "indexer" "frontend" "eval-web")
20 OPTIONAL_SERVICES=("tei" "cnclip" "embedding" "embedding-image" "translator" "reranker") 20 OPTIONAL_SERVICES=("tei" "cnclip" "embedding" "embedding-image" "translator" "reranker")
21 FULL_SERVICES=("${OPTIONAL_SERVICES[@]}" "${CORE_SERVICES[@]}") 21 FULL_SERVICES=("${OPTIONAL_SERVICES[@]}" "${CORE_SERVICES[@]}")
22 STOP_ORDER_SERVICES=("frontend" "eval-web" "indexer" "backend" "reranker" "translator" "embedding-image" "embedding" "cnclip" "tei") 22 STOP_ORDER_SERVICES=("frontend" "eval-web" "indexer" "backend" "reranker" "translator" "embedding-image" "embedding" "cnclip" "tei")
  23 +declare -Ag SERVICE_ENABLED_CACHE=()
23 24
24 all_services() { 25 all_services() {
25 echo "${FULL_SERVICES[@]}" 26 echo "${FULL_SERVICES[@]}"
@@ -33,6 +34,72 @@ config_python_bin() { @@ -33,6 +34,72 @@ config_python_bin() {
33 fi 34 fi
34 } 35 }
35 36
  37 +service_enabled_by_config() {
  38 + local service="$1"
  39 + case "${service}" in
  40 + reranker|reranker-fine|translator)
  41 + ;;
  42 + *)
  43 + return 0
  44 + ;;
  45 + esac
  46 +
  47 + if [ -n "${SERVICE_ENABLED_CACHE[${service}]+x}" ]; then
  48 + [ "${SERVICE_ENABLED_CACHE[${service}]}" = "1" ]
  49 + return
  50 + fi
  51 +
  52 + local pybin
  53 + pybin="$(config_python_bin)"
  54 +
  55 + local enabled
  56 + if ! enabled="$(
  57 + SERVICE_NAME="${service}" \
  58 + PYTHONPATH="${PROJECT_ROOT}${PYTHONPATH:+:${PYTHONPATH}}" \
  59 + "${pybin}" - <<'PY'
  60 +from config.loader import get_app_config
  61 +import os
  62 +
  63 +service = os.environ["SERVICE_NAME"]
  64 +cfg = get_app_config()
  65 +
  66 +enabled = True
  67 +if service == "reranker":
  68 + enabled = bool(cfg.search.rerank.enabled)
  69 +elif service == "reranker-fine":
  70 + enabled = bool(cfg.search.fine_rank.enabled)
  71 +elif service == "translator":
  72 + capabilities = dict(cfg.services.translation.capabilities or {})
  73 + enabled = any(bool((value or {}).get("enabled", True)) for value in capabilities.values())
  74 +
  75 +print("1" if enabled else "0")
  76 +PY
  77 + )"; then
  78 + echo "[warn] failed to read config state for ${service}; defaulting to enabled" >&2
  79 + enabled="1"
  80 + fi
  81 +
  82 + SERVICE_ENABLED_CACHE["${service}"]="${enabled}"
  83 + [ "${enabled}" = "1" ]
  84 +}
  85 +
  86 +filter_disabled_targets() {
  87 + local targets="$1"
  88 + local verbose="${2:-quiet}"
  89 + local out=""
  90 + local svc
  91 +
  92 + for svc in ${targets}; do
  93 + if service_enabled_by_config "${svc}"; then
  94 + out="${out} ${svc}"
  95 + elif [ "${verbose}" = "verbose" ]; then
  96 + echo "[skip] ${svc} disabled by config" >&2
  97 + fi
  98 + done
  99 +
  100 + echo "${out# }"
  101 +}
  102 +
36 reranker_instance_for_service() { 103 reranker_instance_for_service() {
37 local service="$1" 104 local service="$1"
38 case "${service}" in 105 case "${service}" in
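The config gate that the heredoc above embeds can be isolated as a standalone sketch. The dict below stands in for `get_app_config()` (mock data, not the real config schema), but the decision rules mirror the diff: only `reranker`, `reranker-fine`, and `translator` are config-gated, and the translator counts as enabled if any capability is enabled.

```python
# Standalone sketch of the enable/disable gate from the heredoc above.
# `cfg` is a plain dict used as mock configuration, not the real loader output.
def service_enabled(service: str, cfg: dict) -> bool:
    if service == "reranker":
        return bool(cfg["search"]["rerank"]["enabled"])
    if service == "reranker-fine":
        return bool(cfg["search"]["fine_rank"]["enabled"])
    if service == "translator":
        capabilities = cfg["services"]["translation"].get("capabilities") or {}
        # Enabled if at least one capability is enabled (default True per entry).
        return any(bool((v or {}).get("enabled", True)) for v in capabilities.values())
    return True  # every other service is always enabled

cfg = {
    "search": {"rerank": {"enabled": True}, "fine_rank": {"enabled": False}},
    "services": {"translation": {"capabilities": {"zh-en": {"enabled": False}}}},
}
print(service_enabled("reranker", cfg))       # True
print(service_enabled("reranker-fine", cfg))  # False
print(service_enabled("translator", cfg))     # False
print(service_enabled("backend", cfg))        # True
```

The shell side caches this answer per service in `SERVICE_ENABLED_CACHE` so the Python interpreter is spawned at most once per gated service per invocation.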
@@ -334,7 +401,7 @@ monitor_services() { @@ -334,7 +401,7 @@ monitor_services() {
334 local fail_threshold="${MONITOR_FAIL_THRESHOLD:-3}" 401 local fail_threshold="${MONITOR_FAIL_THRESHOLD:-3}"
335 local restart_cooldown_sec="${MONITOR_RESTART_COOLDOWN_SEC:-30}" 402 local restart_cooldown_sec="${MONITOR_RESTART_COOLDOWN_SEC:-30}"
336 local max_restarts_per_hour="${MONITOR_MAX_RESTARTS_PER_HOUR:-6}" 403 local max_restarts_per_hour="${MONITOR_MAX_RESTARTS_PER_HOUR:-6}"
337 - local wechat_alert_py="${PROJECT_ROOT}/scripts/wechat_alert.py" 404 + local wechat_alert_py="${PROJECT_ROOT}/scripts/ops/wechat_alert.py"
338 405
339 require_positive_int "MONITOR_INTERVAL_SEC" "${interval_sec}" 406 require_positive_int "MONITOR_INTERVAL_SEC" "${interval_sec}"
340 require_positive_int "MONITOR_FAIL_THRESHOLD" "${fail_threshold}" 407 require_positive_int "MONITOR_FAIL_THRESHOLD" "${fail_threshold}"
@@ -468,6 +535,16 @@ stop_monitor_daemon() { @@ -468,6 +535,16 @@ stop_monitor_daemon() {
468 535
469 start_monitor_daemon() { 536 start_monitor_daemon() {
470 local targets="$1" 537 local targets="$1"
  538 + if [ -z "${targets}" ]; then
  539 + if is_monitor_daemon_running; then
  540 + echo "[info] no enabled services to monitor; stopping monitor daemon"
  541 + stop_monitor_daemon
  542 + else
  543 + echo "[info] no enabled services to monitor"
  544 + fi
  545 + return 0
  546 + fi
  547 +
471 local pf 548 local pf
472 pf="$(monitor_pid_file)" 549 pf="$(monitor_pid_file)"
473 local tf 550 local tf
@@ -581,6 +658,10 @@ wait_for_startup_health() { @@ -581,6 +658,10 @@ wait_for_startup_health() {
581 start_one() { 658 start_one() {
582 local service="$1" 659 local service="$1"
583 cd "${PROJECT_ROOT}" 660 cd "${PROJECT_ROOT}"
  661 + if ! service_enabled_by_config "${service}"; then
  662 + echo "[skip] ${service} disabled by config"
  663 + return 0
  664 + fi
584 local cmd 665 local cmd
585 if ! cmd="$(service_start_cmd "${service}")"; then 666 if ! cmd="$(service_start_cmd "${service}")"; then
586 echo "[error] unknown service: ${service}" >&2 667 echo "[error] unknown service: ${service}" >&2
@@ -953,6 +1034,7 @@ main() { @@ -953,6 +1034,7 @@ main() {
953 1034
954 load_env_file "${PROJECT_ROOT}/.env" 1035 load_env_file "${PROJECT_ROOT}/.env"
955 local targets="" 1036 local targets=""
  1037 + local effective_targets=""
956 local monitor_was_running=0 1038 local monitor_was_running=0
957 local monitor_prev_targets="" 1039 local monitor_prev_targets=""
958 local auto_monitor_on_start="${SERVICE_CTL_AUTO_MONITOR_ON_START:-1}" 1040 local auto_monitor_on_start="${SERVICE_CTL_AUTO_MONITOR_ON_START:-1}"
@@ -976,12 +1058,23 @@ main() { @@ -976,12 +1058,23 @@ main() {
976 ;; 1058 ;;
977 esac 1059 esac
978 1060
  1061 + effective_targets="${targets}"
  1062 + case "${action}" in
  1063 + up|start|restart|monitor|monitor-start)
  1064 + effective_targets="$(filter_disabled_targets "${targets}" "verbose")"
  1065 + ;;
  1066 + esac
  1067 +
979 case "${action}" in 1068 case "${action}" in
980 up) 1069 up)
981 - for svc in ${targets}; do 1070 + if [ -z "${effective_targets}" ]; then
  1071 + echo "[info] no enabled services in target set"
  1072 + exit 0
  1073 + fi
  1074 + for svc in ${effective_targets}; do
982 start_one "${svc}" 1075 start_one "${svc}"
983 done 1076 done
984 - start_monitor_daemon "${targets}" 1077 + start_monitor_daemon "${effective_targets}"
985 ;; 1078 ;;
986 down) 1079 down)
987 stop_monitor_daemon 1080 stop_monitor_daemon
@@ -990,11 +1083,15 @@ main() { @@ -990,11 +1083,15 @@ main() {
990 done 1083 done
991 ;; 1084 ;;
992 start) 1085 start)
993 - for svc in ${targets}; do 1086 + if [ -z "${effective_targets}" ]; then
  1087 + echo "[info] no enabled services in target set"
  1088 + exit 0
  1089 + fi
  1090 + for svc in ${effective_targets}; do
994 start_one "${svc}" 1091 start_one "${svc}"
995 done 1092 done
996 if [ "${auto_monitor_on_start}" = "1" ]; then 1093 if [ "${auto_monitor_on_start}" = "1" ]; then
997 - start_monitor_daemon "$(merge_targets "$(monitor_current_targets)" "${targets}")" 1094 + start_monitor_daemon "$(merge_targets "$(monitor_current_targets)" "${effective_targets}")"
998 fi 1095 fi
999 ;; 1096 ;;
1000 stop) 1097 stop)
@@ -1025,16 +1122,17 @@ main() { @@ -1025,16 +1122,17 @@ main() {
1025 for svc in ${restart_stop_targets}; do 1122 for svc in ${restart_stop_targets}; do
1026 stop_one "${svc}" 1123 stop_one "${svc}"
1027 done 1124 done
1028 - for svc in ${targets}; do 1125 + for svc in ${effective_targets}; do
1029 start_one "${svc}" 1126 start_one "${svc}"
1030 done 1127 done
1031 if [ "${monitor_was_running}" -eq 1 ]; then 1128 if [ "${monitor_was_running}" -eq 1 ]; then
1032 monitor_prev_targets="$(normalize_targets "${monitor_prev_targets}")" 1129 monitor_prev_targets="$(normalize_targets "${monitor_prev_targets}")"
  1130 + monitor_prev_targets="$(filter_disabled_targets "${monitor_prev_targets}" "quiet")"
1033 monitor_prev_targets="$(apply_target_order monitor "${monitor_prev_targets}")" 1131 monitor_prev_targets="$(apply_target_order monitor "${monitor_prev_targets}")"
1034 - [ -z "${monitor_prev_targets}" ] && monitor_prev_targets="${targets}" 1132 + [ -z "${monitor_prev_targets}" ] && monitor_prev_targets="${effective_targets}"
1035 start_monitor_daemon "${monitor_prev_targets}" 1133 start_monitor_daemon "${monitor_prev_targets}"
1036 elif [ "${auto_monitor_on_start}" = "1" ]; then 1134 elif [ "${auto_monitor_on_start}" = "1" ]; then
1037 - start_monitor_daemon "$(merge_targets "$(monitor_current_targets)" "${targets}")" 1135 + start_monitor_daemon "$(merge_targets "$(monitor_current_targets)" "${effective_targets}")"
1038 fi 1136 fi
1039 ;; 1137 ;;
1040 status) 1138 status)
@@ -1044,10 +1142,14 @@ main() { @@ -1044,10 +1142,14 @@ main() {
1044 monitor_daemon_status 1142 monitor_daemon_status
1045 ;; 1143 ;;
1046 monitor) 1144 monitor)
1047 - monitor_services "${targets}" 1145 + if [ -z "${effective_targets}" ]; then
  1146 + echo "[info] no enabled services in target set"
  1147 + exit 0
  1148 + fi
  1149 + monitor_services "${effective_targets}"
1048 ;; 1150 ;;
1049 monitor-start) 1151 monitor-start)
1050 - start_monitor_daemon "${targets}" 1152 + start_monitor_daemon "${effective_targets}"
1051 ;; 1153 ;;
1052 monitor-stop) 1154 monitor-stop)
1053 stop_monitor_daemon 1155 stop_monitor_daemon
scripts/setup_translator_venv.sh
@@ -8,8 +8,47 @@ PROJECT_ROOT="$(cd "$(dirname "$0")/.." && pwd)" @@ -8,8 +8,47 @@ PROJECT_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
8 cd "${PROJECT_ROOT}" 8 cd "${PROJECT_ROOT}"
9 9
10 VENV_DIR="${PROJECT_ROOT}/.venv-translator" 10 VENV_DIR="${PROJECT_ROOT}/.venv-translator"
11 -PYTHON_BIN="${PYTHON_BIN:-python3}"  
12 TMP_DIR="${TRANSLATOR_PIP_TMPDIR:-${PROJECT_ROOT}/.tmp/translator-pip}" 11 TMP_DIR="${TRANSLATOR_PIP_TMPDIR:-${PROJECT_ROOT}/.tmp/translator-pip}"
  12 +MIN_PYTHON_MAJOR=3
  13 +MIN_PYTHON_MINOR=10
  14 +
  15 +python_meets_minimum() {
  16 + local bin="$1"
  17 + "${bin}" - <<'PY' "${MIN_PYTHON_MAJOR}" "${MIN_PYTHON_MINOR}"
  18 +import sys
  19 +
  20 +required = tuple(int(value) for value in sys.argv[1:])
  21 +sys.exit(0 if sys.version_info[:2] >= required else 1)
  22 +PY
  23 +}
  24 +
  25 +discover_python_bin() {
  26 + local candidates=()
  27 +
  28 + if [[ -n "${PYTHON_BIN:-}" ]]; then
  29 + candidates+=("${PYTHON_BIN}")
  30 + fi
  31 + candidates+=("python3.12" "python3.11" "python3.10" "python3")
  32 +
  33 + local candidate
  34 + for candidate in "${candidates[@]}"; do
  35 + if ! command -v "${candidate}" >/dev/null 2>&1; then
  36 + continue
  37 + fi
  38 + if python_meets_minimum "${candidate}"; then
  39 + echo "${candidate}"
  40 + return 0
  41 + fi
  42 + done
  43 +
  44 + return 1
  45 +}
  46 +
  47 +if ! PYTHON_BIN="$(discover_python_bin)"; then
  48 + echo "ERROR: unable to find Python >= ${MIN_PYTHON_MAJOR}.${MIN_PYTHON_MINOR}." >&2
  49 + echo "Set PYTHON_BIN to a compatible interpreter and rerun." >&2
  50 + exit 1
  51 +fi
13 52
14 if ! command -v "${PYTHON_BIN}" >/dev/null 2>&1; then 53 if ! command -v "${PYTHON_BIN}" >/dev/null 2>&1; then
15 echo "ERROR: python not found: ${PYTHON_BIN}" >&2 54 echo "ERROR: python not found: ${PYTHON_BIN}" >&2
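The heredoc in `python_meets_minimum` relies on `sys.version_info` comparing lexicographically as a tuple, which is why the floor check is a single comparison. A minimal illustration:

```python
import sys

# sys.version_info compares element-wise as a tuple, so a (major, minor)
# floor check is one comparison -- the same test the heredoc above runs.
def meets_minimum(version: tuple, required: tuple = (3, 10)) -> bool:
    return version[:2] >= required

print(meets_minimum((3, 12, 1)))  # True
print(meets_minimum((3, 9, 18)))  # False: (3, 9) < (3, 10), unlike a string compare
print(meets_minimum(sys.version_info))
```

Note the string pitfall this avoids: `"3.9" >= "3.10"` is true lexicographically, while the tuple comparison gets it right.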
@@ -32,6 +71,7 @@ mkdir -p "${TMP_DIR}" @@ -32,6 +71,7 @@ mkdir -p "${TMP_DIR}"
32 export TMPDIR="${TMP_DIR}" 71 export TMPDIR="${TMP_DIR}"
33 PIP_ARGS=(--no-cache-dir) 72 PIP_ARGS=(--no-cache-dir)
34 73
  74 +echo "Using Python=${PYTHON_BIN}"
35 echo "Using TMPDIR=${TMPDIR}" 75 echo "Using TMPDIR=${TMPDIR}"
36 "${VENV_DIR}/bin/python" -m pip install "${PIP_ARGS[@]}" --upgrade pip wheel 76 "${VENV_DIR}/bin/python" -m pip install "${PIP_ARGS[@]}" --upgrade pip wheel
37 "${VENV_DIR}/bin/python" -m pip install "${PIP_ARGS[@]}" -r requirements_translator_service.txt 77 "${VENV_DIR}/bin/python" -m pip install "${PIP_ARGS[@]}" -r requirements_translator_service.txt
@@ -39,5 +79,5 @@ echo "Using TMPDIR=${TMPDIR}" @@ -39,5 +79,5 @@ echo "Using TMPDIR=${TMPDIR}"
39 echo 79 echo
40 echo "Done." 80 echo "Done."
41 echo "Translator venv: ${VENV_DIR}" 81 echo "Translator venv: ${VENV_DIR}"
42 -echo "Download local models: ./.venv-translator/bin/python scripts/download_translation_models.py --all-local" 82 +echo "Download local models: ./.venv-translator/bin/python scripts/translation/download_translation_models.py --all-local"
43 echo "Start service: ./scripts/start_translator.sh" 83 echo "Start service: ./scripts/start_translator.sh"
scripts/start_cnclip_service.sh
@@ -61,7 +61,7 @@ LOG_DIR=&quot;${PROJECT_ROOT}/logs&quot; @@ -61,7 +61,7 @@ LOG_DIR=&quot;${PROJECT_ROOT}/logs&quot;
61 PID_FILE="${LOG_DIR}/cnclip.pid" 61 PID_FILE="${LOG_DIR}/cnclip.pid"
62 LOG_LINK="${LOG_DIR}/cnclip.log" 62 LOG_LINK="${LOG_DIR}/cnclip.log"
63 LOG_FILE="${LOG_DIR}/cnclip-$(date +%F).log" 63 LOG_FILE="${LOG_DIR}/cnclip-$(date +%F).log"
64 -LOG_ROUTER_SCRIPT="${PROJECT_ROOT}/scripts/daily_log_router.sh" 64 +LOG_ROUTER_SCRIPT="${PROJECT_ROOT}/scripts/ops/daily_log_router.sh"
65 65
66 # Help message 66 # Help message
67 show_help() { 67 show_help() {
scripts/start_frontend.sh
@@ -27,4 +27,4 @@ echo -e " ${GREEN}http://localhost:${API_PORT}${NC}" @@ -27,4 +27,4 @@ echo -e " ${GREEN}http://localhost:${API_PORT}${NC}"
27 echo "" 27 echo ""
28 28
29 export FRONTEND_PORT API_PORT PORT 29 export FRONTEND_PORT API_PORT PORT
30 -exec python scripts/frontend_server.py 30 +exec python scripts/frontend/frontend_server.py
scripts/translation/download_translation_models.py 0 → 100755
@@ -0,0 +1,100 @@ @@ -0,0 +1,100 @@
  1 +#!/usr/bin/env python3
  2 +"""Download local translation models declared in services.translation.capabilities."""
  3 +
  4 +from __future__ import annotations
  5 +
  6 +import argparse
  7 +import os
  8 +from pathlib import Path
  9 +import sys
  10 +from typing import Iterable
  11 +
  12 +from huggingface_hub import snapshot_download
  13 +
  14 +PROJECT_ROOT = Path(__file__).resolve().parents[2]
  15 +if str(PROJECT_ROOT) not in sys.path:
  16 + sys.path.insert(0, str(PROJECT_ROOT))
  17 +os.environ.setdefault("HF_HUB_DISABLE_XET", "1")
  18 +
  19 +from config.services_config import get_translation_config
  20 +from translation.ct2_conversion import convert_transformers_model
  21 +
  22 +
  23 +LOCAL_BACKENDS = {"local_nllb", "local_marian"}
  24 +
  25 +
  26 +def iter_local_capabilities(selected: set[str] | None = None) -> Iterable[tuple[str, dict]]:
  27 + cfg = get_translation_config()
  28 + capabilities = cfg.get("capabilities", {}) if isinstance(cfg, dict) else {}
  29 + for name, capability in capabilities.items():
  30 + backend = str(capability.get("backend") or "").strip().lower()
  31 + if backend not in LOCAL_BACKENDS:
  32 + continue
  33 + if selected and name not in selected:
  34 + continue
  35 + yield name, capability
  36 +
  37 +
  38 +def _compute_ct2_output_dir(capability: dict) -> Path:
  39 + custom = str(capability.get("ct2_model_dir") or "").strip()
  40 + if custom:
  41 + return Path(custom).expanduser()
  42 + model_dir = Path(str(capability.get("model_dir") or "")).expanduser()
  43 + compute_type = str(capability.get("ct2_compute_type") or capability.get("torch_dtype") or "default").strip().lower()
  44 + normalized = compute_type.replace("_", "-")
  45 + return model_dir / f"ctranslate2-{normalized}"
  46 +
  47 +
  48 +def convert_to_ctranslate2(name: str, capability: dict) -> None:
  49 + model_id = str(capability.get("model_id") or "").strip()
  50 + model_dir = Path(str(capability.get("model_dir") or "")).expanduser()
  51 + model_source = str(model_dir if model_dir.exists() else model_id)
  52 + output_dir = _compute_ct2_output_dir(capability)
  53 + if (output_dir / "model.bin").exists():
  54 + print(f"[skip-convert] {name} -> {output_dir}")
  55 + return
  56 + quantization = str(
  57 + capability.get("ct2_conversion_quantization")
  58 + or capability.get("ct2_compute_type")
  59 + or capability.get("torch_dtype")
  60 + or "default"
  61 + ).strip()
  62 + output_dir.parent.mkdir(parents=True, exist_ok=True)
  63 + print(f"[convert] {name} -> {output_dir} ({quantization})")
  64 + convert_transformers_model(model_source, str(output_dir), quantization)
  65 + print(f"[converted] {name}")
  66 +
  67 +
  68 +def main() -> None:
  69 + parser = argparse.ArgumentParser(description="Download local translation models")
  70 + parser.add_argument("--all-local", action="store_true", help="Download all configured local translation models")
  71 + parser.add_argument("--models", nargs="*", default=[], help="Specific capability names to download")
  72 + parser.add_argument(
  73 + "--convert-ctranslate2",
  74 + action="store_true",
  75 + help="Also convert the downloaded Hugging Face models into CTranslate2 format",
  76 + )
  77 + args = parser.parse_args()
  78 +
  79 + selected = {item.strip().lower() for item in args.models if item.strip()} or None
  80 + if not args.all_local and not selected:
  81 + parser.error("pass --all-local or --models <name> ...")
  82 +
  83 + for name, capability in iter_local_capabilities(selected):
  84 + model_id = str(capability.get("model_id") or "").strip()
  85 + model_dir = Path(str(capability.get("model_dir") or "")).expanduser()
  86 + if not model_id or not model_dir:
  87 + raise ValueError(f"Capability '{name}' must define model_id and model_dir")
  88 + model_dir.parent.mkdir(parents=True, exist_ok=True)
  89 + print(f"[download] {name} -> {model_dir} ({model_id})")
  90 + snapshot_download(
  91 + repo_id=model_id,
  92 + local_dir=str(model_dir),
  93 + )
  94 + print(f"[done] {name}")
  95 + if args.convert_ctranslate2:
  96 + convert_to_ctranslate2(name, capability)
  97 +
  98 +
  99 +if __name__ == "__main__":
  100 + main()
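The output-dir rule in `_compute_ct2_output_dir` above can be exercised on its own: an explicit `ct2_model_dir` wins, otherwise the directory is derived from `model_dir` plus the compute type with underscores normalized to hyphens. The capability dicts below are mock data.

```python
from pathlib import Path

# Standalone sketch of _compute_ct2_output_dir above (capability dicts are mock data).
def ct2_output_dir(capability: dict) -> Path:
    custom = str(capability.get("ct2_model_dir") or "").strip()
    if custom:
        # An explicit CT2 directory always takes precedence.
        return Path(custom).expanduser()
    model_dir = Path(str(capability.get("model_dir") or "")).expanduser()
    compute_type = str(
        capability.get("ct2_compute_type") or capability.get("torch_dtype") or "default"
    ).strip().lower()
    # Underscores become hyphens so int8_float16 -> ctranslate2-int8-float16.
    return model_dir / f"ctranslate2-{compute_type.replace('_', '-')}"

print(ct2_output_dir({"model_dir": "/models/nllb", "ct2_compute_type": "int8_float16"}))
# /models/nllb/ctranslate2-int8-float16
print(ct2_output_dir({"model_dir": "/models/nllb", "ct2_model_dir": "/fast/ct2"}))
# /fast/ct2
```

Because the derived path encodes the compute type, switching `ct2_compute_type` yields a separate conversion directory instead of silently overwriting a model converted with different quantization.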
search/es_query_builder.py
@@ -8,6 +8,7 @@ Simplified architecture: @@ -8,6 +8,7 @@ Simplified architecture:
8 - function_score wrapper for boosting fields 8 - function_score wrapper for boosting fields
9 """ 9 """
10 10
  11 +from dataclasses import dataclass
11 from typing import Dict, Any, List, Optional, Tuple 12 from typing import Dict, Any, List, Optional, Tuple
12 13
13 import numpy as np 14 import numpy as np
@@ -114,6 +115,171 @@ class ESQueryBuilder: @@ -114,6 +115,171 @@ class ESQueryBuilder:
114 self.phrase_match_tie_breaker = float(phrase_match_tie_breaker) 115 self.phrase_match_tie_breaker = float(phrase_match_tie_breaker)
115 self.phrase_match_boost = float(phrase_match_boost) 116 self.phrase_match_boost = float(phrase_match_boost)
116 117
  118 + @dataclass(frozen=True)
  119 + class KNNClausePlan:
  120 + field: str
  121 + boost: float
  122 + k: Optional[int] = None
  123 + num_candidates: Optional[int] = None
  124 + nested_path: Optional[str] = None
  125 +
  126 + @staticmethod
  127 + def _vector_to_list(vector: Any) -> List[float]:
  128 + if vector is None:
  129 + return []
  130 + if hasattr(vector, "tolist"):
  131 + values = vector.tolist()
  132 + else:
  133 + values = list(vector)
  134 + return [float(v) for v in values]
  135 +
  136 + @staticmethod
  137 + def _query_token_count(parsed_query: Optional[Any]) -> int:
  138 + if parsed_query is None:
  139 + return 0
  140 + query_tokens = getattr(parsed_query, "query_tokens", None) or []
  141 + return len(query_tokens)
  142 +
  143 + def get_text_knn_plan(self, parsed_query: Optional[Any] = None) -> Optional[KNNClausePlan]:
  144 + if not self.text_embedding_field:
  145 + return None
  146 + boost = self.knn_text_boost
  147 + final_knn_k = self.knn_text_k
  148 + final_knn_num_candidates = self.knn_text_num_candidates
  149 + if self._query_token_count(parsed_query) >= 5:
  150 + final_knn_k = self.knn_text_k_long
  151 + final_knn_num_candidates = self.knn_text_num_candidates_long
  152 + boost = self.knn_text_boost * 1.4
  153 + return self.KNNClausePlan(
  154 + field=str(self.text_embedding_field),
  155 + boost=float(boost),
  156 + k=int(final_knn_k),
  157 + num_candidates=int(final_knn_num_candidates),
  158 + )
  159 +
  160 + def get_image_knn_plan(self) -> Optional[KNNClausePlan]:
  161 + if not self.image_embedding_field:
  162 + return None
  163 + nested_path, _, _ = str(self.image_embedding_field).rpartition(".")
  164 + return self.KNNClausePlan(
  165 + field=str(self.image_embedding_field),
  166 + boost=float(self.knn_image_boost),
  167 + k=int(self.knn_image_k),
  168 + num_candidates=int(self.knn_image_num_candidates),
  169 + nested_path=nested_path or None,
  170 + )
  171 +
  172 + def build_text_knn_clause(
  173 + self,
  174 + query_vector: Any,
  175 + *,
  176 + parsed_query: Optional[Any] = None,
  177 + query_name: str = "knn_query",
  178 + ) -> Optional[Dict[str, Any]]:
  179 + plan = self.get_text_knn_plan(parsed_query)
  180 + if plan is None or query_vector is None:
  181 + return None
  182 + return {
  183 + "knn": {
  184 + "field": plan.field,
  185 + "query_vector": self._vector_to_list(query_vector),
  186 + "k": plan.k,
  187 + "num_candidates": plan.num_candidates,
  188 + "boost": plan.boost,
  189 + "_name": query_name,
  190 + }
  191 + }
  192 +
  193 + def build_image_knn_clause(
  194 + self,
  195 + image_query_vector: Any,
  196 + *,
  197 + query_name: str = "image_knn_query",
  198 + ) -> Optional[Dict[str, Any]]:
  199 + plan = self.get_image_knn_plan()
  200 + if plan is None or image_query_vector is None:
  201 + return None
  202 + image_knn_query = {
  203 + "field": plan.field,
  204 + "query_vector": self._vector_to_list(image_query_vector),
  205 + "k": plan.k,
  206 + "num_candidates": plan.num_candidates,
  207 + "boost": plan.boost,
  208 + }
  209 + if plan.nested_path:
  210 + return {
  211 + "nested": {
  212 + "path": plan.nested_path,
  213 + "_name": query_name,
  214 + "query": {"knn": image_knn_query},
  215 + "score_mode": "max",
  216 + }
  217 + }
  218 + return {
  219 + "knn": {
  220 + **image_knn_query,
  221 + "_name": query_name,
  222 + }
  223 + }
  224 +
  225 + def build_exact_text_knn_rescore_clause(
  226 + self,
  227 + query_vector: Any,
  228 + *,
  229 + parsed_query: Optional[Any] = None,
  230 + query_name: str = "exact_text_knn_query",
  231 + ) -> Optional[Dict[str, Any]]:
  232 + plan = self.get_text_knn_plan(parsed_query)
  233 + if plan is None or query_vector is None:
  234 + return None
  235 + return {
  236 + "script_score": {
  237 + "_name": query_name,
  238 + "query": {"exists": {"field": plan.field}},
  239 + "script": {
  240 + "source": (
  241 + f"((dotProduct(params.query_vector, '{plan.field}') + 1.0) / 2.0) * params.boost"
  242 + ),
  243 + "params": {
  244 + "query_vector": self._vector_to_list(query_vector),
  245 + "boost": float(plan.boost),
  246 + },
  247 + },
  248 + }
  249 + }
  250 +
  251 + def build_exact_image_knn_rescore_clause(
  252 + self,
  253 + image_query_vector: Any,
  254 + *,
  255 + query_name: str = "exact_image_knn_query",
  256 + ) -> Optional[Dict[str, Any]]:
  257 + plan = self.get_image_knn_plan()
  258 + if plan is None or image_query_vector is None:
  259 + return None
  260 + script_score_query = {
  261 + "query": {"exists": {"field": plan.field}},
  262 + "script": {
  263 + "source": (
  264 + f"((dotProduct(params.query_vector, '{plan.field}') + 1.0) / 2.0) * params.boost"
  265 + ),
  266 + "params": {
  267 + "query_vector": self._vector_to_list(image_query_vector),
  268 + "boost": float(plan.boost),
  269 + },
  270 + },
  271 + }
  272 + if plan.nested_path:
  273 + return {
  274 + "nested": {
  275 + "path": plan.nested_path,
  276 + "_name": query_name,
  277 + "score_mode": "max",
  278 + "query": {"script_score": script_score_query},
  279 + }
  280 + }
  281 + return {"script_score": {"_name": query_name, **script_score_query}}
  282 +
117 def _apply_source_filter(self, es_query: Dict[str, Any]) -> None: 283 def _apply_source_filter(self, es_query: Dict[str, Any]) -> None:
118 """ 284 """
119 Apply tri-state _source semantics: 285 Apply tri-state _source semantics:
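The long-query switch inside `get_text_knn_plan` above (token count >= 5 selects the `*_long` parameters and a 1.4x boost) reduces to a small pure function. The numeric defaults below are example values, not the project's actual config:

```python
# Sketch of the long-query switch in get_text_knn_plan above.
# k/candidates/boost defaults are illustrative placeholders, not real config values.
def text_knn_params(token_count: int, *, k=50, k_long=100,
                    candidates=200, candidates_long=400, boost=1.0):
    # Queries with >= 5 tokens widen the candidate pool and raise the KNN boost 1.4x,
    # since lexical recall degrades on long multi-intent queries.
    if token_count >= 5:
        return {"k": k_long, "num_candidates": candidates_long, "boost": boost * 1.4}
    return {"k": k, "num_candidates": candidates, "boost": boost}

print(text_knn_params(3))  # {'k': 50, 'num_candidates': 200, 'boost': 1.0}
print(text_knn_params(6))  # {'k': 100, 'num_candidates': 400, 'boost': 1.4}
```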
@@ -250,52 +416,21 @@ class ESQueryBuilder: @@ -250,52 +416,21 @@ class ESQueryBuilder:
250 # 3. Add KNN search clauses alongside lexical clauses under the same bool.should 416 # 3. Add KNN search clauses alongside lexical clauses under the same bool.should
251 # Text KNN: k / num_candidates from config; long queries use *_long and higher boost 417 # Text KNN: k / num_candidates from config; long queries use *_long and higher boost
252 if has_embedding: 418 if has_embedding:
253 - text_knn_boost = self.knn_text_boost  
254 - final_knn_k = self.knn_text_k  
255 - final_knn_num_candidates = self.knn_text_num_candidates  
256 - if parsed_query:  
257 - query_tokens = getattr(parsed_query, 'query_tokens', None) or []  
258 - token_count = len(query_tokens)  
259 - if token_count >= 5:  
260 - final_knn_k = self.knn_text_k_long  
261 - final_knn_num_candidates = self.knn_text_num_candidates_long  
262 - text_knn_boost = self.knn_text_boost * 1.4  
263 - recall_clauses.append({  
264 - "knn": {  
265 - "field": self.text_embedding_field,  
266 - "query_vector": query_vector.tolist(),  
267 - "k": final_knn_k,  
268 - "num_candidates": final_knn_num_candidates,  
269 - "boost": text_knn_boost,  
270 - "_name": "knn_query",  
271 - }  
272 - }) 419 + text_knn_clause = self.build_text_knn_clause(
  420 + query_vector,
  421 + parsed_query=parsed_query,
  422 + query_name="knn_query",
  423 + )
  424 + if text_knn_clause:
  425 + recall_clauses.append(text_knn_clause)
273 426
274 if has_image_embedding: 427 if has_image_embedding:
275 - nested_path, _, _ = str(self.image_embedding_field).rpartition(".")  
276 - image_knn_query = {  
277 - "field": self.image_embedding_field,  
278 - "query_vector": image_query_vector.tolist(),  
279 - "k": self.knn_image_k,  
280 - "num_candidates": self.knn_image_num_candidates,  
281 - "boost": self.knn_image_boost,  
282 - }  
283 - if nested_path:  
284 - recall_clauses.append({  
285 - "nested": {  
286 - "path": nested_path,  
287 - "_name": "image_knn_query",  
288 - "query": {"knn": image_knn_query},  
289 - "score_mode": "max",  
290 - }  
291 - })  
292 - else:  
293 - recall_clauses.append({  
294 - "knn": {  
295 - **image_knn_query,  
296 - "_name": "image_knn_query",  
297 - }  
298 - }) 428 + image_knn_clause = self.build_image_knn_clause(
  429 + image_query_vector,
  430 + query_name="image_knn_query",
  431 + )
  432 + if image_knn_clause:
  433 + recall_clauses.append(image_knn_clause)
299 434
300 # 4. Build main query structure: filters and recall 435 # 4. Build main query structure: filters and recall
301 if recall_clauses: 436 if recall_clauses:
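The exact-rescore script added above computes `((dotProduct + 1.0) / 2.0) * boost`. For unit-normalized vectors the dot product is cosine similarity in [-1, 1], so this maps the score into [0, boost] and keeps it non-negative, which `script_score` requires. A quick check of the mapping:

```python
# The exact-rescore script above evaluates ((dotProduct + 1) / 2) * boost.
# With unit vectors, dot product == cosine similarity in [-1, 1],
# so the result lands in [0, boost] and is never negative.
def rescore(query_vec, doc_vec, boost=1.0):
    dot = sum(q * d for q, d in zip(query_vec, doc_vec))
    return ((dot + 1.0) / 2.0) * boost

a = [1.0, 0.0]
print(rescore(a, [1.0, 0.0]))   # identical vectors  -> 1.0
print(rescore(a, [-1.0, 0.0]))  # opposite vectors   -> 0.0
print(rescore(a, [0.0, 1.0]))   # orthogonal vectors -> 0.5
```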
search/rerank_client.py
@@ -153,12 +153,59 @@ def _extract_named_query_score(matched_queries: Any, name: str) -> float: @@ -153,12 +153,59 @@ def _extract_named_query_score(matched_queries: Any, name: str) -> float:
153 return 0.0 153 return 0.0
154 154
155 155
  156 +def _resolve_named_query_score(
  157 + matched_queries: Any,
  158 + *,
  159 + preferred_names: List[str],
  160 + fallback_names: List[str],
  161 +) -> Tuple[float, Optional[str], float, Optional[str]]:
  162 + preferred_score = 0.0
  163 + preferred_name: Optional[str] = None
  164 + for name in preferred_names:
  165 + score = _extract_named_query_score(matched_queries, name)
  166 + if score > 0.0:
  167 + preferred_score = score
  168 + preferred_name = name
  169 + break
  170 +
  171 + fallback_score = 0.0
  172 + fallback_name: Optional[str] = None
  173 + for name in fallback_names:
  174 + score = _extract_named_query_score(matched_queries, name)
  175 + if score > 0.0:
  176 + fallback_score = score
  177 + fallback_name = name
  178 + break
  179 +
  180 + if preferred_name is None and preferred_names:
  181 + preferred_name = preferred_names[0]
  182 + preferred_score = _extract_named_query_score(matched_queries, preferred_name)
  183 + if fallback_name is None and fallback_names:
  184 + fallback_name = fallback_names[0]
  185 + fallback_score = _extract_named_query_score(matched_queries, fallback_name)
  186 + if preferred_score > 0.0:
  187 + return preferred_score, preferred_name, fallback_score, fallback_name
  188 + return fallback_score, fallback_name, preferred_score, preferred_name
  189 +
  190 +
156 def _collect_knn_score_components( 191 def _collect_knn_score_components(
157 matched_queries: Any, 192 matched_queries: Any,
158 fusion: RerankFusionConfig, 193 fusion: RerankFusionConfig,
159 ) -> Dict[str, float]: 194 ) -> Dict[str, float]:
160 - text_knn_score = _extract_named_query_score(matched_queries, "knn_query")  
161 - image_knn_score = _extract_named_query_score(matched_queries, "image_knn_query") 195 + text_knn_score, text_knn_source, _, _ = _resolve_named_query_score(
  196 + matched_queries,
  197 + preferred_names=["exact_text_knn_query"],
  198 + fallback_names=["knn_query"],
  199 + )
  200 + image_knn_score, image_knn_source, _, _ = _resolve_named_query_score(
  201 + matched_queries,
  202 + preferred_names=["exact_image_knn_query"],
  203 + fallback_names=["image_knn_query"],
  204 + )
  205 + exact_text_knn_score = _extract_named_query_score(matched_queries, "exact_text_knn_query")
  206 + exact_image_knn_score = _extract_named_query_score(matched_queries, "exact_image_knn_query")
  207 + approx_text_knn_score = _extract_named_query_score(matched_queries, "knn_query")
  208 + approx_image_knn_score = _extract_named_query_score(matched_queries, "image_knn_query")
162 209
163 weighted_text_knn_score = text_knn_score * float(fusion.knn_text_weight) 210 weighted_text_knn_score = text_knn_score * float(fusion.knn_text_weight)
164 weighted_image_knn_score = image_knn_score * float(fusion.knn_image_weight) 211 weighted_image_knn_score = image_knn_score * float(fusion.knn_image_weight)
@@ -171,6 +218,14 @@ def _collect_knn_score_components( @@ -171,6 +218,14 @@ def _collect_knn_score_components(
171 return { 218 return {
172 "text_knn_score": text_knn_score, 219 "text_knn_score": text_knn_score,
173 "image_knn_score": image_knn_score, 220 "image_knn_score": image_knn_score,
  221 + "exact_text_knn_score": exact_text_knn_score,
  222 + "exact_image_knn_score": exact_image_knn_score,
  223 + "approx_text_knn_score": approx_text_knn_score,
  224 + "approx_image_knn_score": approx_image_knn_score,
  225 + "text_knn_source": text_knn_source,
  226 + "image_knn_source": image_knn_source,
  227 + "approx_text_knn_source": "knn_query",
  228 + "approx_image_knn_source": "image_knn_query",
174 "weighted_text_knn_score": weighted_text_knn_score, 229 "weighted_text_knn_score": weighted_text_knn_score,
175 "weighted_image_knn_score": weighted_image_knn_score, 230 "weighted_image_knn_score": weighted_image_knn_score,
176 "primary_knn_score": primary_knn_score, 231 "primary_knn_score": primary_knn_score,
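The core of the preferred/fallback resolution in `_resolve_named_query_score` above is: take the exact-KNN named query's score when it is positive, otherwise fall back to the approximate one. A simplified sketch (a flat score dict stands in for the matched-queries payload):

```python
# Simplified sketch of _resolve_named_query_score above: prefer the exact-KNN
# named query, fall back to the approximate one when the exact score is absent
# or zero. `matched` is a plain {name: score} dict used as mock input.
def resolve_score(matched: dict, preferred: str, fallback: str):
    preferred_score = float(matched.get(preferred, 0.0))
    fallback_score = float(matched.get(fallback, 0.0))
    if preferred_score > 0.0:
        return preferred_score, preferred
    return fallback_score, fallback

print(resolve_score({"exact_text_knn_query": 0.9, "knn_query": 0.4},
                    "exact_text_knn_query", "knn_query"))  # (0.9, 'exact_text_knn_query')
print(resolve_score({"knn_query": 0.4},
                    "exact_text_knn_query", "knn_query"))  # (0.4, 'knn_query')
```

Keeping both raw scores (as the full function does) lets downstream LTR features observe the exact and approximate signals separately even when only one drives the fused score.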
@@ -322,6 +377,10 @@ def _build_ltr_feature_block( @@ -322,6 +377,10 @@ def _build_ltr_feature_block(
322 "text_support_score": float(text_components["support_text_score"]), 377 "text_support_score": float(text_components["support_text_score"]),
323 "text_knn_score": text_knn_score, 378 "text_knn_score": text_knn_score,
324 "image_knn_score": image_knn_score, 379 "image_knn_score": image_knn_score,
  380 + "exact_text_knn_score": float(knn_components["exact_text_knn_score"]),
  381 + "exact_image_knn_score": float(knn_components["exact_image_knn_score"]),
  382 + "approx_text_knn_score": float(knn_components["approx_text_knn_score"]),
  383 + "approx_image_knn_score": float(knn_components["approx_image_knn_score"]),
325 "knn_primary_score": float(knn_components["primary_knn_score"]), 384 "knn_primary_score": float(knn_components["primary_knn_score"]),
326 "knn_support_score": float(knn_components["support_knn_score"]), 385 "knn_support_score": float(knn_components["support_knn_score"]),
327 "has_text_match": source_score > 0.0, 386 "has_text_match": source_score > 0.0,
@@ -337,12 +396,50 @@ def _build_ltr_feature_block( @@ -337,12 +396,50 @@ def _build_ltr_feature_block(
337 } 396 }
338 397
339 398
  399 +def _maybe_append_weighted_knn_terms(
  400 + *,
  401 + term_rows: List[Dict[str, Any]],
  402 + fusion: CoarseRankFusionConfig | RerankFusionConfig,
  403 + knn_components: Optional[Dict[str, Any]],
  404 +) -> None:
  405 + if not knn_components:
  406 + return
  407 +
  408 + weighted_text_knn_score = _to_score(knn_components.get("weighted_text_knn_score"))
  409 + weighted_image_knn_score = _to_score(knn_components.get("weighted_image_knn_score"))
  410 +
  411 + if float(getattr(fusion, "knn_text_exponent", 0.0)) != 0.0:
  412 + text_bias = float(getattr(fusion, "knn_text_bias", fusion.knn_bias))
  413 + term_rows.append(
  414 + {
  415 + "name": "weighted_text_knn_score",
  416 + "raw_score": weighted_text_knn_score,
  417 + "bias": text_bias,
  418 + "exponent": float(fusion.knn_text_exponent),
  419 + "factor": (max(weighted_text_knn_score, 0.0) + text_bias) ** float(fusion.knn_text_exponent),
  420 + }
  421 + )
  422 + if float(getattr(fusion, "knn_image_exponent", 0.0)) != 0.0:
  423 + image_bias = float(getattr(fusion, "knn_image_bias", fusion.knn_bias))
  424 + term_rows.append(
  425 + {
  426 + "name": "weighted_image_knn_score",
  427 + "raw_score": weighted_image_knn_score,
  428 + "bias": image_bias,
  429 + "exponent": float(fusion.knn_image_exponent),
  430 + "factor": (max(weighted_image_knn_score, 0.0) + image_bias)
  431 + ** float(fusion.knn_image_exponent),
  432 + }
  433 + )
  434 +
  435 +
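For readers following the factor math in `_maybe_append_weighted_knn_terms`, a minimal standalone sketch of the per-term computation (hypothetical helper name, not part of the diff): a term only enters the fused product when its exponent is non-zero, which is why the helper skips exponent-0 terms — they would contribute a neutral factor of 1.0 anyway.

```python
# Hypothetical standalone form of the per-term fusion factor used above.
def fusion_factor(score: float, bias: float, exponent: float) -> float:
    # Negative scores are clamped to 0.0 before the bias is added;
    # an exponent of 0.0 makes the term neutral (factor == 1.0).
    return (max(score, 0.0) + bias) ** exponent
```

For example, `fusion_factor(0.8, 0.2, 1.0)` is `1.0`, while any score with exponent `0.0` collapses to the neutral `1.0`.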
340 def _compute_multiplicative_fusion( 436 def _compute_multiplicative_fusion(
341 *, 437 *,
342 es_score: float, 438 es_score: float,
343 text_score: float, 439 text_score: float,
344 knn_score: float, 440 knn_score: float,
345 fusion: RerankFusionConfig, 441 fusion: RerankFusionConfig,
  442 + knn_components: Optional[Dict[str, Any]] = None,
346 rerank_score: Optional[float] = None, 443 rerank_score: Optional[float] = None,
347 fine_score: Optional[float] = None, 444 fine_score: Optional[float] = None,
348 style_boost: float = 1.0, 445 style_boost: float = 1.0,
@@ -368,6 +465,7 @@ def _compute_multiplicative_fusion( @@ -368,6 +465,7 @@ def _compute_multiplicative_fusion(
368 _add_term("fine_score", fine_score, fusion.fine_bias, fusion.fine_exponent) 465 _add_term("fine_score", fine_score, fusion.fine_bias, fusion.fine_exponent)
369 _add_term("text_score", text_score, fusion.text_bias, fusion.text_exponent) 466 _add_term("text_score", text_score, fusion.text_bias, fusion.text_exponent)
370 _add_term("knn_score", knn_score, fusion.knn_bias, fusion.knn_exponent) 467 _add_term("knn_score", knn_score, fusion.knn_bias, fusion.knn_exponent)
  468 + _maybe_append_weighted_knn_terms(term_rows=term_rows, fusion=fusion, knn_components=knn_components)
371 469
372 fused = 1.0 470 fused = 1.0
373 factors: Dict[str, float] = {} 471 factors: Dict[str, float] = {}
@@ -391,12 +489,30 @@ def _multiply_coarse_fusion_factors( @@ -391,12 +489,30 @@ def _multiply_coarse_fusion_factors(
391 es_score: float, 489 es_score: float,
392 text_score: float, 490 text_score: float,
393 knn_score: float, 491 knn_score: float,
  492 + knn_components: Dict[str, Any],
394 fusion: CoarseRankFusionConfig, 493 fusion: CoarseRankFusionConfig,
395 -) -> Tuple[float, float, float, float]: 494 +) -> Tuple[float, float, float, float, float, float]:
396 es_factor = (max(es_score, 0.0) + fusion.es_bias) ** fusion.es_exponent 495 es_factor = (max(es_score, 0.0) + fusion.es_bias) ** fusion.es_exponent
397 text_factor = (max(text_score, 0.0) + fusion.text_bias) ** fusion.text_exponent 496 text_factor = (max(text_score, 0.0) + fusion.text_bias) ** fusion.text_exponent
398 knn_factor = (max(knn_score, 0.0) + fusion.knn_bias) ** fusion.knn_exponent 497 knn_factor = (max(knn_score, 0.0) + fusion.knn_bias) ** fusion.knn_exponent
399 - return es_factor, text_factor, knn_factor, es_factor * text_factor * knn_factor 498 + text_knn_bias = float(getattr(fusion, "knn_text_bias", fusion.knn_bias))
  499 + image_knn_bias = float(getattr(fusion, "knn_image_bias", fusion.knn_bias))
  500 + text_knn_factor = (
  501 + (max(_to_score(knn_components.get("weighted_text_knn_score")), 0.0) + text_knn_bias)
  502 + ** float(getattr(fusion, "knn_text_exponent", 0.0))
  503 + )
  504 + image_knn_factor = (
  505 + (max(_to_score(knn_components.get("weighted_image_knn_score")), 0.0) + image_knn_bias)
  506 + ** float(getattr(fusion, "knn_image_exponent", 0.0))
  507 + )
  508 + return (
  509 + es_factor,
  510 + text_factor,
  511 + knn_factor,
  512 + text_knn_factor,
  513 + image_knn_factor,
  514 + es_factor * text_factor * knn_factor * text_knn_factor * image_knn_factor,
  515 + )
400 516
401 517
402 def _has_selected_sku(hit: Dict[str, Any]) -> bool: 518 def _has_selected_sku(hit: Dict[str, Any]) -> bool:
@@ -422,10 +538,18 @@ def coarse_resort_hits( @@ -422,10 +538,18 @@ def coarse_resort_hits(
422 knn_components = signal_bundle["knn_components"] 538 knn_components = signal_bundle["knn_components"]
423 text_score = signal_bundle["text_score"] 539 text_score = signal_bundle["text_score"]
424 knn_score = signal_bundle["knn_score"] 540 knn_score = signal_bundle["knn_score"]
425 - es_factor, text_factor, knn_factor, coarse_score = _multiply_coarse_fusion_factors( 541 + (
  542 + es_factor,
  543 + text_factor,
  544 + knn_factor,
  545 + text_knn_factor,
  546 + image_knn_factor,
  547 + coarse_score,
  548 + ) = _multiply_coarse_fusion_factors(
426 es_score=es_score, 549 es_score=es_score,
427 text_score=text_score, 550 text_score=text_score,
428 knn_score=knn_score, 551 knn_score=knn_score,
  552 + knn_components=knn_components,
429 fusion=f, 553 fusion=f,
430 ) 554 )
431 555
@@ -433,6 +557,8 @@ def coarse_resort_hits( @@ -433,6 +557,8 @@ def coarse_resort_hits(
433 hit["_knn_score"] = knn_score 557 hit["_knn_score"] = knn_score
434 hit["_text_knn_score"] = knn_components["text_knn_score"] 558 hit["_text_knn_score"] = knn_components["text_knn_score"]
435 hit["_image_knn_score"] = knn_components["image_knn_score"] 559 hit["_image_knn_score"] = knn_components["image_knn_score"]
  560 + hit["_exact_text_knn_score"] = knn_components["exact_text_knn_score"]
  561 + hit["_exact_image_knn_score"] = knn_components["exact_image_knn_score"]
436 hit["_coarse_score"] = coarse_score 562 hit["_coarse_score"] = coarse_score
437 563
438 if debug: 564 if debug:
@@ -460,6 +586,12 @@ def coarse_resort_hits( @@ -460,6 +586,12 @@ def coarse_resort_hits(
460 ), 586 ),
461 "text_knn_score": knn_components["text_knn_score"], 587 "text_knn_score": knn_components["text_knn_score"],
462 "image_knn_score": knn_components["image_knn_score"], 588 "image_knn_score": knn_components["image_knn_score"],
  589 + "exact_text_knn_score": knn_components["exact_text_knn_score"],
  590 + "exact_image_knn_score": knn_components["exact_image_knn_score"],
  591 + "approx_text_knn_score": knn_components["approx_text_knn_score"],
  592 + "approx_image_knn_score": knn_components["approx_image_knn_score"],
  593 + "text_knn_source": knn_components["text_knn_source"],
  594 + "image_knn_source": knn_components["image_knn_source"],
463 "weighted_text_knn_score": knn_components["weighted_text_knn_score"], 595 "weighted_text_knn_score": knn_components["weighted_text_knn_score"],
464 "weighted_image_knn_score": knn_components["weighted_image_knn_score"], 596 "weighted_image_knn_score": knn_components["weighted_image_knn_score"],
465 "knn_primary_score": knn_components["primary_knn_score"], 597 "knn_primary_score": knn_components["primary_knn_score"],
@@ -468,6 +600,8 @@ def coarse_resort_hits( @@ -468,6 +600,8 @@ def coarse_resort_hits(
468 "coarse_es_factor": es_factor, 600 "coarse_es_factor": es_factor,
469 "coarse_text_factor": text_factor, 601 "coarse_text_factor": text_factor,
470 "coarse_knn_factor": knn_factor, 602 "coarse_knn_factor": knn_factor,
  603 + "coarse_text_knn_factor": text_knn_factor,
  604 + "coarse_image_knn_factor": image_knn_factor,
471 "coarse_score": coarse_score, 605 "coarse_score": coarse_score,
472 "matched_queries": matched_queries, 606 "matched_queries": matched_queries,
473 "ltr_features": ltr_features, 607 "ltr_features": ltr_features,
@@ -509,7 +643,7 @@ def fuse_scores_and_resort( @@ -509,7 +643,7 @@ def fuse_scores_and_resort(
509 - _rerank_score: score returned by the rerank service 643 - _rerank_score: score returned by the rerank service
510 - _fused_score: fused score 644 - _fused_score: fused score
511 - _text_score: text relevance score (prefers the base_query score from named queries) 645 - _text_score: text relevance score (prefers the base_query score from named queries)
512 - - _knn_score: KNN score (prefers the knn_query score from named queries) 646 + - _knn_score: KNN score (prefers exact named queries, falls back to ANN named queries when missing)

513 647
514 Args: 648 Args:
515 es_hits: ES hits 列表(会被原地修改) 649 es_hits: ES hits 列表(会被原地修改)
@@ -545,6 +679,7 @@ def fuse_scores_and_resort( @@ -545,6 +679,7 @@ def fuse_scores_and_resort(
545 text_score=text_score, 679 text_score=text_score,
546 knn_score=knn_score, 680 knn_score=knn_score,
547 fusion=f, 681 fusion=f,
  682 + knn_components=knn_components,
548 style_boost=style_boost, 683 style_boost=style_boost,
549 ) 684 )
550 fused = fusion_result["score"] 685 fused = fusion_result["score"]
@@ -557,6 +692,8 @@ def fuse_scores_and_resort( @@ -557,6 +692,8 @@ def fuse_scores_and_resort(
557 hit["_knn_score"] = knn_score 692 hit["_knn_score"] = knn_score
558 hit["_text_knn_score"] = knn_components["text_knn_score"] 693 hit["_text_knn_score"] = knn_components["text_knn_score"]
559 hit["_image_knn_score"] = knn_components["image_knn_score"] 694 hit["_image_knn_score"] = knn_components["image_knn_score"]
  695 + hit["_exact_text_knn_score"] = knn_components["exact_text_knn_score"]
  696 + hit["_exact_image_knn_score"] = knn_components["exact_image_knn_score"]
560 hit["_fused_score"] = fused 697 hit["_fused_score"] = fused
561 hit["_style_intent_selected_sku_boost"] = style_boost 698 hit["_style_intent_selected_sku_boost"] = style_boost
562 699
@@ -589,6 +726,12 @@ def fuse_scores_and_resort( @@ -589,6 +726,12 @@ def fuse_scores_and_resort(
589 "text_support_score": text_components["support_text_score"], 726 "text_support_score": text_components["support_text_score"],
590 "text_knn_score": knn_components["text_knn_score"], 727 "text_knn_score": knn_components["text_knn_score"],
591 "image_knn_score": knn_components["image_knn_score"], 728 "image_knn_score": knn_components["image_knn_score"],
  729 + "exact_text_knn_score": knn_components["exact_text_knn_score"],
  730 + "exact_image_knn_score": knn_components["exact_image_knn_score"],
  731 + "approx_text_knn_score": knn_components["approx_text_knn_score"],
  732 + "approx_image_knn_score": knn_components["approx_image_knn_score"],
  733 + "text_knn_source": knn_components["text_knn_source"],
  734 + "image_knn_source": knn_components["image_knn_source"],
592 "weighted_text_knn_score": knn_components["weighted_text_knn_score"], 735 "weighted_text_knn_score": knn_components["weighted_text_knn_score"],
593 "weighted_image_knn_score": knn_components["weighted_image_knn_score"], 736 "weighted_image_knn_score": knn_components["weighted_image_knn_score"],
594 "knn_primary_score": knn_components["primary_knn_score"], 737 "knn_primary_score": knn_components["primary_knn_score"],
@@ -603,6 +746,8 @@ def fuse_scores_and_resort( @@ -603,6 +746,8 @@ def fuse_scores_and_resort(
603 "es_factor": fusion_result["factors"].get("es_score"), 746 "es_factor": fusion_result["factors"].get("es_score"),
604 "text_factor": fusion_result["factors"].get("text_score"), 747 "text_factor": fusion_result["factors"].get("text_score"),
605 "knn_factor": fusion_result["factors"].get("knn_score"), 748 "knn_factor": fusion_result["factors"].get("knn_score"),
  749 + "text_knn_factor": fusion_result["factors"].get("weighted_text_knn_score"),
  750 + "image_knn_factor": fusion_result["factors"].get("weighted_image_knn_score"),
606 "style_intent_selected_sku": sku_selected, 751 "style_intent_selected_sku": sku_selected,
607 "style_intent_selected_sku_boost": style_boost, 752 "style_intent_selected_sku_boost": style_boost,
608 "matched_queries": signal_bundle["matched_queries"], 753 "matched_queries": signal_bundle["matched_queries"],
@@ -735,6 +880,7 @@ def run_lightweight_rerank( @@ -735,6 +880,7 @@ def run_lightweight_rerank(
735 text_score=text_score, 880 text_score=text_score,
736 knn_score=knn_score, 881 knn_score=knn_score,
737 fusion=f, 882 fusion=f,
  883 + knn_components=signal_bundle["knn_components"],
738 style_boost=style_boost, 884 style_boost=style_boost,
739 ) 885 )
740 886
@@ -744,6 +890,8 @@ def run_lightweight_rerank( @@ -744,6 +890,8 @@ def run_lightweight_rerank(
744 hit["_knn_score"] = knn_score 890 hit["_knn_score"] = knn_score
745 hit["_text_knn_score"] = signal_bundle["knn_components"]["text_knn_score"] 891 hit["_text_knn_score"] = signal_bundle["knn_components"]["text_knn_score"]
746 hit["_image_knn_score"] = signal_bundle["knn_components"]["image_knn_score"] 892 hit["_image_knn_score"] = signal_bundle["knn_components"]["image_knn_score"]
  893 + hit["_exact_text_knn_score"] = signal_bundle["knn_components"]["exact_text_knn_score"]
  894 + hit["_exact_image_knn_score"] = signal_bundle["knn_components"]["exact_image_knn_score"]
747 hit["_style_intent_selected_sku_boost"] = style_boost 895 hit["_style_intent_selected_sku_boost"] = style_boost
748 896
749 if debug: 897 if debug:
@@ -769,6 +917,8 @@ def run_lightweight_rerank( @@ -769,6 +917,8 @@ def run_lightweight_rerank(
769 "es_factor": fusion_result["factors"].get("es_score"), 917 "es_factor": fusion_result["factors"].get("es_score"),
770 "text_factor": fusion_result["factors"].get("text_score"), 918 "text_factor": fusion_result["factors"].get("text_score"),
771 "knn_factor": fusion_result["factors"].get("knn_score"), 919 "knn_factor": fusion_result["factors"].get("knn_score"),
  920 + "text_knn_factor": fusion_result["factors"].get("weighted_text_knn_score"),
  921 + "image_knn_factor": fusion_result["factors"].get("weighted_image_knn_score"),
772 "style_intent_selected_sku": sku_selected, 922 "style_intent_selected_sku": sku_selected,
773 "style_intent_selected_sku_boost": style_boost, 923 "style_intent_selected_sku_boost": style_boost,
774 "ltr_features": ltr_features, 924 "ltr_features": ltr_features,
search/searcher.py
@@ -236,6 +236,81 @@ class Searcher: @@ -236,6 +236,81 @@ class Searcher:
236 return 236 return
237 es_query["_source"] = {"includes": self.source_fields} 237 es_query["_source"] = {"includes": self.source_fields}
238 238
  239 + def _resolve_exact_knn_rescore_window(self) -> int:
  240 + configured = int(self.config.rerank.exact_knn_rescore_window)
  241 + if configured > 0:
  242 + return configured
  243 + return int(self.config.rerank.rerank_window)
  244 +
  245 + def _build_exact_knn_rescore(
  246 + self,
  247 + *,
  248 + query_vector: Any,
  249 + image_query_vector: Any,
  250 + parsed_query: Optional[ParsedQuery] = None,
  251 + ) -> Optional[Dict[str, Any]]:
  252 + clauses: List[Dict[str, Any]] = []
  253 +
  254 + text_clause = self.query_builder.build_exact_text_knn_rescore_clause(
  255 + query_vector,
  256 + parsed_query=parsed_query,
  257 + query_name="exact_text_knn_query",
  258 + )
  259 + if text_clause:
  260 + clauses.append(text_clause)
  261 +
  262 + image_clause = self.query_builder.build_exact_image_knn_rescore_clause(
  263 + image_query_vector,
  264 + query_name="exact_image_knn_query",
  265 + )
  266 + if image_clause:
  267 + clauses.append(image_clause)
  268 +
  269 + if not clauses:
  270 + return None
  271 +
  272 + return {
  273 + "window_size": self._resolve_exact_knn_rescore_window(),
  274 + "query": {
  275 + # Phase 1: only compute exact vector scores and expose them in matched_queries.
  276 + "score_mode": "total",
  277 + "query_weight": 1.0,
  278 + "rescore_query_weight": 0.0,
  279 + "rescore_query": {
  280 + "bool": {
  281 + "should": clauses,
  282 + "minimum_should_match": 1,
  283 + }
  284 + },
  285 + },
  286 + }
  287 +
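As a sanity check on the `query_weight: 1.0` / `rescore_query_weight: 0.0` choice above: with Elasticsearch's `total` score mode the rescored value is a weighted sum of the original query score and the rescore-query score, so these weights leave the ranking untouched while the exact-KNN clauses still surface through `matched_queries`. A minimal sketch (hypothetical helper name):

```python
# Sketch of Elasticsearch's "total" score_mode for a rescored hit:
# final = query_weight * query_score + rescore_query_weight * rescore_score.
def rescored(query_score: float, rescore_score: float,
             query_weight: float = 1.0, rescore_query_weight: float = 0.0) -> float:
    # With the defaults used in Phase 1 (1.0 / 0.0), the rescore clauses
    # are evaluated for their matched_queries scores but contribute
    # nothing to the final ranking.
    return query_weight * query_score + rescore_query_weight * rescore_score
```

`rescored(3.2, 9.9)` stays `3.2`; raising `rescore_query_weight` later would blend the exact-KNN score into the ranking.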
  288 + def _attach_exact_knn_rescore(
  289 + self,
  290 + es_query: Dict[str, Any],
  291 + *,
  292 + in_rank_window: bool,
  293 + query_vector: Any,
  294 + image_query_vector: Any,
  295 + parsed_query: Optional[ParsedQuery] = None,
  296 + ) -> None:
  297 + if not in_rank_window or not self.config.rerank.exact_knn_rescore_enabled:
  298 + return
  299 + rescore = self._build_exact_knn_rescore(
  300 + query_vector=query_vector,
  301 + image_query_vector=image_query_vector,
  302 + parsed_query=parsed_query,
  303 + )
  304 + if not rescore:
  305 + return
  306 + existing = es_query.get("rescore")
  307 + if existing is None:
  308 + es_query["rescore"] = rescore
  309 + elif isinstance(existing, list):
  310 + es_query["rescore"] = [*existing, rescore]
  311 + else:
  312 + es_query["rescore"] = [existing, rescore]
  313 +
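The attachment logic above normalizes `es_query["rescore"]` to a list whenever an entry already exists, since Elasticsearch accepts either a single rescore object or a list of them. A self-contained sketch of the same merge (hypothetical function name):

```python
# Append a rescore entry to an ES query body without clobbering an
# existing one; mirrors the None / list / single-object handling above.
def attach_rescore(es_query: dict, rescore: dict) -> None:
    existing = es_query.get("rescore")
    if existing is None:
        es_query["rescore"] = rescore
    elif isinstance(existing, list):
        es_query["rescore"] = [*existing, rescore]
    else:
        es_query["rescore"] = [existing, rescore]
```

Attaching to a query that already carries one rescore yields a two-element list, preserving the original entry's position (Elasticsearch applies list entries in order).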
239 def _resolve_rerank_source_filter( 314 def _resolve_rerank_source_filter(
240 self, 315 self,
241 doc_template: str, 316 doc_template: str,
@@ -401,7 +476,9 @@ class Searcher: @@ -401,7 +476,9 @@ class Searcher:
401 language: Response / field selection language hint (e.g. zh, en) 476 language: Response / field selection language hint (e.g. zh, en)
402 sku_filter_dimension: SKU grouping dimensions for per-SPU variant pick 477 sku_filter_dimension: SKU grouping dimensions for per-SPU variant pick
403 enable_rerank: If None, use ``config.rerank.enabled``; if set, overrides 478 enable_rerank: If None, use ``config.rerank.enabled``; if set, overrides
404 - whether the rerank provider is invoked (subject to rerank window). 479 + whether the final rerank provider is invoked (subject to rank window).
  480 + When false, the ranking pipeline still runs and the rerank stage
  481 + becomes a pass-through.
405 rerank_query_template: Override for rerank query text template; None uses 482 rerank_query_template: Override for rerank query text template; None uses
406 ``config.rerank.rerank_query_template`` (e.g. ``"{query}"``). 483 ``config.rerank.rerank_query_template`` (e.g. ``"{query}"``).
407 rerank_doc_template: Override for per-hit document text passed to rerank; 484 rerank_doc_template: Override for per-hit document text passed to rerank;
@@ -430,15 +507,16 @@ class Searcher: @@ -430,15 +507,16 @@ class Searcher:
430 # Rerank switch precedence: explicit request value > server-side config (enabled by default) 507 # Rerank switch precedence: explicit request value > server-side config (enabled by default)
431 rerank_enabled_by_config = bool(rc.enabled) 508 rerank_enabled_by_config = bool(rc.enabled)
432 do_rerank = rerank_enabled_by_config if enable_rerank is None else bool(enable_rerank) 509 do_rerank = rerank_enabled_by_config if enable_rerank is None else bool(enable_rerank)
  510 + fine_enabled = bool(fine_cfg.enabled)
433 rerank_window = rc.rerank_window 511 rerank_window = rc.rerank_window
434 coarse_input_window = max(rerank_window, int(coarse_cfg.input_window)) 512 coarse_input_window = max(rerank_window, int(coarse_cfg.input_window))
435 coarse_output_window = max(rerank_window, int(coarse_cfg.output_window)) 513 coarse_output_window = max(rerank_window, int(coarse_cfg.output_window))
436 fine_input_window = max(rerank_window, int(fine_cfg.input_window)) 514 fine_input_window = max(rerank_window, int(fine_cfg.input_window))
437 fine_output_window = max(rerank_window, int(fine_cfg.output_window)) 515 fine_output_window = max(rerank_window, int(fine_cfg.output_window))
438 - # If rerank is enabled and the requested range fits the window: fetch the top rerank_window hits from ES, rerank, then paginate by from/size; otherwise skip rerank and query ES with the original from/size
439 - in_rerank_window = do_rerank and (from_ + size) <= rerank_window  
440 - es_fetch_from = 0 if in_rerank_window else from_  
441 - es_fetch_size = coarse_input_window if in_rerank_window else size 516 + # The multi-stage ranking window is independent of the final rerank switch: the coarse/fine pipeline still runs even when final rerank is disabled.
  517 + in_rank_window = (from_ + size) <= rerank_window
  518 + es_fetch_from = 0 if in_rank_window else from_
  519 + es_fetch_size = coarse_input_window if in_rank_window else size
442 520
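The window logic above can be restated as a pure function (hypothetical name) to show how paging interacts with the rank window: requests that land inside the window fetch the full coarse input from offset 0 and paginate after re-ranking, while deep pages bypass the pipeline and query ES with the caller's `from`/`size` directly.

```python
# Sketch of the fetch-window decision used above.
def fetch_plan(from_: int, size: int, rerank_window: int, coarse_input_window: int):
    # The multi-stage window no longer depends on do_rerank in the new code.
    in_rank_window = (from_ + size) <= rerank_window
    es_fetch_from = 0 if in_rank_window else from_
    es_fetch_size = coarse_input_window if in_rank_window else size
    return in_rank_window, es_fetch_from, es_fetch_size
```

For instance, page one of a 20-result request against a window of 100 fetches the whole coarse input window; a request starting at offset 120 falls outside and is passed through unchanged.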
443 es_score_normalization_factor: Optional[float] = None 521 es_score_normalization_factor: Optional[float] = None
444 initial_ranks_by_doc: Dict[str, int] = {} 522 initial_ranks_by_doc: Dict[str, int] = {}
@@ -455,7 +533,8 @@ class Searcher: @@ -455,7 +533,8 @@ class Searcher:
455 context.logger.info( 533 context.logger.info(
456 f"开始搜索请求 | 查询: '{query}' | 参数: size={size}, from_={from_}, " 534 f"开始搜索请求 | 查询: '{query}' | 参数: size={size}, from_={from_}, "
457 f"enable_rerank(request)={enable_rerank}, enable_rerank(config)={rerank_enabled_by_config}, " 535 f"enable_rerank(request)={enable_rerank}, enable_rerank(config)={rerank_enabled_by_config}, "
458 - f"enable_rerank(effective)={do_rerank}, in_rerank_window={in_rerank_window}, " 536 + f"fine_enabled(config)={fine_enabled}, "
  537 + f"enable_rerank(effective)={do_rerank}, in_rank_window={in_rank_window}, "
459 f"es_fetch=({es_fetch_from},{es_fetch_size}) | " 538 f"es_fetch=({es_fetch_from},{es_fetch_size}) | "
460 f"index_languages={index_langs} | " 539 f"index_languages={index_langs} | "
461 f"enable_translation={enable_translation}, enable_embedding={enable_embedding}, min_score={min_score}", 540 f"enable_translation={enable_translation}, enable_embedding={enable_embedding}, min_score={min_score}",
@@ -468,8 +547,9 @@ class Searcher: @@ -468,8 +547,9 @@ class Searcher:
468 'from_': from_, 547 'from_': from_,
469 'es_fetch_from': es_fetch_from, 548 'es_fetch_from': es_fetch_from,
470 'es_fetch_size': es_fetch_size, 549 'es_fetch_size': es_fetch_size,
471 - 'in_rerank_window': in_rerank_window, 550 + 'in_rank_window': in_rank_window,
472 'rerank_enabled_by_config': rerank_enabled_by_config, 551 'rerank_enabled_by_config': rerank_enabled_by_config,
  552 + 'fine_enabled': fine_enabled,
473 'enable_rerank_request': enable_rerank, 553 'enable_rerank_request': enable_rerank,
474 'rerank_query_template': effective_query_template, 554 'rerank_query_template': effective_query_template,
475 'rerank_doc_template': effective_doc_template, 555 'rerank_doc_template': effective_doc_template,
@@ -494,6 +574,7 @@ class Searcher: @@ -494,6 +574,7 @@ class Searcher:
494 context.metadata['feature_flags'] = { 574 context.metadata['feature_flags'] = {
495 'translation_enabled': enable_translation, 575 'translation_enabled': enable_translation,
496 'embedding_enabled': enable_embedding, 576 'embedding_enabled': enable_embedding,
  577 + 'fine_enabled': fine_enabled,
497 'rerank_enabled': do_rerank, 578 'rerank_enabled': do_rerank,
498 'style_intent_enabled': bool(self.style_intent_registry.enabled), 579 'style_intent_enabled': bool(self.style_intent_registry.enabled),
499 } 580 }
@@ -526,7 +607,7 @@ class Searcher: @@ -526,7 +607,7 @@ class Searcher:
526 f"语言: {parsed_query.detected_language} | " 607 f"语言: {parsed_query.detected_language} | "
527 f"关键词: {parsed_query.keywords_queries} | " 608 f"关键词: {parsed_query.keywords_queries} | "
528 f"文本向量: {'是' if parsed_query.query_vector is not None else '否'} | " 609 f"文本向量: {'是' if parsed_query.query_vector is not None else '否'} | "
529 - f"图片向量: {'是' if getattr(parsed_query, 'image_query_vector', None) is not None else '否'}", 610 + f"图片向量: {'是' if parsed_query.image_query_vector is not None else '否'}",
530 extra={'reqid': context.reqid, 'uid': context.uid} 611 extra={'reqid': context.reqid, 'uid': context.uid}
531 ) 612 )
532 except Exception as e: 613 except Exception as e:
@@ -545,17 +626,16 @@ class Searcher: @@ -545,17 +626,16 @@ class Searcher:
545 # Generate tenant-specific index name 626 # Generate tenant-specific index name
546 index_name = get_tenant_index_name(tenant_id) 627 index_name = get_tenant_index_name(tenant_id)
547 # index_name = "search_products" 628 # index_name = "search_products"
548 - 629 +
549 # No longer need to add tenant_id to filters since each tenant has its own index 630 # No longer need to add tenant_id to filters since each tenant has its own index
  631 + image_query_vector = None
  632 + if enable_embedding:
  633 + image_query_vector = parsed_query.image_query_vector
550 634
551 es_query = self.query_builder.build_query( 635 es_query = self.query_builder.build_query(
552 query_text=parsed_query.rewritten_query or parsed_query.query_normalized, 636 query_text=parsed_query.rewritten_query or parsed_query.query_normalized,
553 query_vector=parsed_query.query_vector if enable_embedding else None, 637 query_vector=parsed_query.query_vector if enable_embedding else None,
554 - image_query_vector=(  
555 - getattr(parsed_query, "image_query_vector", None)  
556 - if enable_embedding  
557 - else None  
558 - ), 638 + image_query_vector=image_query_vector,
559 filters=filters, 639 filters=filters,
560 range_filters=range_filters, 640 range_filters=range_filters,
561 facet_configs=facets, 641 facet_configs=facets,
@@ -563,11 +643,18 @@ class Searcher: @@ -563,11 +643,18 @@ class Searcher:
563 from_=es_fetch_from, 643 from_=es_fetch_from,
564 enable_knn=enable_embedding and ( 644 enable_knn=enable_embedding and (
565 parsed_query.query_vector is not None 645 parsed_query.query_vector is not None
566 - or getattr(parsed_query, "image_query_vector", None) is not None 646 + or image_query_vector is not None
567 ), 647 ),
568 min_score=min_score, 648 min_score=min_score,
569 parsed_query=parsed_query, 649 parsed_query=parsed_query,
570 ) 650 )
  651 + self._attach_exact_knn_rescore(
  652 + es_query,
  653 + in_rank_window=in_rank_window,
  654 + query_vector=parsed_query.query_vector if enable_embedding else None,
  655 + image_query_vector=image_query_vector,
  656 + parsed_query=parsed_query,
  657 + )
571 658
572 # Add facets for faceted search 659 # Add facets for faceted search
573 if facets: 660 if facets:
@@ -587,8 +674,7 @@ class Searcher: @@ -587,8 +674,7 @@ class Searcher:
587 674
588 # In multi-stage rank window, first pass only needs score signals for coarse rank. 675 # In multi-stage rank window, first pass only needs score signals for coarse rank.
589 es_query_for_fetch = es_query 676 es_query_for_fetch = es_query
590 - rerank_prefetch_source = None  
591 - if in_rerank_window: 677 + if in_rank_window:
592 es_query_for_fetch = dict(es_query) 678 es_query_for_fetch = dict(es_query)
593 es_query_for_fetch["_source"] = False 679 es_query_for_fetch["_source"] = False
594 680
@@ -597,31 +683,28 @@ class Searcher: @@ -597,31 +683,28 @@ class Searcher:
597 683
598 # Store ES query in context 684 # Store ES query in context
599 context.store_intermediate_result('es_query', es_query) 685 context.store_intermediate_result('es_query', es_query)
600 - if in_rerank_window and rerank_prefetch_source is not None:  
601 - context.store_intermediate_result('es_query_rerank_prefetch_source', rerank_prefetch_source)  
602 # Serialize ES query to compute a compact size + stable digest for correlation 686 # Serialize ES query to compute a compact size + stable digest for correlation
603 es_query_compact = json.dumps(es_query_for_fetch, ensure_ascii=False, separators=(",", ":")) 687 es_query_compact = json.dumps(es_query_for_fetch, ensure_ascii=False, separators=(",", ":"))
604 es_query_digest = hashlib.sha256(es_query_compact.encode("utf-8")).hexdigest()[:16] 688 es_query_digest = hashlib.sha256(es_query_compact.encode("utf-8")).hexdigest()[:16]
605 knn_enabled = bool(enable_embedding and ( 689 knn_enabled = bool(enable_embedding and (
606 parsed_query.query_vector is not None 690 parsed_query.query_vector is not None
607 - or getattr(parsed_query, "image_query_vector", None) is not None 691 + or image_query_vector is not None
608 )) 692 ))
609 vector_dims = int(len(parsed_query.query_vector)) if parsed_query.query_vector is not None else 0 693 vector_dims = int(len(parsed_query.query_vector)) if parsed_query.query_vector is not None else 0
610 image_vector_dims = ( 694 image_vector_dims = (
611 - int(len(parsed_query.image_query_vector))  
612 - if getattr(parsed_query, "image_query_vector", None) is not None 695 + int(len(image_query_vector))
  696 + if image_query_vector is not None
613 else 0 697 else 0
614 ) 698 )
615 699
616 context.logger.info( 700 context.logger.info(
617 - "ES query built | size: %s chars | digest: %s | KNN: %s | vector_dims: %s | image_vector_dims: %s | facets: %s | rerank_prefetch_source: %s", 701 + "ES query built | size: %s chars | digest: %s | KNN: %s | vector_dims: %s | image_vector_dims: %s | facets: %s",
618 len(es_query_compact), 702 len(es_query_compact),
619 es_query_digest, 703 es_query_digest,
620 "yes" if knn_enabled else "no", 704 "yes" if knn_enabled else "no",
621 vector_dims, 705 vector_dims,
622 image_vector_dims, 706 image_vector_dims,
623 "yes" if facets else "no", 707 "yes" if facets else "no",
624 - rerank_prefetch_source,  
625 extra={'reqid': context.reqid, 'uid': context.uid} 708 extra={'reqid': context.reqid, 'uid': context.uid}
626 ) 709 )
627 _log_backend_verbose({ 710 _log_backend_verbose({
@@ -656,7 +739,7 @@ class Searcher: @@ -656,7 +739,7 @@ class Searcher:
656 body=body_for_es, 739 body=body_for_es,
657 size=es_fetch_size, 740 size=es_fetch_size,
658 from_=es_fetch_from, 741 from_=es_fetch_from,
659 - include_named_queries_score=bool(do_rerank and in_rerank_window), 742 + include_named_queries_score=bool(in_rank_window),
660 ) 743 )
661 744
662 # Store ES response in context 745 # Store ES response in context
@@ -698,10 +781,177 @@ class Searcher: @@ -698,10 +781,177 @@ class Searcher:
698 context.end_stage(RequestContextStage.ELASTICSEARCH_SEARCH_PRIMARY) 781 context.end_stage(RequestContextStage.ELASTICSEARCH_SEARCH_PRIMARY)
699 782
700 style_intent_decisions: Dict[str, SkuSelectionDecision] = {} 783 style_intent_decisions: Dict[str, SkuSelectionDecision] = {}
701 - if do_rerank and in_rerank_window: 784 + if in_rank_window:
702 from dataclasses import asdict 785 from dataclasses import asdict
703 from config.services_config import get_rerank_backend_config, get_rerank_service_url 786 from config.services_config import get_rerank_backend_config, get_rerank_service_url
704 from .rerank_client import coarse_resort_hits, run_lightweight_rerank, run_rerank 787 from .rerank_client import coarse_resort_hits, run_lightweight_rerank, run_rerank
  788 + coarse_fusion_debug = asdict(coarse_cfg.fusion)
  789 + stage_fusion_debug = asdict(rc.fusion)
  790 +
  791 + def _rank_map(stage_hits: List[Dict[str, Any]]) -> Dict[str, int]:
  792 + return {
  793 + str(hit.get("_id")): rank
  794 + for rank, hit in enumerate(stage_hits, 1)
  795 + if hit.get("_id") is not None
  796 + }
  797 +
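`_rank_map` builds 1-based ranks keyed by stringified `_id`, silently dropping hits that carry no id. An equivalent standalone sketch (hypothetical name, untyped for brevity):

```python
# 1-based ranks keyed by document id, skipping hits without an _id.
def rank_map(stage_hits):
    return {
        str(hit.get("_id")): rank
        for rank, hit in enumerate(stage_hits, 1)
        if hit.get("_id") is not None
    }
```

A hit list of `[{"_id": "a"}, {"_id": "b"}, {"_score": 1.0}]` maps to `{"a": 1, "b": 2}` — the id-less hit is excluded rather than ranked.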
  798 + def _stage_debug_info(
  799 + *,
  800 + enabled: bool,
  801 + applied: bool,
  802 + skipped_reason: Optional[str],
  803 + service_profile: Optional[str],
  804 + query_template: str,
  805 + doc_template: str,
  806 + docs_in: int,
  807 + docs_out: int,
  808 + top_n: int,
  809 + meta: Optional[Dict[str, Any]] = None,
  810 + backend: Optional[str] = None,
  811 + backend_model_name: Optional[str] = None,
  812 + service_url: Optional[str] = None,
  813 + model: Optional[str] = None,
  814 + fusion: Optional[Dict[str, Any]] = None,
  815 + ) -> Dict[str, Any]:
  816 + return {
  817 + "enabled": enabled,
  818 + "applied": applied,
  819 + "passthrough": not applied,
  820 + "skipped_reason": skipped_reason,
  821 + "service_profile": service_profile,
  822 + "service_url": service_url,
  823 + "backend": backend,
  824 + "model": model,
  825 + "backend_model_name": backend_model_name,
  826 + "query_template": query_template,
  827 + "doc_template": doc_template,
  828 + "query_text": str(query_template).format_map({"query": rerank_query}),
  829 + "docs_in": docs_in,
  830 + "docs_out": docs_out,
  831 + "top_n": top_n,
  832 + "meta": meta,
  833 + "fusion": fusion,
  834 + }
  835 +
  836 + def _run_optional_stage(
  837 + *,
  838 + stage: RequestContextStage,
  839 + stage_label: str,
  840 + enabled: bool,
  841 + stage_hits: List[Dict[str, Any]],
  842 + input_limit: int,
  843 + output_limit: int,
  844 + service_profile: Optional[str],
  845 + query_template: str,
  846 + doc_template: str,
  847 + top_n: int,
  848 + debug_key: Optional[str],
  849 + runner,
  850 + ) -> tuple[List[Dict[str, Any]], Dict[str, int], Optional[Dict[str, Any]]]:
  851 + context.start_stage(stage)
  852 + try:
  853 + input_hits = list(stage_hits[:input_limit])
  854 + output_hits = list(stage_hits[:output_limit])
  855 + applied = False
  856 + skip_reason: Optional[str] = None
  857 + meta: Optional[Dict[str, Any]] = None
  858 + debug_rows: Optional[List[Dict[str, Any]]] = None
  859 +
  860 + if enabled and input_hits:
  861 + output_hits_candidate, applied, meta, debug_rows = runner(input_hits)
  862 + if applied:
  863 + output_hits = list((output_hits_candidate or input_hits)[:output_limit])
  864 + else:
  865 + skip_reason = "service_returned_none"
  866 + else:
  867 + skip_reason = "disabled" if not enabled else "no_hits"
  868 +
  869 + ranks = _rank_map(output_hits) if debug else {}
  870 + stage_info = None
  871 + if debug:
  872 + if applied:
  873 + backend_name, backend_cfg = get_rerank_backend_config(service_profile)
  874 + stage_info = _stage_debug_info(
  875 + enabled=True,
  876 + applied=True,
  877 + skipped_reason=None,
  878 + service_profile=service_profile,
  879 + service_url=get_rerank_service_url(profile=service_profile),
  880 + backend=backend_name,
  881 + backend_model_name=backend_cfg.get("model_name"),
  882 + model=meta.get("model") if isinstance(meta, dict) else None,
  883 + query_template=query_template,
  884 + doc_template=doc_template,
  885 + docs_in=len(input_hits),
  886 + docs_out=len(output_hits),
  887 + top_n=top_n,
  888 + meta=meta,
  889 + fusion=stage_fusion_debug,
  890 + )
  891 + if debug_key is not None and debug_rows is not None:
  892 + context.store_intermediate_result(debug_key, debug_rows)
  893 + else:
  894 + stage_info = _stage_debug_info(
  895 + enabled=enabled,
  896 + applied=False,
  897 + skipped_reason=skip_reason,
  898 + service_profile=service_profile,
  899 + query_template=query_template,
  900 + doc_template=doc_template,
  901 + docs_in=len(input_hits),
  902 + docs_out=len(output_hits),
  903 + top_n=top_n,
  904 + fusion=stage_fusion_debug,
  905 + )
  906 +
  907 + if applied:
  908 + context.logger.info(
  909 + "%s完成 | docs=%s | top_n=%s | meta=%s",
  910 + stage_label,
  911 + len(output_hits),
  912 + top_n,
  913 + meta,
  914 + extra={'reqid': context.reqid, 'uid': context.uid}
  915 + )
  916 + else:
  917 + context.logger.info(
  918 + "%s透传 | reason=%s | docs=%s | top_n=%s",
  919 + stage_label,
  920 + skip_reason,
  921 + len(output_hits),
  922 + top_n,
  923 + extra={'reqid': context.reqid, 'uid': context.uid}
  924 + )
  925 + return output_hits, ranks, stage_info
  926 + except Exception as e:
  927 + output_hits = list(stage_hits[:output_limit])
  928 + ranks = _rank_map(output_hits) if debug else {}
  929 + stage_info = None
  930 + if debug:
  931 + stage_info = _stage_debug_info(
  932 + enabled=enabled,
  933 + applied=False,
  934 + skipped_reason="error",
  935 + service_profile=service_profile,
  936 + query_template=query_template,
  937 + doc_template=doc_template,
  938 + docs_in=min(len(stage_hits), input_limit),
  939 + docs_out=len(output_hits),
  940 + top_n=top_n,
  941 + meta={"error": str(e)},
  942 + fusion=stage_fusion_debug,
  943 + )
  944 + context.add_warning(f"{stage_label} failed: {e}")
  945 + context.logger.warning(
  946 + "调用%s服务失败 | error: %s",
  947 + stage_label,
  948 + e,
  949 + extra={'reqid': context.reqid, 'uid': context.uid},
  950 + exc_info=True,
  951 + )
  952 + return output_hits, ranks, stage_info
  953 + finally:
  954 + context.end_stage(stage)
705 955
706 rerank_query = parsed_query.text_for_rerank() if parsed_query else query 956 rerank_query = parsed_query.text_for_rerank() if parsed_query else query
707 hits = es_response.get("hits", {}).get("hits") or [] 957 hits = es_response.get("hits", {}).get("hits") or []
@@ -716,17 +966,12 @@ class Searcher: @@ -716,17 +966,12 @@ class Searcher:
716 hits = hits[:coarse_output_window] 966 hits = hits[:coarse_output_window]
717 es_response.setdefault("hits", {})["hits"] = hits 967 es_response.setdefault("hits", {})["hits"] = hits
718 if debug: 968 if debug:
719 - coarse_ranks_by_doc = {  
720 - str(hit.get("_id")): rank  
721 - for rank, hit in enumerate(hits, 1)  
722 - if hit.get("_id") is not None 969 + coarse_ranks_by_doc = _rank_map(hits)
  970 + coarse_debug_info = {
  971 + "docs_in": es_fetch_size,
  972 + "docs_out": len(hits),
  973 + "fusion": coarse_fusion_debug,
723 } 974 }
724 - if debug:  
725 - coarse_debug_info = {  
726 - "docs_in": es_fetch_size,  
727 - "docs_out": len(hits),  
728 - "fusion": asdict(coarse_cfg.fusion),  
729 - }  
730 context.store_intermediate_result("coarse_rank_scores", coarse_debug) 975 context.store_intermediate_result("coarse_rank_scores", coarse_debug)
731 context.logger.info( 976 context.logger.info(
732 "粗排完成 | docs_in=%s | docs_out=%s", 977 "粗排完成 | docs_in=%s | docs_out=%s",
@@ -777,72 +1022,42 @@ class Searcher: @@ -777,72 +1022,42 @@ class Searcher:
777 extra={'reqid': context.reqid, 'uid': context.uid} 1022 extra={'reqid': context.reqid, 'uid': context.uid}
778 ) 1023 )
779 1024
780 - fine_scores: Optional[List[float]] = None  
781 - hits = es_response.get("hits", {}).get("hits") or []  
782 - if fine_cfg.enabled and hits:  
783 - context.start_stage(RequestContextStage.FINE_RANKING)  
784 - try:  
785 - fine_scores, fine_meta, fine_debug_rows = run_lightweight_rerank(  
786 - query=rerank_query,  
787 - es_hits=hits[:fine_input_window],  
788 - language=language,  
789 - timeout_sec=fine_cfg.timeout_sec,  
790 - rerank_query_template=fine_query_template,  
791 - rerank_doc_template=fine_doc_template,  
792 - top_n=fine_output_window,  
793 - debug=debug,  
794 - fusion=rc.fusion,  
795 - style_intent_selected_sku_boost=self.config.query_config.style_intent_selected_sku_boost,  
796 - service_profile=fine_cfg.service_profile,  
797 - )  
798 - if fine_scores is not None:  
799 - hits = hits[:fine_output_window]  
800 - es_response["hits"]["hits"] = hits  
801 - if debug:  
802 - fine_ranks_by_doc = {  
803 - str(hit.get("_id")): rank  
804 - for rank, hit in enumerate(hits, 1)  
805 - if hit.get("_id") is not None  
806 - }  
807 - fine_backend_name, fine_backend_cfg = get_rerank_backend_config(fine_cfg.service_profile)  
808 - fine_debug_info = {  
809 - "service_profile": fine_cfg.service_profile,  
810 - "service_url": get_rerank_service_url(profile=fine_cfg.service_profile),  
811 - "backend": fine_backend_name,  
812 - "model": fine_meta.get("model") if isinstance(fine_meta, dict) else None,  
813 - "backend_model_name": fine_backend_cfg.get("model_name"),  
814 - "query_template": fine_query_template,  
815 - "doc_template": fine_doc_template,  
816 - "query_text": str(fine_query_template).format_map({"query": rerank_query}),  
817 - "docs_in": min(len(fine_scores), fine_input_window),  
818 - "docs_out": len(hits),  
819 - "top_n": fine_output_window,  
820 - "meta": fine_meta,  
821 - "fusion": asdict(rc.fusion),  
822 - }  
823 - context.store_intermediate_result("fine_rank_scores", fine_debug_rows)  
824 - context.logger.info(  
825 - "精排完成 | docs=%s | top_n=%s | meta=%s",  
826 - len(hits),  
827 - fine_output_window,  
828 - fine_meta,  
829 - extra={'reqid': context.reqid, 'uid': context.uid}  
830 - )  
831 - except Exception as e:  
832 - context.add_warning(f"Fine rerank failed: {e}")  
833 - context.logger.warning(  
834 - f"调用精排服务失败 | error: {e}",  
835 - extra={'reqid': context.reqid, 'uid': context.uid},  
836 - exc_info=True,  
837 - )  
838 - finally:  
839 - context.end_stage(RequestContextStage.FINE_RANKING) 1025 + def _run_fine_stage(stage_input: List[Dict[str, Any]]):
  1026 + fine_scores, fine_meta, fine_debug_rows = run_lightweight_rerank(
  1027 + query=rerank_query,
  1028 + es_hits=stage_input,
  1029 + language=language,
  1030 + timeout_sec=fine_cfg.timeout_sec,
  1031 + rerank_query_template=fine_query_template,
  1032 + rerank_doc_template=fine_doc_template,
  1033 + top_n=fine_output_window,
  1034 + debug=debug,
  1035 + fusion=rc.fusion,
  1036 + style_intent_selected_sku_boost=self.config.query_config.style_intent_selected_sku_boost,
  1037 + service_profile=fine_cfg.service_profile,
  1038 + )
  1039 + return stage_input, fine_scores is not None, fine_meta, fine_debug_rows
  1040 +
  1041 + hits, fine_ranks_by_doc, fine_debug_info = _run_optional_stage(
  1042 + stage=RequestContextStage.FINE_RANKING,
  1043 + stage_label="精排",
  1044 + enabled=fine_enabled,
  1045 + stage_hits=es_response.get("hits", {}).get("hits") or [],
  1046 + input_limit=fine_input_window,
  1047 + output_limit=fine_output_window,
  1048 + service_profile=fine_cfg.service_profile,
  1049 + query_template=fine_query_template,
  1050 + doc_template=fine_doc_template,
  1051 + top_n=fine_output_window,
  1052 + debug_key="fine_rank_scores",
  1053 + runner=_run_fine_stage,
  1054 + )
  1055 + es_response["hits"]["hits"] = hits
840 1056
841 - context.start_stage(RequestContextStage.RERANKING)  
842 - try:  
843 - final_hits = es_response.get("hits", {}).get("hits") or []  
844 - final_input = final_hits[:rerank_window]  
845 - es_response["hits"]["hits"] = final_input 1057 + def _run_rerank_stage(stage_input: List[Dict[str, Any]]):
  1058 + nonlocal es_response
  1059 +
  1060 + es_response["hits"]["hits"] = stage_input
846 es_response, rerank_meta, fused_debug = run_rerank( 1061 es_response, rerank_meta, fused_debug = run_rerank(
847 query=rerank_query, 1062 query=rerank_query,
848 es_response=es_response, 1063 es_response=es_response,
@@ -858,48 +1073,31 @@ class Searcher: @@ -858,48 +1073,31 @@ class Searcher:
858 service_profile=rc.service_profile, 1073 service_profile=rc.service_profile,
859 style_intent_selected_sku_boost=self.config.query_config.style_intent_selected_sku_boost, 1074 style_intent_selected_sku_boost=self.config.query_config.style_intent_selected_sku_boost,
860 ) 1075 )
861 -  
862 - if rerank_meta is not None:  
863 - if debug:  
864 - rerank_ranks_by_doc = {  
865 - str(hit.get("_id")): rank  
866 - for rank, hit in enumerate(es_response.get("hits", {}).get("hits") or [], 1)  
867 - if hit.get("_id") is not None  
868 - }  
869 - rerank_backend_name, rerank_backend_cfg = get_rerank_backend_config(rc.service_profile)  
870 - rerank_debug_info = {  
871 - "service_profile": rc.service_profile,  
872 - "service_url": get_rerank_service_url(profile=rc.service_profile),  
873 - "backend": rerank_backend_name,  
874 - "model": rerank_meta.get("model") if isinstance(rerank_meta, dict) else None,  
875 - "backend_model_name": rerank_backend_cfg.get("model_name"),  
876 - "query_template": effective_query_template,  
877 - "doc_template": effective_doc_template,  
878 - "query_text": str(effective_query_template).format_map({"query": rerank_query}),  
879 - "docs_in": len(final_input),  
880 - "docs_out": len(es_response.get("hits", {}).get("hits") or []),  
881 - "top_n": from_ + size,  
882 - "meta": rerank_meta,  
883 - "fusion": asdict(rc.fusion),  
884 - }  
885 - context.store_intermediate_result("rerank_scores", fused_debug)  
886 - context.logger.info(  
887 - f"重排完成 | docs={len(es_response.get('hits', {}).get('hits') or [])} | "  
888 - f"top_n={from_ + size} | meta={rerank_meta}",  
889 - extra={'reqid': context.reqid, 'uid': context.uid}  
890 - )  
891 - except Exception as e:  
892 - context.add_warning(f"Rerank failed: {e}")  
893 - context.logger.warning(  
894 - f"调用重排服务失败 | error: {e}",  
895 - extra={'reqid': context.reqid, 'uid': context.uid},  
896 - exc_info=True, 1076 + return (
  1077 + es_response.get("hits", {}).get("hits") or [],
  1078 + rerank_meta is not None,
  1079 + rerank_meta,
  1080 + fused_debug,
897 ) 1081 )
898 - finally:  
899 - context.end_stage(RequestContextStage.RERANKING)  
900 1082
901 - # When this request is inside the rerank window: multi-stage ranking has produced the top rerank_window hits; slice by the requested from/size for pagination
902 - if in_rerank_window: 1083 + hits, rerank_ranks_by_doc, rerank_debug_info = _run_optional_stage(
  1084 + stage=RequestContextStage.RERANKING,
  1085 + stage_label="重排",
  1086 + enabled=do_rerank,
  1087 + stage_hits=es_response.get("hits", {}).get("hits") or [],
  1088 + input_limit=rerank_window,
  1089 + output_limit=rerank_window,
  1090 + service_profile=rc.service_profile,
  1091 + query_template=effective_query_template,
  1092 + doc_template=effective_doc_template,
  1093 + top_n=from_ + size,
  1094 + debug_key="rerank_scores",
  1095 + runner=_run_rerank_stage,
  1096 + )
  1097 + es_response["hits"]["hits"] = hits
  1098 +
  1099 + # When this request is inside the ranking window: multi-stage ranking has produced the top rerank_window hits; slice by the requested from/size for pagination
  1100 + if in_rank_window:
903 hits = es_response.get("hits", {}).get("hits") or [] 1101 hits = es_response.get("hits", {}).get("hits") or []
904 sliced = hits[from_ : from_ + size] 1102 sliced = hits[from_ : from_ + size]
905 es_response.setdefault("hits", {})["hits"] = sliced 1103 es_response.setdefault("hits", {})["hits"] = sliced
@@ -961,12 +1159,12 @@ class Searcher: @@ -961,12 +1159,12 @@ class Searcher:
961 context.end_stage(RequestContextStage.ELASTICSEARCH_PAGE_FILL) 1159 context.end_stage(RequestContextStage.ELASTICSEARCH_PAGE_FILL)
962 1160
963 context.logger.info( 1161 context.logger.info(
964 - f"重排分页切片 | from={from_}, size={size}, 返回={len(sliced)}条", 1162 + f"排序窗口分页切片 | from={from_}, size={size}, 返回={len(sliced)}条",
965 extra={'reqid': context.reqid, 'uid': context.uid} 1163 extra={'reqid': context.reqid, 'uid': context.uid}
966 ) 1164 )
967 1165
968 # Outside the rerank window: run style intent before result_processing so it can be timed separately and dovetail with the ES recall stage 1166
969 - if self._has_style_intent(parsed_query) and not in_rerank_window: 1167 + if self._has_style_intent(parsed_query) and not in_rank_window:
970 es_hits_pre = es_response.get("hits", {}).get("hits") or [] 1168 es_hits_pre = es_response.get("hits", {}).get("hits") or []
971 style_intent_decisions = self._apply_style_intent_to_hits( 1169 style_intent_decisions = self._apply_style_intent_to_hits(
972 es_hits_pre, 1170 es_hits_pre,
@@ -1259,7 +1457,7 @@ class Searcher: @@ -1259,7 +1457,7 @@ class Searcher:
1259 # Collect debug information if requested 1457 # Collect debug information if requested
1260 debug_info = None 1458 debug_info = None
1261 if debug: 1459 if debug:
1262 - query_tokens = getattr(parsed_query, "query_tokens", []) if parsed_query else [] 1460 + query_tokens = parsed_query.query_tokens if parsed_query else []
1263 token_count = len(query_tokens) 1461 token_count = len(query_tokens)
1264 text_knn_is_long = token_count >= 5 1462 text_knn_is_long = token_count >= 5
1265 text_knn_k = self.query_builder.knn_text_k_long if text_knn_is_long else self.query_builder.knn_text_k 1463 text_knn_k = self.query_builder.knn_text_k_long if text_knn_is_long else self.query_builder.knn_text_k
@@ -1279,7 +1477,7 @@ class Searcher: @@ -1279,7 +1477,7 @@ class Searcher:
1279 "translations": context.query_analysis.translations, 1477 "translations": context.query_analysis.translations,
1280 "keywords_queries": context.query_analysis.keywords_queries, 1478 "keywords_queries": context.query_analysis.keywords_queries,
1281 "has_vector": context.query_analysis.query_vector is not None, 1479 "has_vector": context.query_analysis.query_vector is not None,
1282 - "has_image_vector": getattr(parsed_query, "image_query_vector", None) is not None, 1480 + "has_image_vector": parsed_query.image_query_vector is not None,
1283 "query_tokens": query_tokens, 1481 "query_tokens": query_tokens,
1284 "intent_detection": context.get_intermediate_result("style_intent_profile"), 1482 "intent_detection": context.get_intermediate_result("style_intent_profile"),
1285 }, 1483 },
@@ -1298,9 +1496,10 @@ class Searcher: @@ -1298,9 +1496,10 @@ class Searcher:
1298 }, 1496 },
1299 "image_knn": { 1497 "image_knn": {
1300 "enabled": bool( 1498 "enabled": bool(
1301 - enable_embedding 1499 + self.image_embedding_field
  1500 + and enable_embedding
1302 and parsed_query 1501 and parsed_query
1303 - and getattr(parsed_query, "image_query_vector", None) is not None 1502 + and image_query_vector is not None
1304 ), 1503 ),
1305 "k": self.query_builder.knn_image_k, 1504 "k": self.query_builder.knn_image_k,
1306 "num_candidates": self.query_builder.knn_image_num_candidates, 1505 "num_candidates": self.query_builder.knn_image_num_candidates,
@@ -1311,9 +1510,14 @@ class Searcher: @@ -1311,9 +1510,14 @@ class Searcher:
1311 "es_query_context": { 1510 "es_query_context": {
1312 "es_fetch_from": es_fetch_from, 1511 "es_fetch_from": es_fetch_from,
1313 "es_fetch_size": es_fetch_size, 1512 "es_fetch_size": es_fetch_size,
1314 - "in_rerank_window": in_rerank_window,  
1315 - "rerank_prefetch_source": context.get_intermediate_result('es_query_rerank_prefetch_source'),  
1316 - "include_named_queries_score": bool(do_rerank and in_rerank_window), 1513 + "in_rank_window": in_rank_window,
  1514 + "include_named_queries_score": bool(in_rank_window),
  1515 + "exact_knn_rescore_enabled": bool(rc.exact_knn_rescore_enabled and in_rank_window),
  1516 + "exact_knn_rescore_window": (
  1517 + self._resolve_exact_knn_rescore_window()
  1518 + if rc.exact_knn_rescore_enabled and in_rank_window
  1519 + else None
  1520 + ),
1317 }, 1521 },
1318 "es_response": { 1522 "es_response": {
1319 "took_ms": es_response.get('took', 0), 1523 "took_ms": es_response.get('took', 0),
@@ -1369,10 +1573,10 @@ class Searcher: @@ -1369,10 +1573,10 @@ class Searcher:
1369 "retrieval_plan": debug_info["retrieval_plan"], 1573 "retrieval_plan": debug_info["retrieval_plan"],
1370 "ranking_windows": { 1574 "ranking_windows": {
1371 "es_fetch_size": es_fetch_size, 1575 "es_fetch_size": es_fetch_size,
1372 - "coarse_output_window": coarse_output_window if do_rerank and in_rerank_window else None,  
1373 - "fine_input_window": fine_input_window if do_rerank and in_rerank_window else None,  
1374 - "fine_output_window": fine_output_window if do_rerank and in_rerank_window else None,  
1375 - "rerank_window": rerank_window if do_rerank and in_rerank_window else None, 1576 + "coarse_output_window": coarse_output_window if in_rank_window else None,
  1577 + "fine_input_window": fine_input_window if in_rank_window else None,
  1578 + "fine_output_window": fine_output_window if in_rank_window else None,
  1579 + "rerank_window": rerank_window if in_rank_window else None,
1376 "page_from": from_, 1580 "page_from": from_,
1377 "page_size": size, 1581 "page_size": size,
1378 }, 1582 },
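The fine-ranking and rerank blocks above are now funneled through a single `_run_optional_stage` helper that either applies a stage runner to the head of the hit list or passes hits through unchanged. A minimal, dependency-free sketch of that passthrough-or-apply pattern (names and signature simplified and hypothetical, without the debug/logging plumbing):

```python
from typing import Callable, List, Optional, Tuple


def run_optional_stage(
    hits: List[dict],
    *,
    enabled: bool,
    input_limit: int,
    output_limit: int,
    runner: Callable[[List[dict]], Optional[List[dict]]],
) -> Tuple[List[dict], bool, Optional[str]]:
    """Return (output_hits, applied, skip_reason)."""
    input_hits = hits[:input_limit]
    output_hits = hits[:output_limit]  # passthrough default
    if not enabled:
        return output_hits, False, "disabled"
    if not input_hits:
        return output_hits, False, "no_hits"
    try:
        reordered = runner(input_hits)
    except Exception as e:
        # Degrade to passthrough on any stage failure
        return output_hits, False, f"error: {e}"
    if reordered is None:
        return output_hits, False, "service_returned_none"
    return reordered[:output_limit], True, None
```

Centralizing the passthrough, skip-reason, and error handling this way keeps each stage's runner down to "call the service, return reordered hits or None", which is what `_run_fine_stage` and `_run_rerank_stage` reduce to in the diff.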
suggestion/builder.py
@@ -366,7 +366,8 @@ class SuggestionIndexBuilder: @@ -366,7 +366,8 @@ class SuggestionIndexBuilder:
366 366
367 index_name = get_tenant_index_name(tenant_id) 367 index_name = get_tenant_index_name(tenant_id)
368 search_after: Optional[List[Any]] = None 368 search_after: Optional[List[Any]] = None
369 - 369 + logger.debug("Using index %s for tenant %s", index_name, tenant_id)
  370 + total_processed = 0
370 while True: 371 while True:
371 body: Dict[str, Any] = { 372 body: Dict[str, Any] = {
372 "size": batch_size, 373 "size": batch_size,
@@ -385,10 +386,13 @@ class SuggestionIndexBuilder: @@ -385,10 +386,13 @@ class SuggestionIndexBuilder:
385 if not hits: 386 if not hits:
386 break 387 break
387 for hit in hits: 388 for hit in hits:
  389 + total_processed += 1
388 yield hit 390 yield hit
389 search_after = hits[-1].get("sort") 391 search_after = hits[-1].get("sort")
390 if len(hits) < batch_size: 392 if len(hits) < batch_size:
391 break 393 break
  394 + logger.debug("Processed %s products total for tenant %s", total_processed, tenant_id)
  395 +
392 396
393 def _iter_query_log_rows( 397 def _iter_query_log_rows(
394 self, 398 self,
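`_iter_products` pages through the tenant index with `search_after`, sorting on a stable tiebreaker and resuming each page from the last hit's `sort` values. A self-contained sketch of that loop, assuming a client object with a `search(index=..., body=...)` method (a hypothetical stand-in for the real Elasticsearch client):

```python
from typing import Any, Dict, Iterator, List, Optional


def iter_all_hits(client, index: str, batch_size: int = 500) -> Iterator[Dict[str, Any]]:
    search_after: Optional[List[Any]] = None
    while True:
        body: Dict[str, Any] = {
            "size": batch_size,
            # A stable sort key is required so search_after resumes deterministically
            "sort": [{"spu_id": {"order": "asc", "missing": "_last"}}],
            "query": {"match_all": {}},
        }
        if search_after is not None:
            body["search_after"] = search_after
        hits = client.search(index=index, body=body).get("hits", {}).get("hits", []) or []
        if not hits:
            break
        yield from hits
        # Resume the next page from the last hit's sort values
        search_after = hits[-1].get("sort")
        if len(hits) < batch_size:  # short page means this was the last page
            break
```

Unlike `from`/`size` paging, this avoids the `index.max_result_window` cap and scans the whole index in constant memory per page.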
suggestion/builder.py.bak 0 → 100644
@@ -0,0 +1,1014 @@ @@ -0,0 +1,1014 @@
  1 +"""
  2 +Suggestion index builder (Phase 2).
  3 +
  4 +Capabilities:
  5 +- Full rebuild to versioned index
  6 +- Atomic alias publish
  7 +- Incremental update from query logs with watermark
  8 +"""
  9 +
  10 +import json
  11 +import logging
  12 +import math
  13 +import re
  14 +import unicodedata
  15 +from dataclasses import dataclass, field
  16 +from datetime import datetime, timedelta, timezone
  17 +from typing import Any, Dict, Iterator, List, Optional, Tuple
  18 +
  19 +from sqlalchemy import text
  20 +
  21 +from config.loader import get_app_config
  22 +from config.tenant_config_loader import get_tenant_config_loader
  23 +from query.query_parser import detect_text_language_for_suggestions
  24 +from suggestion.mapping import build_suggestion_mapping
  25 +from utils.es_client import ESClient
  26 +
  27 +logger = logging.getLogger(__name__)
  28 +
  29 +
  30 +def _index_prefix() -> str:
  31 + return get_app_config().runtime.index_namespace or ""
  32 +
  33 +
  34 +def get_suggestion_alias_name(tenant_id: str) -> str:
  35 + """Read alias for suggestion index (single source of truth)."""
  36 + return f"{_index_prefix()}search_suggestions_tenant_{tenant_id}_current"
  37 +
  38 +
  39 +def get_suggestion_versioned_index_name(tenant_id: str, build_at: Optional[datetime] = None) -> str:
  40 + """Versioned suggestion index name."""
  41 + ts = (build_at or datetime.now(timezone.utc)).strftime("%Y%m%d%H%M%S%f")
  42 + return f"{_index_prefix()}search_suggestions_tenant_{tenant_id}_v{ts}"
  43 +
  44 +
  45 +def get_suggestion_versioned_index_pattern(tenant_id: str) -> str:
  46 + return f"{_index_prefix()}search_suggestions_tenant_{tenant_id}_v*"
  47 +
  48 +
  49 +def get_suggestion_meta_index_name() -> str:
  50 + return f"{_index_prefix()}search_suggestions_meta"
  51 +
  52 +
  53 +@dataclass
  54 +class SuggestionCandidate:
  55 + text: str
  56 + text_norm: str
  57 + lang: str
  58 + sources: set = field(default_factory=set)
  59 + title_spu_ids: set = field(default_factory=set)
  60 + qanchor_spu_ids: set = field(default_factory=set)
  61 + tag_spu_ids: set = field(default_factory=set)
  62 + query_count_7d: int = 0
  63 + query_count_30d: int = 0
  64 + lang_confidence: float = 1.0
  65 + lang_source: str = "default"
  66 + lang_conflict: bool = False
  67 +
  68 + def add_product(self, source: str, spu_id: str) -> None:
  69 + self.sources.add(source)
  70 + if source == "title":
  71 + self.title_spu_ids.add(spu_id)
  72 + elif source == "qanchor":
  73 + self.qanchor_spu_ids.add(spu_id)
  74 + elif source == "tag":
  75 + self.tag_spu_ids.add(spu_id)
  76 +
  77 + def add_query_log(self, is_7d: bool) -> None:
  78 + self.sources.add("query_log")
  79 + self.query_count_30d += 1
  80 + if is_7d:
  81 + self.query_count_7d += 1
  82 +
  83 +
  84 +@dataclass
  85 +class QueryDelta:
  86 + tenant_id: str
  87 + lang: str
  88 + text: str
  89 + text_norm: str
  90 + delta_7d: int = 0
  91 + delta_30d: int = 0
  92 + lang_confidence: float = 1.0
  93 + lang_source: str = "default"
  94 + lang_conflict: bool = False
  95 +
  96 +
  97 +class SuggestionIndexBuilder:
  98 + """Build and update suggestion index."""
  99 +
  100 + def __init__(self, es_client: ESClient, db_engine: Any):
  101 + self.es_client = es_client
  102 + self.db_engine = db_engine
  103 +
  104 + def _format_allocation_failure(self, index_name: str) -> str:
  105 + health = self.es_client.wait_for_index_ready(index_name=index_name, timeout="5s")
  106 + explain = self.es_client.get_allocation_explain(index_name=index_name)
  107 +
  108 + parts = [
  109 + f"Suggestion index '{index_name}' was created but is not allocatable/readable yet",
  110 + f"health_status={health.get('status')}",
  111 + f"timed_out={health.get('timed_out')}",
  112 + ]
  113 + if health.get("error"):
  114 + parts.append(f"health_error={health['error']}")
  115 +
  116 + if explain:
  117 + unassigned = explain.get("unassigned_info") or {}
  118 + if unassigned.get("reason"):
  119 + parts.append(f"unassigned_reason={unassigned['reason']}")
  120 + if unassigned.get("last_allocation_status"):
  121 + parts.append(f"last_allocation_status={unassigned['last_allocation_status']}")
  122 +
  123 + for node in explain.get("node_allocation_decisions") or []:
  124 + node_name = node.get("node_name") or node.get("node_id") or "unknown-node"
  125 + for decider in node.get("deciders") or []:
  126 + if decider.get("decision") == "NO":
  127 + parts.append(
  128 + f"{node_name}:{decider.get('decider')}={decider.get('explanation')}"
  129 + )
  130 + return "; ".join(parts)
  131 +
  132 + return "; ".join(parts)
  133 +
  134 + def _create_fresh_versioned_index(
  135 + self,
  136 + tenant_id: str,
  137 + mapping: Dict[str, Any],
  138 + max_attempts: int = 5,
  139 + ) -> str:
  140 + for attempt in range(1, max_attempts + 1):
  141 + index_name = get_suggestion_versioned_index_name(tenant_id)
  142 + if self.es_client.index_exists(index_name):
  143 + logger.warning(
  144 + "Suggestion index name collision before create for tenant=%s index=%s attempt=%s/%s",
  145 + tenant_id,
  146 + index_name,
  147 + attempt,
  148 + max_attempts,
  149 + )
  150 + continue
  151 +
  152 + if self.es_client.create_index(index_name, mapping):
  153 + return index_name
  154 +
  155 + if self.es_client.index_exists(index_name):
  156 + logger.warning(
  157 + "Suggestion index name collision during create for tenant=%s index=%s attempt=%s/%s",
  158 + tenant_id,
  159 + index_name,
  160 + attempt,
  161 + max_attempts,
  162 + )
  163 + continue
  164 +
  165 + raise RuntimeError(f"Failed to create suggestion index: {index_name}")
  166 +
  167 + raise RuntimeError(
  168 + f"Failed to allocate a unique suggestion index name for tenant={tenant_id} after {max_attempts} attempts"
  169 + )
  170 +
  171 + def _ensure_new_index_ready(self, index_name: str) -> None:
  172 + health = self.es_client.wait_for_index_ready(index_name=index_name, timeout="5s")
  173 + if health.get("ok"):
  174 + return
  175 + raise RuntimeError(self._format_allocation_failure(index_name))
  176 +
  177 + @staticmethod
  178 + def _to_utc(dt: Any) -> Optional[datetime]:
  179 + if dt is None:
  180 + return None
  181 + if isinstance(dt, datetime):
  182 + if dt.tzinfo is None:
  183 + return dt.replace(tzinfo=timezone.utc)
  184 + return dt.astimezone(timezone.utc)
  185 + return None
  186 +
  187 + @staticmethod
  188 + def _normalize_text(value: str) -> str:
  189 + text_value = unicodedata.normalize("NFKC", (value or "")).strip().lower()
  190 + text_value = re.sub(r"\s+", " ", text_value)
  191 + return text_value
  192 +
  193 + @staticmethod
  194 + def _prepare_title_for_suggest(title: str, max_len: int = 120) -> str:
  195 + """
  196 + Keep title-derived suggestions concise:
  197 + - keep raw title when short enough
  198 + - for long titles, keep the leading phrase before common separators
  199 + - fallback to hard truncate
  200 + """
  201 + raw = str(title or "").strip()
  202 + if not raw:
  203 + return ""
  204 + if len(raw) <= max_len:
  205 + return raw
  206 +
  207 + head = re.split(r"[,,;;|/\\((\[【]", raw, maxsplit=1)[0].strip()
  208 + if 1 < len(head) <= max_len:
  209 + return head
  210 +
  211 + truncated = raw[:max_len].rstrip(" ,,;;|/\\-—–()()[]【】")
  212 + return truncated or raw[:max_len]
  213 +
  214 + @staticmethod
  215 + def _split_qanchors(value: Any) -> List[str]:
  216 + if value is None:
  217 + return []
  218 + if isinstance(value, list):
  219 + return [str(x).strip() for x in value if str(x).strip()]
  220 + raw = str(value).strip()
  221 + if not raw:
  222 + return []
  223 + parts = re.split(r"[,、,;|/\n\t]+", raw)
  224 + out = [p.strip() for p in parts if p and p.strip()]
  225 + if not out:
  226 + return [raw]
  227 + return out
  228 +
  229 + @staticmethod
  230 + def _iter_product_tags(raw: Any) -> List[str]:
  231 + if raw is None:
  232 + return []
  233 + if isinstance(raw, list):
  234 + return [str(x).strip() for x in raw if str(x).strip()]
  235 + s = str(raw).strip()
  236 + if not s:
  237 + return []
  238 + parts = re.split(r"[,、,;|/\n\t]+", s)
  239 + out = [p.strip() for p in parts if p and p.strip()]
  240 + return out if out else [s]
  241 +
  242 + def _iter_multilang_product_tags(
  243 + self,
  244 + raw: Any,
  245 + index_languages: List[str],
  246 + primary_language: str,
  247 + ) -> List[Tuple[str, str]]:
  248 + if isinstance(raw, dict):
  249 + pairs: List[Tuple[str, str]] = []
  250 + for lang in index_languages:
  251 + for tag in self._iter_product_tags(raw.get(lang)):
  252 + pairs.append((lang, tag))
  253 + return pairs
  254 +
  255 + pairs = []
  256 + for tag in self._iter_product_tags(raw):
  257 + tag_lang, _, _ = detect_text_language_for_suggestions(
  258 + tag,
  259 + index_languages=index_languages,
  260 + primary_language=primary_language,
  261 + )
  262 + pairs.append((tag_lang, tag))
  263 + return pairs
  264 +
  265 + @staticmethod
  266 + def _looks_noise(text_value: str) -> bool:
  267 + if not text_value:
  268 + return True
  269 + if len(text_value) > 120:
  270 + return True
  271 + if re.fullmatch(r"[\W_]+", text_value):
  272 + return True
  273 + return False
  274 +
  275 + @staticmethod
  276 + def _normalize_lang(lang: Optional[str]) -> Optional[str]:
  277 + if not lang:
  278 + return None
  279 + token = str(lang).strip().lower().replace("-", "_")
  280 + if not token:
  281 + return None
  282 + if token in {"zh_tw", "pt_br"}:
  283 + return token
  284 + return token.split("_")[0]
  285 +
  286 + @staticmethod
  287 + def _parse_request_params_language(raw: Any) -> Optional[str]:
  288 + if raw is None:
  289 + return None
  290 + if isinstance(raw, dict):
  291 + return raw.get("language")
  292 + text_raw = str(raw).strip()
  293 + if not text_raw:
  294 + return None
  295 + try:
  296 + obj = json.loads(text_raw)
  297 + if isinstance(obj, dict):
  298 + return obj.get("language")
  299 + except Exception:
  300 + return None
  301 + return None
  302 +
  303 + def _resolve_query_language(
  304 + self,
  305 + query: str,
  306 + log_language: Optional[str],
  307 + request_params: Any,
  308 + index_languages: List[str],
  309 + primary_language: str,
  310 + ) -> Tuple[str, float, str, bool]:
  311 + """Resolve lang with priority: log field > request_params > script/model."""
  312 + langs_set = set(index_languages or [])
  313 + primary = self._normalize_lang(primary_language) or "en"
  314 + if primary not in langs_set and langs_set:
  315 + primary = index_languages[0]
  316 +
  317 + log_lang = self._normalize_lang(log_language)
  318 + req_lang = self._normalize_lang(self._parse_request_params_language(request_params))
  319 + conflict = bool(log_lang and req_lang and log_lang != req_lang)
  320 +
  321 + if log_lang and (not langs_set or log_lang in langs_set):
  322 + return log_lang, 1.0, "log_field", conflict
  323 +
  324 + if req_lang and (not langs_set or req_lang in langs_set):
  325 + return req_lang, 1.0, "request_params", conflict
  326 +
  327 + det_lang, conf, det_source = detect_text_language_for_suggestions(
  328 + query,
  329 + index_languages=index_languages,
  330 + primary_language=primary,
  331 + )
  332 + if det_lang and (not langs_set or det_lang in langs_set):
  333 + return det_lang, conf, det_source, conflict
  334 +
  335 + return primary, 0.3, "default", conflict
  336 +
  337 + @staticmethod
  338 + def _compute_rank_score(
  339 + query_count_30d: int,
  340 + query_count_7d: int,
  341 + qanchor_doc_count: int,
  342 + title_doc_count: int,
  343 + tag_doc_count: int = 0,
  344 + ) -> float:
  345 + return (
  346 + 1.8 * math.log1p(max(query_count_30d, 0))
  347 + + 1.2 * math.log1p(max(query_count_7d, 0))
  348 + + 1.0 * math.log1p(max(qanchor_doc_count, 0))
  349 + + 0.85 * math.log1p(max(tag_doc_count, 0))
  350 + + 0.6 * math.log1p(max(title_doc_count, 0))
  351 + )
  352 +
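The weighted `log1p` blend above dampens raw counts so that popularity signals grow sub-linearly. A standalone sketch (weights copied from the method; the sample counts are hypothetical) shows that query-log volume dominates title-only coverage:

```python
import math

def compute_rank_score(q30: int, q7: int, qanchor: int, title: int, tag: int = 0) -> float:
    # Same weighted log1p blend as _compute_rank_score above.
    return (
        1.8 * math.log1p(max(q30, 0))
        + 1.2 * math.log1p(max(q7, 0))
        + 1.0 * math.log1p(max(qanchor, 0))
        + 0.85 * math.log1p(max(tag, 0))
        + 0.6 * math.log1p(max(title, 0))
    )

# A frequently searched term outranks one that only appears in product titles.
hot_query = compute_rank_score(q30=200, q7=40, qanchor=3, title=1)
title_only = compute_rank_score(q30=0, q7=0, qanchor=0, title=10)
assert hot_query > title_only
```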
  353 + @classmethod
  354 + def _compute_rank_score_from_candidate(cls, c: SuggestionCandidate) -> float:
  355 + return cls._compute_rank_score(
  356 + query_count_30d=c.query_count_30d,
  357 + query_count_7d=c.query_count_7d,
  358 + qanchor_doc_count=len(c.qanchor_spu_ids),
  359 + title_doc_count=len(c.title_spu_ids),
  360 + tag_doc_count=len(c.tag_spu_ids),
  361 + )
  362 +
  363 + def _iter_products(self, tenant_id: str, batch_size: int = 500) -> Iterator[Dict[str, Any]]:
  364 + """Stream product docs from tenant index using search_after."""
  365 + from indexer.mapping_generator import get_tenant_index_name
  366 +
  367 + index_name = get_tenant_index_name(tenant_id)
  368 + search_after: Optional[List[Any]] = None
  369 +
  370 + while True:
  371 + body: Dict[str, Any] = {
  372 + "size": batch_size,
  373 + "_source": ["id", "spu_id", "title", "qanchors", "enriched_tags"],
  374 + "sort": [
  375 + {"spu_id": {"order": "asc", "missing": "_last"}},
  376 + {"id.keyword": {"order": "asc", "missing": "_last"}},
  377 + ],
  378 + "query": {"match_all": {}},
  379 + }
  380 + if search_after is not None:
  381 + body["search_after"] = search_after
  382 +
  383 + resp = self.es_client.client.search(index=index_name, body=body)
  384 + hits = resp.get("hits", {}).get("hits", []) or []
  385 + if not hits:
  386 + break
  387 + for hit in hits:
  388 + yield hit
  389 + search_after = hits[-1].get("sort")
  390 + if len(hits) < batch_size:
  391 + break
  392 +
  393 + def _iter_query_log_rows(
  394 + self,
  395 + tenant_id: str,
  396 + since: datetime,
  397 + until: datetime,
  398 + fetch_size: int = 2000,
  399 + ) -> Iterator[Any]:
  400 + """Stream search logs from MySQL with bounded time range."""
  401 + query_sql = text(
  402 + """
  403 + SELECT query, language, request_params, create_time
  404 + FROM shoplazza_search_log
  405 + WHERE tenant_id = :tenant_id
  406 + AND deleted = 0
  407 + AND query IS NOT NULL
  408 + AND query <> ''
  409 + AND create_time >= :since_time
  410 + AND create_time < :until_time
  411 + ORDER BY create_time ASC
  412 + """
  413 + )
  414 +
  415 + with self.db_engine.connect().execution_options(stream_results=True) as conn:
  416 + result = conn.execute(
  417 + query_sql,
  418 + {
  419 + "tenant_id": int(tenant_id),
  420 + "since_time": since,
  421 + "until_time": until,
  422 + },
  423 + )
  424 + while True:
  425 + rows = result.fetchmany(fetch_size)
  426 + if not rows:
  427 + break
  428 + for row in rows:
  429 + yield row
  430 +
  431 + def _ensure_meta_index(self) -> str:
  432 + meta_index = get_suggestion_meta_index_name()
  433 + if self.es_client.index_exists(meta_index):
  434 + return meta_index
  435 + body = {
  436 + "settings": {
  437 + "number_of_shards": 1,
  438 + "number_of_replicas": 0,
  439 + "refresh_interval": "1s",
  440 + },
  441 + "mappings": {
  442 + "properties": {
  443 + "tenant_id": {"type": "keyword"},
  444 + "active_alias": {"type": "keyword"},
  445 + "active_index": {"type": "keyword"},
  446 + "last_full_build_at": {"type": "date"},
  447 + "last_incremental_build_at": {"type": "date"},
  448 + "last_incremental_watermark": {"type": "date"},
  449 + "updated_at": {"type": "date"},
  450 + }
  451 + },
  452 + }
  453 + if not self.es_client.create_index(meta_index, body):
  454 + raise RuntimeError(f"Failed to create suggestion meta index: {meta_index}")
  455 + return meta_index
  456 +
  457 + def _get_meta(self, tenant_id: str) -> Dict[str, Any]:
  458 + meta_index = self._ensure_meta_index()
  459 + try:
  460 + resp = self.es_client.client.get(index=meta_index, id=str(tenant_id))
  461 + return resp.get("_source", {}) or {}
  462 + except Exception:
  463 + return {}
  464 +
  465 + def _upsert_meta(self, tenant_id: str, patch: Dict[str, Any]) -> None:
  466 + meta_index = self._ensure_meta_index()
  467 + current = self._get_meta(tenant_id)
  468 + now_iso = datetime.now(timezone.utc).isoformat()
  469 + merged = {
  470 + "tenant_id": str(tenant_id),
  471 + **current,
  472 + **patch,
  473 + "updated_at": now_iso,
  474 + }
  475 + self.es_client.client.index(index=meta_index, id=str(tenant_id), document=merged, refresh="wait_for")
  476 +
  477 + def _cleanup_old_versions(self, tenant_id: str, keep_versions: int, protected_indices: Optional[List[str]] = None) -> List[str]:
  478 + if keep_versions < 1:
  479 + keep_versions = 1
  480 + protected = set(protected_indices or [])
  481 + pattern = get_suggestion_versioned_index_pattern(tenant_id)
  482 + all_indices = self.es_client.list_indices(pattern)
  483 + if len(all_indices) <= keep_versions:
  484 + return []
  485 +
  486 + # Names are timestamp-ordered by suffix; keep newest N.
  487 + kept = set(sorted(all_indices)[-keep_versions:])
  488 + dropped: List[str] = []
  489 + for idx in sorted(all_indices):
  490 + if idx in kept or idx in protected:
  491 + continue
  492 + if self.es_client.delete_index(idx):
  493 + dropped.append(idx)
  494 + return dropped
  495 +
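The retention rule relies on versioned index names sorting lexicographically by their timestamp suffix, and never drops a protected index. A quick sketch of the same selection logic with hypothetical index names:

```python
from typing import Iterable, List

def pick_indices_to_drop(
    all_indices: List[str],
    keep_versions: int = 2,
    protected: Iterable[str] = (),
) -> List[str]:
    # Keep the lexicographically newest N (timestamp suffix); never drop protected names.
    keep_versions = max(keep_versions, 1)
    kept = set(sorted(all_indices)[-keep_versions:])
    protected_set = set(protected)
    return [i for i in sorted(all_indices) if i not in kept and i not in protected_set]

names = [
    "suggest_162_v20240101",
    "suggest_162_v20240201",
    "suggest_162_v20240301",
]
assert pick_indices_to_drop(names, keep_versions=2) == ["suggest_162_v20240101"]
assert pick_indices_to_drop(names, keep_versions=2,
                            protected=("suggest_162_v20240101",)) == []
```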
  496 + def _publish_alias(self, tenant_id: str, index_name: str, keep_versions: int = 2) -> Dict[str, Any]:
  497 + alias_name = get_suggestion_alias_name(tenant_id)
  498 + current_indices = self.es_client.get_alias_indices(alias_name)
  499 +
  500 + actions: List[Dict[str, Any]] = []
  501 + for idx in current_indices:
  502 + actions.append({"remove": {"index": idx, "alias": alias_name}})
  503 + actions.append({"add": {"index": index_name, "alias": alias_name}})
  504 +
  505 + if not self.es_client.update_aliases(actions):
  506 + raise RuntimeError(f"Failed to publish alias {alias_name} -> {index_name}")
  507 +
  508 + dropped = self._cleanup_old_versions(
  509 + tenant_id=tenant_id,
  510 + keep_versions=keep_versions,
  511 + protected_indices=[index_name],
  512 + )
  513 +
  514 + self._upsert_meta(
  515 + tenant_id,
  516 + {
  517 + "active_alias": alias_name,
  518 + "active_index": index_name,
  519 + },
  520 + )
  521 +
  522 + return {
  523 + "alias": alias_name,
  524 + "previous_indices": current_indices,
  525 + "current_index": index_name,
  526 + "dropped_old_indices": dropped,
  527 + }
  528 +
  529 + def _resolve_incremental_target_index(self, tenant_id: str) -> Optional[str]:
  530 + """Resolve active suggestion index for incremental updates (alias only)."""
  531 + alias_name = get_suggestion_alias_name(tenant_id)
  532 + aliased = self.es_client.get_alias_indices(alias_name)
  533 + if aliased:
  534 + # alias should map to one index in this design
  535 + return sorted(aliased)[-1]
  536 + return None
  537 +
  538 + def _build_full_candidates(
  539 + self,
  540 + tenant_id: str,
  541 + index_languages: List[str],
  542 + primary_language: str,
  543 + days: int,
  544 + batch_size: int,
  545 + min_query_len: int,
  546 + ) -> Dict[Tuple[str, str], SuggestionCandidate]:
  547 + key_to_candidate: Dict[Tuple[str, str], SuggestionCandidate] = {}
  548 +
  549 + # Step 1: product title/qanchors
  550 + for hit in self._iter_products(tenant_id, batch_size=batch_size):
  551 + src = hit.get("_source", {}) or {}
  552 + product_id = str(src.get("spu_id") or src.get("id") or hit.get("_id") or "")
  553 + if not product_id:
  554 + continue
  555 + title_obj = src.get("title") or {}
  556 + qanchor_obj = src.get("qanchors") or {}
  557 +
  558 + for lang in index_languages:
  559 + title = ""
  560 + if isinstance(title_obj, dict):
  561 + title = self._prepare_title_for_suggest(title_obj.get(lang) or "")
  562 + if title:
  563 + text_norm = self._normalize_text(title)
  564 + if not self._looks_noise(text_norm):
  565 + key = (lang, text_norm)
  566 + c = key_to_candidate.get(key)
  567 + if c is None:
  568 + c = SuggestionCandidate(text=title, text_norm=text_norm, lang=lang)
  569 + key_to_candidate[key] = c
  570 + c.add_product("title", spu_id=product_id)
  571 +
  572 + q_raw = None
  573 + if isinstance(qanchor_obj, dict):
  574 + q_raw = qanchor_obj.get(lang)
  575 + for q_text in self._split_qanchors(q_raw):
  576 + text_norm = self._normalize_text(q_text)
  577 + if self._looks_noise(text_norm):
  578 + continue
  579 + key = (lang, text_norm)
  580 + c = key_to_candidate.get(key)
  581 + if c is None:
  582 + c = SuggestionCandidate(text=q_text, text_norm=text_norm, lang=lang)
  583 + key_to_candidate[key] = c
  584 + c.add_product("qanchor", spu_id=product_id)
  585 +
  586 + for tag_lang, tag in self._iter_multilang_product_tags(
  587 + src.get("enriched_tags"),
  588 + index_languages=index_languages,
  589 + primary_language=primary_language,
  590 + ):
  591 + text_norm = self._normalize_text(tag)
  592 + if self._looks_noise(text_norm):
  593 + continue
  594 + key = (tag_lang, text_norm)
  595 + c = key_to_candidate.get(key)
  596 + if c is None:
  597 + c = SuggestionCandidate(text=tag, text_norm=text_norm, lang=tag_lang)
  598 + key_to_candidate[key] = c
  599 + c.add_product("tag", spu_id=product_id)
  600 +
  601 + # Step 2: query logs
  602 + now = datetime.now(timezone.utc)
  603 + since = now - timedelta(days=days)
  604 + since_7d = now - timedelta(days=7)
  605 +
  606 + for row in self._iter_query_log_rows(tenant_id=tenant_id, since=since, until=now):
  607 + q = str(row.query or "").strip()
  608 + if len(q) < min_query_len:
  609 + continue
  610 +
  611 + lang, conf, source, conflict = self._resolve_query_language(
  612 + query=q,
  613 + log_language=getattr(row, "language", None),
  614 + request_params=getattr(row, "request_params", None),
  615 + index_languages=index_languages,
  616 + primary_language=primary_language,
  617 + )
  618 + text_norm = self._normalize_text(q)
  619 + if self._looks_noise(text_norm):
  620 + continue
  621 +
  622 + key = (lang, text_norm)
  623 + c = key_to_candidate.get(key)
  624 + if c is None:
  625 + c = SuggestionCandidate(text=q, text_norm=text_norm, lang=lang)
  626 + key_to_candidate[key] = c
  627 +
  628 + c.lang_confidence = max(c.lang_confidence, conf)
  629 + c.lang_source = source if c.lang_source == "default" else c.lang_source
  630 + c.lang_conflict = c.lang_conflict or conflict
  631 +
  632 + created_at = self._to_utc(getattr(row, "create_time", None))
  633 + is_7d = bool(created_at and created_at >= since_7d)
  634 + c.add_query_log(is_7d=is_7d)
  635 +
  636 + return key_to_candidate
  637 +
  638 + def _candidate_to_doc(self, tenant_id: str, c: SuggestionCandidate, now_iso: str) -> Dict[str, Any]:
  639 + rank_score = self._compute_rank_score_from_candidate(c)
  640 + completion_obj = {c.lang: {"input": [c.text], "weight": int(max(rank_score, 1.0) * 100)}}
  641 + sat_obj = {c.lang: c.text}
  642 + return {
  643 + "_id": f"{tenant_id}|{c.lang}|{c.text_norm}",
  644 + "tenant_id": str(tenant_id),
  645 + "lang": c.lang,
  646 + "text": c.text,
  647 + "text_norm": c.text_norm,
  648 + "sources": sorted(c.sources),
  649 + "title_doc_count": len(c.title_spu_ids),
  650 + "qanchor_doc_count": len(c.qanchor_spu_ids),
  651 + "tag_doc_count": len(c.tag_spu_ids),
  652 + "query_count_7d": c.query_count_7d,
  653 + "query_count_30d": c.query_count_30d,
  654 + "rank_score": float(rank_score),
  655 + "lang_confidence": float(c.lang_confidence),
  656 + "lang_source": c.lang_source,
  657 + "lang_conflict": bool(c.lang_conflict),
  658 + "status": 1,
  659 + "updated_at": now_iso,
  660 + "completion": completion_obj,
  661 + "sat": sat_obj,
  662 + }
  663 +
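The completion-weight mapping in `_candidate_to_doc` floors the rank score at 1.0 before scaling, so even a zero-signal candidate gets a positive integer weight. A minimal sketch of just that mapping:

```python
def completion_weight(rank_score: float) -> int:
    # Floor at 1.0 so a brand-new candidate still gets a usable weight of 100.
    return int(max(rank_score, 1.0) * 100)

assert completion_weight(0.0) == 100
assert completion_weight(2.5) == 250
assert completion_weight(0.7) == 100
```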
  664 + def rebuild_tenant_index(
  665 + self,
  666 + tenant_id: str,
  667 + days: int = 365,
  668 + batch_size: int = 500,
  669 + min_query_len: int = 1,
  670 + publish_alias: bool = True,
  671 + keep_versions: int = 2,
  672 + ) -> Dict[str, Any]:
  673 + """
  674 + Full rebuild.
  675 +
  676 + Phase2 default behavior:
  677 + - write to versioned index
  678 + - atomically publish alias
  679 + """
  680 + tenant_loader = get_tenant_config_loader()
  681 + tenant_cfg = tenant_loader.get_tenant_config(tenant_id)
  682 + index_languages: List[str] = tenant_cfg.get("index_languages") or ["en", "zh"]
  683 + primary_language: str = tenant_cfg.get("primary_language") or "en"
  684 +
  685 + alias_publish: Optional[Dict[str, Any]] = None
  686 + index_name: Optional[str] = None
  687 + try:
  688 + mapping = build_suggestion_mapping(index_languages=index_languages)
  689 + index_name = self._create_fresh_versioned_index(
  690 + tenant_id=tenant_id,
  691 + mapping=mapping,
  692 + )
  693 + self._ensure_new_index_ready(index_name)
  694 +
  695 + key_to_candidate = self._build_full_candidates(
  696 + tenant_id=tenant_id,
  697 + index_languages=index_languages,
  698 + primary_language=primary_language,
  699 + days=days,
  700 + batch_size=batch_size,
  701 + min_query_len=min_query_len,
  702 + )
  703 +
  704 + now_iso = datetime.now(timezone.utc).isoformat()
  705 + docs = [self._candidate_to_doc(tenant_id, c, now_iso) for c in key_to_candidate.values()]
  706 +
  707 + if docs:
  708 + bulk_result = self.es_client.bulk_index(index_name=index_name, docs=docs)
  709 + self.es_client.refresh(index_name)
  710 + else:
  711 + bulk_result = {"success": 0, "failed": 0, "errors": []}
  712 +
  713 + if publish_alias:
  714 + alias_publish = self._publish_alias(
  715 + tenant_id=tenant_id,
  716 + index_name=index_name,
  717 + keep_versions=keep_versions,
  718 + )
  719 +
  720 + now_utc = datetime.now(timezone.utc).isoformat()
  721 + meta_patch: Dict[str, Any] = {
  722 + "last_full_build_at": now_utc,
  723 + "last_incremental_watermark": now_utc,
  724 + }
  725 + if publish_alias:
  726 + meta_patch["active_index"] = index_name
  727 + meta_patch["active_alias"] = get_suggestion_alias_name(tenant_id)
  728 + self._upsert_meta(tenant_id, meta_patch)
  729 +
  730 + return {
  731 + "mode": "full",
  732 + "tenant_id": str(tenant_id),
  733 + "index_name": index_name,
  734 + "alias_published": bool(alias_publish),
  735 + "alias_publish": alias_publish,
  736 + "total_candidates": len(key_to_candidate),
  737 + "indexed_docs": len(docs),
  738 + "bulk_result": bulk_result,
  739 + }
  740 + except Exception:
  741 + if index_name and not alias_publish:
  742 + self.es_client.delete_index(index_name)
  743 + raise
  744 +
  745 + def _build_incremental_deltas(
  746 + self,
  747 + tenant_id: str,
  748 + index_languages: List[str],
  749 + primary_language: str,
  750 + since: datetime,
  751 + until: datetime,
  752 + min_query_len: int,
  753 + ) -> Dict[Tuple[str, str], QueryDelta]:
  754 + now = datetime.now(timezone.utc)
  755 + since_7d = now - timedelta(days=7)
  756 + deltas: Dict[Tuple[str, str], QueryDelta] = {}
  757 +
  758 + for row in self._iter_query_log_rows(tenant_id=tenant_id, since=since, until=until):
  759 + q = str(row.query or "").strip()
  760 + if len(q) < min_query_len:
  761 + continue
  762 +
  763 + lang, conf, source, conflict = self._resolve_query_language(
  764 + query=q,
  765 + log_language=getattr(row, "language", None),
  766 + request_params=getattr(row, "request_params", None),
  767 + index_languages=index_languages,
  768 + primary_language=primary_language,
  769 + )
  770 + text_norm = self._normalize_text(q)
  771 + if self._looks_noise(text_norm):
  772 + continue
  773 +
  774 + key = (lang, text_norm)
  775 + item = deltas.get(key)
  776 + if item is None:
  777 + item = QueryDelta(
  778 + tenant_id=str(tenant_id),
  779 + lang=lang,
  780 + text=q,
  781 + text_norm=text_norm,
  782 + lang_confidence=conf,
  783 + lang_source=source,
  784 + lang_conflict=conflict,
  785 + )
  786 + deltas[key] = item
  787 +
  788 + created_at = self._to_utc(getattr(row, "create_time", None))
  789 + item.delta_30d += 1
  790 + if created_at and created_at >= since_7d:
  791 + item.delta_7d += 1
  792 +
  793 + if conf > item.lang_confidence:
  794 + item.lang_confidence = conf
  795 + item.lang_source = source
  796 + item.lang_conflict = item.lang_conflict or conflict
  797 +
  798 + return deltas
  799 +
  800 + def _delta_to_upsert_doc(self, delta: QueryDelta, now_iso: str) -> Dict[str, Any]:
  801 + rank_score = self._compute_rank_score(
  802 + query_count_30d=delta.delta_30d,
  803 + query_count_7d=delta.delta_7d,
  804 + qanchor_doc_count=0,
  805 + title_doc_count=0,
  806 + tag_doc_count=0,
  807 + )
  808 + return {
  809 + "tenant_id": delta.tenant_id,
  810 + "lang": delta.lang,
  811 + "text": delta.text,
  812 + "text_norm": delta.text_norm,
  813 + "sources": ["query_log"],
  814 + "title_doc_count": 0,
  815 + "qanchor_doc_count": 0,
  816 + "tag_doc_count": 0,
  817 + "query_count_7d": delta.delta_7d,
  818 + "query_count_30d": delta.delta_30d,
  819 + "rank_score": float(rank_score),
  820 + "lang_confidence": float(delta.lang_confidence),
  821 + "lang_source": delta.lang_source,
  822 + "lang_conflict": bool(delta.lang_conflict),
  823 + "status": 1,
  824 + "updated_at": now_iso,
  825 + "completion": {
  826 + delta.lang: {
  827 + "input": [delta.text],
  828 + "weight": int(max(rank_score, 1.0) * 100),
  829 + }
  830 + },
  831 + "sat": {delta.lang: delta.text},
  832 + }
  833 +
  834 + @staticmethod
  835 + def _build_incremental_update_script() -> str:
  836 + return """
  837 + if (ctx._source == null || ctx._source.isEmpty()) {
  838 + ctx._source = params.upsert;
  839 + return;
  840 + }
  841 +
  842 + if (ctx._source.query_count_30d == null) { ctx._source.query_count_30d = 0; }
  843 + if (ctx._source.query_count_7d == null) { ctx._source.query_count_7d = 0; }
  844 + if (ctx._source.qanchor_doc_count == null) { ctx._source.qanchor_doc_count = 0; }
  845 + if (ctx._source.title_doc_count == null) { ctx._source.title_doc_count = 0; }
  846 + if (ctx._source.tag_doc_count == null) { ctx._source.tag_doc_count = 0; }
  847 +
  848 + ctx._source.query_count_30d += params.delta_30d;
  849 + ctx._source.query_count_7d += params.delta_7d;
  850 +
  851 + if (ctx._source.sources == null) { ctx._source.sources = new ArrayList(); }
  852 + if (!ctx._source.sources.contains('query_log')) { ctx._source.sources.add('query_log'); }
  853 +
  854 + if (ctx._source.lang_conflict == null) { ctx._source.lang_conflict = false; }
  855 + ctx._source.lang_conflict = ctx._source.lang_conflict || params.lang_conflict;
  856 +
  857 + if (ctx._source.lang_confidence == null || params.lang_confidence > ctx._source.lang_confidence) {
  858 + ctx._source.lang_confidence = params.lang_confidence;
  859 + ctx._source.lang_source = params.lang_source;
  860 + }
  861 +
  862 + int q30 = ctx._source.query_count_30d;
  863 + int q7 = ctx._source.query_count_7d;
  864 + int qa = ctx._source.qanchor_doc_count;
  865 + int td = ctx._source.title_doc_count;
  866 + int tg = ctx._source.tag_doc_count;
  867 +
  868 + double score = 1.8 * Math.log(1 + q30)
  869 + + 1.2 * Math.log(1 + q7)
  870 + + 1.0 * Math.log(1 + qa)
  871 + + 0.85 * Math.log(1 + tg)
  872 + + 0.6 * Math.log(1 + td);
  873 + ctx._source.rank_score = score;
  874 + ctx._source.status = 1;
  875 + ctx._source.updated_at = params.now_iso;
  876 + ctx._source.text = params.text;
  877 + ctx._source.lang = params.lang;
  878 + ctx._source.text_norm = params.text_norm;
  879 +
  880 + if (ctx._source.completion == null) { ctx._source.completion = new HashMap(); }
  881 + Map c = new HashMap();
  882 + c.put('input', params.completion_input);
  883 + c.put('weight', params.completion_weight);
  884 + ctx._source.completion.put(params.lang, c);
  885 +
  886 + if (ctx._source.sat == null) { ctx._source.sat = new HashMap(); }
  887 + ctx._source.sat.put(params.lang, params.text);
  888 + """
  889 +
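One invariant worth checking offline is that the Painless recomputation above stays numerically aligned with `_compute_rank_score`: Painless uses `Math.log(1 + x)` while the Python side uses `math.log1p(x)`, which agree to floating-point precision. A minimal parity sketch (weights copied from both snippets):

```python
import math

# Per-term weights: q30, q7, qanchor, tag, title (same order as both snippets).
WEIGHTS = (1.8, 1.2, 1.0, 0.85, 0.6)

def painless_style(q30: int, q7: int, qa: int, tg: int, td: int) -> float:
    # Mirrors the Painless expression: Math.log(1 + x) per term.
    terms = (q30, q7, qa, tg, td)
    return sum(w * math.log(1 + t) for w, t in zip(WEIGHTS, terms))

def python_style(q30: int, q7: int, qa: int, tg: int, td: int) -> float:
    # Mirrors _compute_rank_score: math.log1p per term.
    terms = (q30, q7, qa, tg, td)
    return sum(w * math.log1p(t) for w, t in zip(WEIGHTS, terms))

for args in [(0, 0, 0, 0, 0), (12, 3, 5, 2, 7), (1000, 200, 0, 0, 1)]:
    assert abs(painless_style(*args) - python_style(*args)) < 1e-9
```

If either side's weights change, this kind of check catches full-rebuild and incremental-update scores drifting apart.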
  890 + def _build_incremental_actions(self, target_index: str, deltas: Dict[Tuple[str, str], QueryDelta]) -> List[Dict[str, Any]]:
  891 + now_iso = datetime.now(timezone.utc).isoformat()
  892 + script_source = self._build_incremental_update_script()
  893 + actions: List[Dict[str, Any]] = []
  894 +
  895 + for delta in deltas.values():
  896 + upsert_doc = self._delta_to_upsert_doc(delta=delta, now_iso=now_iso)
  897 + upsert_rank = float(upsert_doc.get("rank_score") or 0.0)
  898 + action = {
  899 + "_op_type": "update",
  900 + "_index": target_index,
  901 + "_id": f"{delta.tenant_id}|{delta.lang}|{delta.text_norm}",
  902 + "scripted_upsert": True,
  903 + "script": {
  904 + "lang": "painless",
  905 + "source": script_source,
  906 + "params": {
  907 + "delta_30d": int(delta.delta_30d),
  908 + "delta_7d": int(delta.delta_7d),
  909 + "lang_confidence": float(delta.lang_confidence),
  910 + "lang_source": delta.lang_source,
  911 + "lang_conflict": bool(delta.lang_conflict),
  912 + "now_iso": now_iso,
  913 + "lang": delta.lang,
  914 + "text": delta.text,
  915 + "text_norm": delta.text_norm,
  916 + "completion_input": [delta.text],
  917 + "completion_weight": int(max(upsert_rank, 1.0) * 100),
  918 + "upsert": upsert_doc,
  919 + },
  920 + },
  921 + "upsert": upsert_doc,
  922 + }
  923 + actions.append(action)
  924 +
  925 + return actions
  926 +
  927 + def incremental_update_tenant_index(
  928 + self,
  929 + tenant_id: str,
  930 + min_query_len: int = 1,
  931 + fallback_days: int = 7,
  932 + overlap_minutes: int = 30,
  933 + bootstrap_if_missing: bool = True,
  934 + bootstrap_days: int = 30,
  935 + batch_size: int = 500,
  936 + ) -> Dict[str, Any]:
  937 + tenant_loader = get_tenant_config_loader()
  938 + tenant_cfg = tenant_loader.get_tenant_config(tenant_id)
  939 + index_languages: List[str] = tenant_cfg.get("index_languages") or ["en", "zh"]
  940 + primary_language: str = tenant_cfg.get("primary_language") or "en"
  941 +
  942 + target_index = self._resolve_incremental_target_index(tenant_id)
  943 + if not target_index:
  944 + if not bootstrap_if_missing:
  945 + raise RuntimeError(
  946 + f"No active suggestion index for tenant={tenant_id}. "
  947 + "Run full rebuild first or enable bootstrap_if_missing."
  948 + )
  949 + full_result = self.rebuild_tenant_index(
  950 + tenant_id=tenant_id,
  951 + days=bootstrap_days,
  952 + batch_size=batch_size,
  953 + min_query_len=min_query_len,
  954 + publish_alias=True
  955 + )
  956 + return {
  957 + "mode": "incremental",
  958 + "tenant_id": str(tenant_id),
  959 + "bootstrapped": True,
  960 + "bootstrap_result": full_result,
  961 + }
  962 +
  963 + meta = self._get_meta(tenant_id)
  964 + watermark_raw = meta.get("last_incremental_watermark") or meta.get("last_full_build_at")
  965 + now = datetime.now(timezone.utc)
  966 + default_since = now - timedelta(days=fallback_days)
  967 + since = None
  968 + if isinstance(watermark_raw, str) and watermark_raw.strip():
  969 + try:
  970 + since = self._to_utc(datetime.fromisoformat(watermark_raw.replace("Z", "+00:00")))
  971 + except Exception:
  972 + since = None
  973 + if since is None:
  974 + since = default_since
  975 + since = since - timedelta(minutes=max(overlap_minutes, 0))
  976 + if since < default_since:
  977 + since = default_since
  978 +
  979 + deltas = self._build_incremental_deltas(
  980 + tenant_id=tenant_id,
  981 + index_languages=index_languages,
  982 + primary_language=primary_language,
  983 + since=since,
  984 + until=now,
  985 + min_query_len=min_query_len,
  986 + )
  987 +
  988 + actions = self._build_incremental_actions(target_index=target_index, deltas=deltas)
  989 + bulk_result = self.es_client.bulk_actions(actions)
  990 + self.es_client.refresh(target_index)
  991 +
  992 + now_iso = now.isoformat()
  993 + self._upsert_meta(
  994 + tenant_id,
  995 + {
  996 + "last_incremental_build_at": now_iso,
  997 + "last_incremental_watermark": now_iso,
  998 + "active_index": target_index,
  999 + "active_alias": get_suggestion_alias_name(tenant_id),
  1000 + },
  1001 + )
  1002 +
  1003 + return {
  1004 + "mode": "incremental",
  1005 + "tenant_id": str(tenant_id),
  1006 + "target_index": target_index,
  1007 + "query_window": {
  1008 + "since": since.isoformat(),
  1009 + "until": now_iso,
  1010 + "overlap_minutes": int(overlap_minutes),
  1011 + },
  1012 + "updated_terms": len(deltas),
  1013 + "bulk_result": bulk_result,
  1014 + }
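The watermark arithmetic in `incremental_update_tenant_index` (parse the stored watermark, subtract the overlap so boundary rows are re-read, clamp to the fallback window) is easy to verify in isolation. This sketch replicates it with plain datetimes:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

def compute_since(
    watermark_iso: Optional[str],
    now: datetime,
    fallback_days: int = 7,
    overlap_minutes: int = 30,
) -> datetime:
    """Window start: watermark minus overlap, never earlier than now - fallback_days."""
    default_since = now - timedelta(days=fallback_days)
    since = None
    if watermark_iso:
        try:
            since = datetime.fromisoformat(watermark_iso.replace("Z", "+00:00"))
        except ValueError:
            since = None
    if since is None:
        since = default_since
    since -= timedelta(minutes=max(overlap_minutes, 0))
    return max(since, default_since)

now = datetime(2024, 6, 15, 12, 0, tzinfo=timezone.utc)
# A recent watermark is pushed back by the overlap to re-read boundary rows.
assert compute_since("2024-06-15T10:00:00+00:00", now) == \
    datetime(2024, 6, 15, 9, 30, tzinfo=timezone.utc)
# An unparsable or very old watermark is clamped to the bounded fallback window.
assert compute_since("not-a-date", now) == now - timedelta(days=7)
assert compute_since("2024-01-01T00:00:00Z", now) == now - timedelta(days=7)
```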
tests/ci/test_service_api_contracts.py
@@ -345,8 +345,15 @@ def test_indexer_build_docs_from_db_contract(indexer_client: TestClient):
345 def test_indexer_enrich_content_contract(indexer_client: TestClient, monkeypatch): 345 def test_indexer_enrich_content_contract(indexer_client: TestClient, monkeypatch):
346 import indexer.product_enrich as process_products 346 import indexer.product_enrich as process_products
347 347
348 - def _fake_build_index_content_fields(items: List[Dict[str, str]], tenant_id: str | None = None): 348 + def _fake_build_index_content_fields(
  349 + items: List[Dict[str, str]],
  350 + tenant_id: str | None = None,
  351 + enrichment_scopes: List[str] | None = None,
  352 + category_taxonomy_profile: str = "apparel",
  353 + ):
349 assert tenant_id == "162" 354 assert tenant_id == "162"
  355 + assert enrichment_scopes == ["generic", "category_taxonomy"]
  356 + assert category_taxonomy_profile == "apparel"
350 return [ 357 return [
351 { 358 {
352 "id": p["spu_id"], 359 "id": p["spu_id"],
@@ -358,6 +365,9 @@ def test_indexer_enrich_content_contract(indexer_client: TestClient, monkeypatch
358 "enriched_attributes": [ 365 "enriched_attributes": [
359 {"name": "enriched_tags", "value": {"zh": ["tag1"], "en": ["tag1"]}}, 366 {"name": "enriched_tags", "value": {"zh": ["tag1"], "en": ["tag1"]}},
360 ], 367 ],
  368 + "enriched_taxonomy_attributes": [
  369 + {"name": "Product Type", "value": {"zh": ["T恤"], "en": ["t-shirt"]}},
  370 + ],
361 } 371 }
362 for p in items 372 for p in items
363 ] 373 ]
@@ -368,6 +378,8 @@ def test_indexer_enrich_content_contract(indexer_client: TestClient, monkeypatch
368 "/indexer/enrich-content", 378 "/indexer/enrich-content",
369 json={ 379 json={
370 "tenant_id": "162", 380 "tenant_id": "162",
  381 + "enrichment_scopes": ["generic", "category_taxonomy"],
  382 + "category_taxonomy_profile": "apparel",
371 "items": [ 383 "items": [
372 {"spu_id": "1001", "title": "T-shirt"}, 384 {"spu_id": "1001", "title": "T-shirt"},
373 {"spu_id": "1002", "title": "Toy"}, 385 {"spu_id": "1002", "title": "Toy"},
@@ -377,6 +389,8 @@ def test_indexer_enrich_content_contract(indexer_client: TestClient, monkeypatch
377 assert response.status_code == 200 389 assert response.status_code == 200
378 data = response.json() 390 data = response.json()
379 assert data["tenant_id"] == "162" 391 assert data["tenant_id"] == "162"
  392 + assert data["enrichment_scopes"] == ["generic", "category_taxonomy"]
  393 + assert data["category_taxonomy_profile"] == "apparel"
380 assert data["total"] == 2 394 assert data["total"] == 2
381 assert len(data["results"]) == 2 395 assert len(data["results"]) == 2
382 assert data["results"][0]["spu_id"] == "1001" 396 assert data["results"][0]["spu_id"] == "1001"
@@ -388,6 +402,102 @@ def test_indexer_enrich_content_contract(indexer_client: TestClient, monkeypatch
388 "name": "enriched_tags", 402 "name": "enriched_tags",
389 "value": {"zh": ["tag1"], "en": ["tag1"]}, 403 "value": {"zh": ["tag1"], "en": ["tag1"]},
390 } 404 }
  405 + assert data["results"][0]["enriched_taxonomy_attributes"][0] == {
  406 + "name": "Product Type",
  407 + "value": {"zh": ["T恤"], "en": ["t-shirt"]},
  408 + }
  409 +
  410 +
  411 +def test_indexer_enrich_content_contract_accepts_deprecated_analysis_kinds(indexer_client: TestClient, monkeypatch):
  412 + import indexer.product_enrich as process_products
  413 +
  414 + seen: Dict[str, Any] = {}
  415 +
  416 + def _fake_build_index_content_fields(
  417 + items: List[Dict[str, str]],
  418 + tenant_id: str | None = None,
  419 + enrichment_scopes: List[str] | None = None,
  420 + category_taxonomy_profile: str = "apparel",
  421 + ):
  422 + seen["tenant_id"] = tenant_id
  423 + seen["enrichment_scopes"] = enrichment_scopes
  424 + seen["category_taxonomy_profile"] = category_taxonomy_profile
  425 + return [
  426 + {
  427 + "id": items[0]["spu_id"],
  428 + "qanchors": {},
  429 + "enriched_tags": {},
  430 + "enriched_attributes": [],
  431 + "enriched_taxonomy_attributes": [],
  432 + }
  433 + ]
  434 +
  435 + monkeypatch.setattr(process_products, "build_index_content_fields", _fake_build_index_content_fields)
  436 +
  437 + response = indexer_client.post(
  438 + "/indexer/enrich-content",
  439 + json={
  440 + "tenant_id": "162",
  441 + "analysis_kinds": ["taxonomy"],
  442 + "items": [{"spu_id": "1001", "title": "T-shirt"}],
  443 + },
  444 + )
  445 +
  446 + assert response.status_code == 200
  447 + data = response.json()
  448 + assert seen == {
  449 + "tenant_id": "162",
  450 + "enrichment_scopes": ["category_taxonomy"],
  451 + "category_taxonomy_profile": "apparel",
  452 + }
  453 + assert data["enrichment_scopes"] == ["category_taxonomy"]
  454 + assert data["category_taxonomy_profile"] == "apparel"
  455 +
  456 +
  457 +def test_indexer_enrich_content_contract_supports_non_apparel_taxonomy_profiles(indexer_client: TestClient, monkeypatch):
  458 + import indexer.product_enrich as process_products
  459 +
  460 + def _fake_build_index_content_fields(
  461 + items: List[Dict[str, str]],
  462 + tenant_id: str | None = None,
  463 + enrichment_scopes: List[str] | None = None,
  464 + category_taxonomy_profile: str = "apparel",
  465 + ):
  466 + assert tenant_id == "162"
  467 + assert enrichment_scopes == ["category_taxonomy"]
  468 + assert category_taxonomy_profile == "toys"
  469 + return [
  470 + {
  471 + "id": items[0]["spu_id"],
  472 + "qanchors": {},
  473 + "enriched_tags": {},
  474 + "enriched_attributes": [],
  475 + "enriched_taxonomy_attributes": [
  476 + {"name": "Product Type", "value": {"en": ["doll set"]}},
  477 + {"name": "Age Group", "value": {"en": ["kids"]}},
  478 + ],
  479 + }
  480 + ]
  481 +
  482 + monkeypatch.setattr(process_products, "build_index_content_fields", _fake_build_index_content_fields)
  483 +
  484 + response = indexer_client.post(
  485 + "/indexer/enrich-content",
  486 + json={
  487 + "tenant_id": "162",
  488 + "enrichment_scopes": ["category_taxonomy"],
  489 + "category_taxonomy_profile": "toys",
  490 + "items": [{"spu_id": "1001", "title": "Toy"}],
  491 + },
  492 + )
  493 +
  494 + assert response.status_code == 200
  495 + data = response.json()
  496 + assert data["category_taxonomy_profile"] == "toys"
  497 + assert data["results"][0]["enriched_taxonomy_attributes"] == [
  498 + {"name": "Product Type", "value": {"en": ["doll set"]}},
  499 + {"name": "Age Group", "value": {"en": ["kids"]}},
  500 + ]
391 501
392 502
393 def test_indexer_documents_contract(indexer_client: TestClient): 503 def test_indexer_documents_contract(indexer_client: TestClient):
tests/manual/README.md 0 → 100644
@@ -0,0 +1,5 @@
  1 +# Manual Tests
  2 +
  3 +`tests/manual/` holds trial-run scripts that require manually starting dependent services, observing results by hand, or relying on a real external environment.
  4 +
  5 +These scripts are outside the automated `pytest` regression scope and should not be lumped in with the contract tests under `tests/ci`.
scripts/test_build_docs_api.py renamed to tests/manual/test_build_docs_api.py
@@ -4,9 +4,9 @@
4 4
5 Usage: 5
6 1. Start the Indexer service first: ./scripts/start_indexer.sh (or uvicorn api.indexer_app:app --port 6004) 6 1. Start the Indexer service first: ./scripts/start_indexer.sh (or uvicorn api.indexer_app:app --port 6004)
7 - 2. Run: python scripts/test_build_docs_api.py 7 + 2. Run: python tests/manual/test_build_docs_api.py
8 8
9 - You can also specify the address: INDEXER_URL=http://localhost:6004 python scripts/test_build_docs_api.py 9 + You can also specify the address: INDEXER_URL=http://localhost:6004 python tests/manual/test_build_docs_api.py
10 """ 10 """
11 11
12 import json 12 import json
@@ -15,7 +15,7 @@ import sys @@ -15,7 +15,7 @@ import sys
15 from datetime import datetime, timezone 15 from datetime import datetime, timezone
16 16
17 # Project root directory 17 # Project root directory
18 -ROOT = os.path.dirname(os.path.dirname(os.path.abspath(__file__))) 18 +ROOT = os.path.dirname(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
19 sys.path.insert(0, ROOT) 19 sys.path.insert(0, ROOT)
20 20
21 # Call the real service via requests by default; fall back to TestClient if it is not installed 21 # Call the real service via requests by default; fall back to TestClient if it is not installed
@@ -122,7 +122,7 @@ def main(): @@ -122,7 +122,7 @@ def main():
122 print("\n[Error] Unable to connect to the Indexer service:", e) 122 print("\n[Error] Unable to connect to the Indexer service:", e)
123 print("Start it first: ./scripts/start_indexer.sh or uvicorn api.indexer_app:app --port 6004") 123 print("Start it first: ./scripts/start_indexer.sh or uvicorn api.indexer_app:app --port 6004")
124 if HAS_REQUESTS: 124 if HAS_REQUESTS:
125 - print("Or use an in-process test: USE_TEST_CLIENT=1 python scripts/test_build_docs_api.py") 125 + print("Or use an in-process test: USE_TEST_CLIENT=1 python tests/manual/test_build_docs_api.py")
126 sys.exit(1) 126 sys.exit(1)
127 else: 127 else:
128 if not use_http and not HAS_REQUESTS: 128 if not use_http and not HAS_REQUESTS:
tests/test_embedding_pipeline.py
  1 +from dataclasses import asdict
1 from typing import Any, Dict, List, Optional 2 from typing import Any, Dict, List, Optional
2 3
3 import numpy as np 4 import numpy as np
tests/test_es_query_builder.py
@@ -208,3 +208,36 @@ def test_image_knn_clause_is_added_alongside_base_translation_and_text_knn(): @@ -208,3 +208,36 @@ def test_image_knn_clause_is_added_alongside_base_translation_and_text_knn():
208 assert image_knn["path"] == "image_embedding" 208 assert image_knn["path"] == "image_embedding"
209 assert image_knn["score_mode"] == "max" 209 assert image_knn["score_mode"] == "max"
210 assert image_knn["query"]["knn"]["field"] == "image_embedding.vector" 210 assert image_knn["query"]["knn"]["field"] == "image_embedding.vector"
  211 +
  212 +
  213 +def test_text_knn_plan_is_reused_for_ann_and_exact_rescore():
  214 + qb = _builder()
  215 + parsed_query = SimpleNamespace(query_tokens=["a", "b", "c", "d", "e"])
  216 +
  217 + ann_clause = qb.build_text_knn_clause(
  218 + np.array([0.1, 0.2, 0.3]),
  219 + parsed_query=parsed_query,
  220 + )
  221 + exact_clause = qb.build_exact_text_knn_rescore_clause(
  222 + np.array([0.1, 0.2, 0.3]),
  223 + parsed_query=parsed_query,
  224 + )
  225 +
  226 + assert ann_clause is not None
  227 + assert exact_clause is not None
  228 + assert ann_clause["knn"]["k"] == qb.knn_text_k_long
  229 + assert ann_clause["knn"]["num_candidates"] == qb.knn_text_num_candidates_long
  230 + assert ann_clause["knn"]["boost"] == qb.knn_text_boost * 1.4
  231 + assert exact_clause["script_score"]["script"]["params"]["boost"] == qb.knn_text_boost * 1.4
  232 +
  233 +
  234 +def test_image_knn_plan_is_reused_for_ann_and_exact_rescore():
  235 + qb = _builder()
  236 +
  237 + ann_clause = qb.build_image_knn_clause(np.array([0.4, 0.5, 0.6]))
  238 + exact_clause = qb.build_exact_image_knn_rescore_clause(np.array([0.4, 0.5, 0.6]))
  239 +
  240 + assert ann_clause is not None
  241 + assert exact_clause is not None
  242 + assert ann_clause["nested"]["query"]["knn"]["boost"] == qb.knn_image_boost
  243 + assert exact_clause["nested"]["query"]["script_score"]["script"]["params"]["boost"] == qb.knn_image_boost
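The two plan-reuse tests above pin a single invariant: the ANN `knn` clause and the exact `script_score` rescore clause must be derived from one shared plan so that `k`, `num_candidates`, and `boost` can never drift apart. A minimal sketch of that idea, using the constants the tests reference (120/160, 400/500, 20.0) but with illustrative helper names, not the project's actual builder API:

```python
# Sketch: derive both clauses from one plan. Function names are hypothetical.
def make_text_knn_plan(vector, *, long_query: bool):
    return {
        "field": "text_embedding",
        "vector": list(vector),
        "k": 160 if long_query else 120,                 # knn_text_k_long / knn_text_k
        "num_candidates": 500 if long_query else 400,
        "boost": 20.0 * (1.4 if long_query else 1.0),    # long queries get a 1.4x boost
    }

def build_ann_clause(plan):
    # Approximate kNN: an Elasticsearch-style `knn` query.
    return {"knn": {
        "field": plan["field"], "query_vector": plan["vector"],
        "k": plan["k"], "num_candidates": plan["num_candidates"],
        "boost": plan["boost"],
    }}

def build_exact_rescore_clause(plan):
    # Exact rescore: script_score over the same field, reusing the same boost.
    return {"script_score": {
        "query": {"exists": {"field": plan["field"]}},
        "script": {
            "source": "cosineSimilarity(params.vector, '" + plan["field"] + "') * params.boost",
            "params": {"vector": plan["vector"], "boost": plan["boost"]},
        },
    }}

plan = make_text_knn_plan([0.1, 0.2, 0.3], long_query=True)
ann = build_ann_clause(plan)
exact = build_exact_rescore_clause(plan)
# Because both clauses read the same plan, the boosts agree by construction.
assert ann["knn"]["boost"] == exact["script_score"]["script"]["params"]["boost"]
```

Deriving both clauses from one plan is what makes assertions like `ann_clause["knn"]["boost"] == qb.knn_text_boost * 1.4 == exact script params boost` hold without duplicated tuning constants.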
tests/test_llm_enrichment_batch_fill.py
@@ -10,8 +10,14 @@ from indexer.document_transformer import SPUDocumentTransformer @@ -10,8 +10,14 @@ from indexer.document_transformer import SPUDocumentTransformer
10 def test_fill_llm_attributes_batch_uses_product_enrich_helper(monkeypatch): 10 def test_fill_llm_attributes_batch_uses_product_enrich_helper(monkeypatch):
11 seen_calls: List[Dict[str, Any]] = [] 11 seen_calls: List[Dict[str, Any]] = []
12 12
13 - def _fake_build_index_content_fields(items, tenant_id=None):  
14 - seen_calls.append({"n": len(items), "tenant_id": tenant_id}) 13 + def _fake_build_index_content_fields(items, tenant_id=None, category_taxonomy_profile=None):
  14 + seen_calls.append(
  15 + {
  16 + "n": len(items),
  17 + "tenant_id": tenant_id,
  18 + "category_taxonomy_profile": category_taxonomy_profile,
  19 + }
  20 + )
15 return [ 21 return [
16 { 22 {
17 "id": item["id"], 23 "id": item["id"],
@@ -19,10 +25,13 @@ def test_fill_llm_attributes_batch_uses_product_enrich_helper(monkeypatch): @@ -19,10 +25,13 @@ def test_fill_llm_attributes_batch_uses_product_enrich_helper(monkeypatch):
19 "zh": [f"zh-anchor-{item['id']}"], 25 "zh": [f"zh-anchor-{item['id']}"],
20 "en": [f"en-anchor-{item['id']}"], 26 "en": [f"en-anchor-{item['id']}"],
21 }, 27 },
22 - "tags": {"zh": ["t1", "t2"], "en": ["t1", "t2"]}, 28 + "enriched_tags": {"zh": ["t1", "t2"], "en": ["t1", "t2"]},
23 "enriched_attributes": [ 29 "enriched_attributes": [
24 {"name": "tags", "value": {"zh": ["t1"], "en": ["t1"]}}, 30 {"name": "tags", "value": {"zh": ["t1"], "en": ["t1"]}},
25 ], 31 ],
  32 + "enriched_taxonomy_attributes": [
  33 + {"name": "Product Type", "value": {"zh": ["连衣裙"], "en": ["dress"]}},
  34 + ],
26 } 35 }
27 for item in items 36 for item in items
28 ] 37 ]
@@ -50,10 +59,14 @@ def test_fill_llm_attributes_batch_uses_product_enrich_helper(monkeypatch): @@ -50,10 +59,14 @@ def test_fill_llm_attributes_batch_uses_product_enrich_helper(monkeypatch):
50 59
51 transformer.fill_llm_attributes_batch(docs, rows) 60 transformer.fill_llm_attributes_batch(docs, rows)
52 61
53 - assert seen_calls == [{"n": 45, "tenant_id": "162"}] 62 + assert seen_calls == [{"n": 45, "tenant_id": "162", "category_taxonomy_profile": "apparel"}]
54 63
55 assert docs[0]["qanchors"]["zh"] == ["zh-anchor-0"] 64 assert docs[0]["qanchors"]["zh"] == ["zh-anchor-0"]
56 assert docs[0]["qanchors"]["en"] == ["en-anchor-0"] 65 assert docs[0]["qanchors"]["en"] == ["en-anchor-0"]
57 - assert docs[0]["tags"]["zh"] == ["t1", "t2"]  
58 - assert docs[0]["tags"]["en"] == ["t1", "t2"] 66 + assert docs[0]["enriched_tags"]["zh"] == ["t1", "t2"]
  67 + assert docs[0]["enriched_tags"]["en"] == ["t1", "t2"]
59 assert {"name": "tags", "value": {"zh": ["t1"], "en": ["t1"]}} in docs[0]["enriched_attributes"] 68 assert {"name": "tags", "value": {"zh": ["t1"], "en": ["t1"]}} in docs[0]["enriched_attributes"]
  69 + assert {
  70 + "name": "Product Type",
  71 + "value": {"zh": ["连衣裙"], "en": ["dress"]},
  72 + } in docs[0]["enriched_taxonomy_attributes"]
tests/test_process_products_batching.py
@@ -13,7 +13,15 @@ def test_analyze_products_caps_batch_size_to_20(monkeypatch): @@ -13,7 +13,15 @@ def test_analyze_products_caps_batch_size_to_20(monkeypatch):
13 monkeypatch.setattr(process_products, "API_KEY", "fake-key") 13 monkeypatch.setattr(process_products, "API_KEY", "fake-key")
14 seen_batch_sizes: List[int] = [] 14 seen_batch_sizes: List[int] = []
15 15
16 - def _fake_process_batch(batch_data: List[Dict[str, str]], batch_num: int, target_lang: str = "zh"): 16 + def _fake_process_batch(
  17 + batch_data: List[Dict[str, str]],
  18 + batch_num: int,
  19 + target_lang: str = "zh",
  20 + analysis_kind: str = "content",
  21 + category_taxonomy_profile=None,
  22 + ):
  23 + assert analysis_kind == "content"
  24 + assert category_taxonomy_profile is None
17 seen_batch_sizes.append(len(batch_data)) 25 seen_batch_sizes.append(len(batch_data))
18 return [ 26 return [
19 { 27 {
@@ -35,7 +43,7 @@ def test_analyze_products_caps_batch_size_to_20(monkeypatch): @@ -35,7 +43,7 @@ def test_analyze_products_caps_batch_size_to_20(monkeypatch):
35 ] 43 ]
36 44
37 monkeypatch.setattr(process_products, "process_batch", _fake_process_batch) 45 monkeypatch.setattr(process_products, "process_batch", _fake_process_batch)
38 - monkeypatch.setattr(process_products, "_set_cached_anchor_result", lambda *args, **kwargs: None) 46 + monkeypatch.setattr(process_products, "_set_cached_analysis_result", lambda *args, **kwargs: None)
39 47
40 out = process_products.analyze_products( 48 out = process_products.analyze_products(
41 products=_mk_products(45), 49 products=_mk_products(45),
@@ -53,7 +61,15 @@ def test_analyze_products_uses_min_batch_size_1(monkeypatch): @@ -53,7 +61,15 @@ def test_analyze_products_uses_min_batch_size_1(monkeypatch):
53 monkeypatch.setattr(process_products, "API_KEY", "fake-key") 61 monkeypatch.setattr(process_products, "API_KEY", "fake-key")
54 seen_batch_sizes: List[int] = [] 62 seen_batch_sizes: List[int] = []
55 63
56 - def _fake_process_batch(batch_data: List[Dict[str, str]], batch_num: int, target_lang: str = "zh"): 64 + def _fake_process_batch(
  65 + batch_data: List[Dict[str, str]],
  66 + batch_num: int,
  67 + target_lang: str = "zh",
  68 + analysis_kind: str = "content",
  69 + category_taxonomy_profile=None,
  70 + ):
  71 + assert analysis_kind == "content"
  72 + assert category_taxonomy_profile is None
57 seen_batch_sizes.append(len(batch_data)) 73 seen_batch_sizes.append(len(batch_data))
58 return [ 74 return [
59 { 75 {
@@ -75,7 +91,7 @@ def test_analyze_products_uses_min_batch_size_1(monkeypatch): @@ -75,7 +91,7 @@ def test_analyze_products_uses_min_batch_size_1(monkeypatch):
75 ] 91 ]
76 92
77 monkeypatch.setattr(process_products, "process_batch", _fake_process_batch) 93 monkeypatch.setattr(process_products, "process_batch", _fake_process_batch)
78 - monkeypatch.setattr(process_products, "_set_cached_anchor_result", lambda *args, **kwargs: None) 94 + monkeypatch.setattr(process_products, "_set_cached_analysis_result", lambda *args, **kwargs: None)
79 95
80 out = process_products.analyze_products( 96 out = process_products.analyze_products(
81 products=_mk_products(3), 97 products=_mk_products(3),
tests/test_product_enrich_partial_mode.py
@@ -74,6 +74,28 @@ def test_create_prompt_splits_shared_context_and_localized_tail(): @@ -74,6 +74,28 @@ def test_create_prompt_splits_shared_context_and_localized_tail():
74 assert prefix_en.startswith("| No. | Product title | Category path |") 74 assert prefix_en.startswith("| No. | Product title | Category path |")
75 75
76 76
  77 +def test_create_prompt_supports_taxonomy_analysis_kind():
  78 + products = [{"id": "1", "title": "linen dress"}]
  79 +
  80 + shared_zh, user_zh, prefix_zh = product_enrich.create_prompt(
  81 + products,
  82 + target_lang="zh",
  83 + analysis_kind="taxonomy",
  84 + )
  85 + shared_fr, user_fr, prefix_fr = product_enrich.create_prompt(
  86 + products,
  87 + target_lang="fr",
  88 + analysis_kind="taxonomy",
  89 + )
  90 +
  91 + assert "apparel attribute taxonomy" in shared_zh
  92 + assert "1. linen dress" in shared_zh
  93 + assert "Language: Chinese" in user_zh
  94 + assert "Language: French" in user_fr
  95 + assert prefix_zh.startswith("| 序号 | 品类 | 目标性别 |")
  96 + assert prefix_fr.startswith("| No. | Product Type | Target Gender |")
  97 +
  98 +
77 def test_call_llm_logs_shared_context_once_and_verbose_contains_full_requests(): 99 def test_call_llm_logs_shared_context_once_and_verbose_contains_full_requests():
78 payloads = [] 100 payloads = []
79 response_bodies = [ 101 response_bodies = [
@@ -228,6 +250,38 @@ def test_process_batch_reads_result_and_validates_expected_fields(): @@ -228,6 +250,38 @@ def test_process_batch_reads_result_and_validates_expected_fields():
228 assert row["anchor_text"] == "法式收腰连衣裙" 250 assert row["anchor_text"] == "法式收腰连衣裙"
229 251
230 252
  253 +def test_process_batch_reads_taxonomy_result_with_schema_specific_fields():
  254 + merged_markdown = """| 序号 | 品类 | 目标性别 | 年龄段 | 适用季节 | 版型 | 廓形 | 领型 | 袖长类型 | 袖型 | 肩带设计 | 腰型 | 裤型 | 裙型 | 长度类型 | 闭合方式 | 设计细节 | 面料 | 成分 | 面料特性 | 服装特征 | 功能 | 主颜色 | 色系 | 印花 / 图案 | 适用场景 | 风格 |
  255 +|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
  256 +| 1 | 连衣裙 | 女 | 成人 | 春季,夏季 | 修身 | A字 | V领 | 无袖 | | 细肩带 | 高腰 | | A字裙 | 中长款 | 拉链 | 褶皱 | 梭织 | 聚酯纤维,氨纶 | 轻薄,透气 | 有内衬 | 易打理 | 酒红色 | 红色 | 纯色 | 约会,度假 | 浪漫 |
  257 +"""
  258 +
  259 + with mock.patch.object(
  260 + product_enrich,
  261 + "call_llm",
  262 + return_value=(merged_markdown, json.dumps({"choices": [{"message": {"content": "stub"}}]})),
  263 + ):
  264 + results = product_enrich.process_batch(
  265 + [{"id": "sku-1", "title": "dress"}],
  266 + batch_num=1,
  267 + target_lang="zh",
  268 + analysis_kind="taxonomy",
  269 + )
  270 +
  271 + assert len(results) == 1
  272 + row = results[0]
  273 + assert row["id"] == "sku-1"
  274 + assert row["lang"] == "zh"
  275 + assert row["title_input"] == "dress"
  276 + assert row["product_type"] == "连衣裙"
  277 + assert row["target_gender"] == "女"
  278 + assert row["age_group"] == "成人"
  279 + assert row["sleeve_length_type"] == "无袖"
  280 + assert row["material_composition"] == "聚酯纤维,氨纶"
  281 + assert row["occasion_end_use"] == "约会,度假"
  282 + assert row["style_aesthetic"] == "浪漫"
  283 +
  284 +
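The taxonomy test above feeds a one-row pipe-delimited markdown table back through `process_batch` and expects schema-specific fields. A minimal sketch of the kind of table parsing involved; the header-to-field mapping here is illustrative only, not the project's real schema:

```python
# Sketch: parse a markdown pipe table into per-row dicts, renaming localized
# headers to schema field keys. HEADER_TO_FIELD is a hypothetical mapping.
HEADER_TO_FIELD = {"序号": "row_no", "品类": "product_type", "目标性别": "target_gender"}

def parse_markdown_table(markdown: str):
    lines = [l for l in markdown.strip().splitlines() if l.strip().startswith("|")]
    headers = [c.strip() for c in lines[0].strip().strip("|").split("|")]
    rows = []
    for line in lines[2:]:  # skip the header row and the |----| separator row
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        rows.append({HEADER_TO_FIELD.get(h, h): v for h, v in zip(headers, cells)})
    return rows

rows = parse_markdown_table(
    "| 序号 | 品类 | 目标性别 |\n|----|----|----|\n| 1 | 连衣裙 | 女 |"
)
assert rows == [{"row_no": "1", "product_type": "连衣裙", "target_gender": "女"}]
```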
231 def test_analyze_products_uses_product_level_cache_across_batch_requests(): 285 def test_analyze_products_uses_product_level_cache_across_batch_requests():
232 cache_store = {} 286 cache_store = {}
233 process_calls = [] 287 process_calls = []
@@ -241,13 +295,36 @@ def test_analyze_products_uses_product_level_cache_across_batch_requests(): @@ -241,13 +295,36 @@ def test_analyze_products_uses_product_level_cache_across_batch_requests():
241 product.get("image_url", ""), 295 product.get("image_url", ""),
242 ) 296 )
243 297
244 - def fake_get_cached_anchor_result(product, target_lang): 298 + def fake_get_cached_analysis_result(
  299 + product,
  300 + target_lang,
  301 + analysis_kind="content",
  302 + category_taxonomy_profile=None,
  303 + ):
  304 + assert analysis_kind == "content"
  305 + assert category_taxonomy_profile is None
245 return cache_store.get(_cache_key(product, target_lang)) 306 return cache_store.get(_cache_key(product, target_lang))
246 307
247 - def fake_set_cached_anchor_result(product, target_lang, result): 308 + def fake_set_cached_analysis_result(
  309 + product,
  310 + target_lang,
  311 + result,
  312 + analysis_kind="content",
  313 + category_taxonomy_profile=None,
  314 + ):
  315 + assert analysis_kind == "content"
  316 + assert category_taxonomy_profile is None
248 cache_store[_cache_key(product, target_lang)] = result 317 cache_store[_cache_key(product, target_lang)] = result
249 318
250 - def fake_process_batch(batch_data, batch_num, target_lang="zh"): 319 + def fake_process_batch(
  320 + batch_data,
  321 + batch_num,
  322 + target_lang="zh",
  323 + analysis_kind="content",
  324 + category_taxonomy_profile=None,
  325 + ):
  326 + assert analysis_kind == "content"
  327 + assert category_taxonomy_profile is None
251 process_calls.append( 328 process_calls.append(
252 { 329 {
253 "batch_num": batch_num, 330 "batch_num": batch_num,
@@ -281,12 +358,12 @@ def test_analyze_products_uses_product_level_cache_across_batch_requests(): @@ -281,12 +358,12 @@ def test_analyze_products_uses_product_level_cache_across_batch_requests():
281 358
282 with mock.patch.object(product_enrich, "API_KEY", "fake-key"), mock.patch.object( 359 with mock.patch.object(product_enrich, "API_KEY", "fake-key"), mock.patch.object(
283 product_enrich, 360 product_enrich,
284 - "_get_cached_anchor_result",  
285 - side_effect=fake_get_cached_anchor_result, 361 + "_get_cached_analysis_result",
  362 + side_effect=fake_get_cached_analysis_result,
286 ), mock.patch.object( 363 ), mock.patch.object(
287 product_enrich, 364 product_enrich,
288 - "_set_cached_anchor_result",  
289 - side_effect=fake_set_cached_anchor_result, 365 + "_set_cached_analysis_result",
  366 + side_effect=fake_set_cached_analysis_result,
290 ), mock.patch.object( 367 ), mock.patch.object(
291 product_enrich, 368 product_enrich,
292 "process_batch", 369 "process_batch",
@@ -342,11 +419,12 @@ def test_analyze_products_reuses_cached_content_with_current_product_identity(): @@ -342,11 +419,12 @@ def test_analyze_products_reuses_cached_content_with_current_product_identity():
342 419
343 with mock.patch.object(product_enrich, "API_KEY", "fake-key"), mock.patch.object( 420 with mock.patch.object(product_enrich, "API_KEY", "fake-key"), mock.patch.object(
344 product_enrich, 421 product_enrich,
345 - "_get_cached_anchor_result",  
346 - wraps=lambda product, target_lang: product_enrich._normalize_analysis_result( 422 + "_get_cached_analysis_result",
  423 + wraps=lambda product, target_lang, analysis_kind="content", category_taxonomy_profile=None: product_enrich._normalize_analysis_result(
347 cached_result, 424 cached_result,
348 product=product, 425 product=product,
349 target_lang=target_lang, 426 target_lang=target_lang,
  427 + schema=product_enrich._get_analysis_schema("content"),
350 ), 428 ),
351 ), mock.patch.object( 429 ), mock.patch.object(
352 product_enrich, 430 product_enrich,
@@ -379,7 +457,49 @@ def test_analyze_products_reuses_cached_content_with_current_product_identity(): @@ -379,7 +457,49 @@ def test_analyze_products_reuses_cached_content_with_current_product_identity():
379 457
380 458
381 def test_build_index_content_fields_maps_internal_tags_to_enriched_tags_output(): 459 def test_build_index_content_fields_maps_internal_tags_to_enriched_tags_output():
382 - def fake_analyze_products(products, target_lang="zh", batch_size=None, tenant_id=None): 460 + def fake_analyze_products(
  461 + products,
  462 + target_lang="zh",
  463 + batch_size=None,
  464 + tenant_id=None,
  465 + analysis_kind="content",
  466 + category_taxonomy_profile=None,
  467 + ):
  468 + if analysis_kind == "taxonomy":
  469 + assert category_taxonomy_profile == "apparel"
  470 + return [
  471 + {
  472 + "id": products[0]["id"],
  473 + "lang": target_lang,
  474 + "title_input": products[0]["title"],
  475 + "product_type": f"{target_lang}-dress",
  476 + "target_gender": f"{target_lang}-women",
  477 + "age_group": "",
  478 + "season": f"{target_lang}-summer",
  479 + "fit": "",
  480 + "silhouette": "",
  481 + "neckline": "",
  482 + "sleeve_length_type": "",
  483 + "sleeve_style": "",
  484 + "strap_type": "",
  485 + "rise_waistline": "",
  486 + "leg_shape": "",
  487 + "skirt_shape": "",
  488 + "length_type": "",
  489 + "closure_type": "",
  490 + "design_details": "",
  491 + "fabric": "",
  492 + "material_composition": "",
  493 + "fabric_properties": "",
  494 + "clothing_features": "",
  495 + "functional_benefits": "",
  496 + "color": "",
  497 + "color_family": "",
  498 + "print_pattern": "",
  499 + "occasion_end_use": "",
  500 + "style_aesthetic": "",
  501 + }
  502 + ]
383 return [ 503 return [
384 { 504 {
385 "id": products[0]["id"], 505 "id": products[0]["id"],
@@ -423,8 +543,103 @@ def test_build_index_content_fields_maps_internal_tags_to_enriched_tags_output() @@ -423,8 +543,103 @@ def test_build_index_content_fields_maps_internal_tags_to_enriched_tags_output()
423 }, 543 },
424 {"name": "target_audience", "value": {"zh": ["zh-audience"], "en": ["en-audience"]}}, 544 {"name": "target_audience", "value": {"zh": ["zh-audience"], "en": ["en-audience"]}},
425 ], 545 ],
  546 + "enriched_taxonomy_attributes": [
  547 + {
  548 + "name": "Product Type",
  549 + "value": {"zh": ["zh-dress"], "en": ["en-dress"]},
  550 + },
  551 + {
  552 + "name": "Target Gender",
  553 + "value": {"zh": ["zh-women"], "en": ["en-women"]},
  554 + },
  555 + {
  556 + "name": "Season",
  557 + "value": {"zh": ["zh-summer"], "en": ["en-summer"]},
  558 + },
  559 + ],
426 } 560 }
427 ] 561 ]
  562 +def test_build_index_content_fields_non_apparel_taxonomy_returns_en_only():
  563 + seen_calls = []
  564 +
  565 + def fake_analyze_products(
  566 + products,
  567 + target_lang="zh",
  568 + batch_size=None,
  569 + tenant_id=None,
  570 + analysis_kind="content",
  571 + category_taxonomy_profile=None,
  572 + ):
  573 + seen_calls.append((analysis_kind, target_lang, category_taxonomy_profile, tuple(p["id"] for p in products)))
  574 + if analysis_kind == "taxonomy":
  575 + assert category_taxonomy_profile == "toys"
  576 + assert target_lang == "en"
  577 + return [
  578 + {
  579 + "id": products[0]["id"],
  580 + "lang": "en",
  581 + "title_input": products[0]["title"],
  582 + "product_type": "doll set",
  583 + "age_group": "kids",
  584 + "character_theme": "",
  585 + "material": "",
  586 + "power_source": "",
  587 + "interactive_features": "",
  588 + "educational_play_value": "",
  589 + "piece_count_size": "",
  590 + "color": "",
  591 + "use_scenario": "",
  592 + }
  593 + ]
  594 +
  595 + return [
  596 + {
  597 + "id": product["id"],
  598 + "lang": target_lang,
  599 + "title_input": product["title"],
  600 + "title": product["title"],
  601 + "category_path": "",
  602 + "tags": f"{target_lang}-tag",
  603 + "target_audience": "",
  604 + "usage_scene": "",
  605 + "season": "",
  606 + "key_attributes": "",
  607 + "material": "",
  608 + "features": "",
  609 + "anchor_text": f"{target_lang}-anchor",
  610 + }
  611 + for product in products
  612 + ]
  613 +
  614 + with mock.patch.object(product_enrich, "analyze_products", side_effect=fake_analyze_products):
  615 + result = product_enrich.build_index_content_fields(
  616 + items=[{"spu_id": "2", "title": "toy"}],
  617 + tenant_id="170",
  618 + category_taxonomy_profile="toys",
  619 + )
  620 +
  621 + assert result == [
  622 + {
  623 + "id": "2",
  624 + "qanchors": {"zh": ["zh-anchor"], "en": ["en-anchor"]},
  625 + "enriched_tags": {"zh": ["zh-tag"], "en": ["en-tag"]},
  626 + "enriched_attributes": [
  627 + {
  628 + "name": "enriched_tags",
  629 + "value": {
  630 + "zh": ["zh-tag"],
  631 + "en": ["en-tag"],
  632 + },
  633 + }
  634 + ],
  635 + "enriched_taxonomy_attributes": [
  636 + {"name": "Product Type", "value": {"en": ["doll set"]}},
  637 + {"name": "Age Group", "value": {"en": ["kids"]}},
  638 + ],
  639 + }
  640 + ]
  641 + assert ("taxonomy", "zh", "toys", ("2",)) not in seen_calls
  642 + assert ("taxonomy", "en", "toys", ("2",)) in seen_calls
428 643
429 644
430 def test_anchor_cache_key_depends_on_product_input_not_identifiers(): 645 def test_anchor_cache_key_depends_on_product_input_not_identifiers():
@@ -461,6 +676,40 @@ def test_anchor_cache_key_depends_on_product_input_not_identifiers(): @@ -461,6 +676,40 @@ def test_anchor_cache_key_depends_on_product_input_not_identifiers():
461 assert key_a != key_c 676 assert key_a != key_c
462 677
463 678
  679 +def test_analysis_cache_key_isolated_by_analysis_kind():
  680 + product = {
  681 + "id": "1",
  682 + "title": "dress",
  683 + "brief": "soft cotton",
  684 + "description": "summer dress",
  685 + }
  686 +
  687 + content_key = product_enrich._make_analysis_cache_key(product, "zh", "content")
  688 + taxonomy_key = product_enrich._make_analysis_cache_key(product, "zh", "taxonomy")
  689 +
  690 + assert content_key != taxonomy_key
  691 +
  692 +
  693 +def test_analysis_cache_key_changes_when_prompt_contract_changes():
  694 + product = {
  695 + "id": "1",
  696 + "title": "dress",
  697 + "brief": "soft cotton",
  698 + "description": "summer dress",
  699 + }
  700 +
  701 + original_key = product_enrich._make_analysis_cache_key(product, "zh", "taxonomy")
  702 +
  703 + with mock.patch.object(
  704 + product_enrich,
  705 + "USER_INSTRUCTION_TEMPLATE",
  706 + "Please return JSON only. Language: {language}",
  707 + ):
  708 + changed_key = product_enrich._make_analysis_cache_key(product, "zh", "taxonomy")
  709 +
  710 + assert original_key != changed_key
  711 +
  712 +
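The two cache-key tests above fix the invariants: keys must be isolated by analysis kind, and must change when the prompt contract changes. A sketch of a key function satisfying both, hashing the analysis kind and the prompt template alongside product content; the constant and function names are stand-ins, not the project's real `_make_analysis_cache_key`:

```python
import hashlib
import json

# Stand-in for the real prompt template constant.
USER_INSTRUCTION_TEMPLATE = "Analyze the products. Language: {language}"

def make_analysis_cache_key(product, target_lang, analysis_kind):
    # Hash the analysis kind and the prompt contract together with the product
    # content, so (a) "content" and "taxonomy" results never collide, and
    # (b) editing the prompt template invalidates previously cached results.
    payload = json.dumps(
        {
            "title": product.get("title", ""),
            "brief": product.get("brief", ""),
            "description": product.get("description", ""),
            "lang": target_lang,
            "kind": analysis_kind,
            "prompt": USER_INSTRUCTION_TEMPLATE,
        },
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

p = {"id": "1", "title": "dress", "brief": "soft cotton", "description": "summer dress"}
assert make_analysis_cache_key(p, "zh", "content") != make_analysis_cache_key(p, "zh", "taxonomy")
# Identifiers are deliberately excluded: same content, different id => same key.
assert make_analysis_cache_key({**p, "id": "2"}, "zh", "content") == make_analysis_cache_key(p, "zh", "content")
```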
464 def test_build_prompt_input_text_appends_brief_and_description_for_short_title(): 713 def test_build_prompt_input_text_appends_brief_and_description_for_short_title():
465 product = { 714 product = {
466 "title": "T恤", 715 "title": "T恤",
tests/test_rerank_client.py
1 from math import isclose 1 from math import isclose
2 2
3 -from config.schema import RerankFusionConfig  
4 -from search.rerank_client import fuse_scores_and_resort, run_lightweight_rerank 3 +from config.schema import CoarseRankFusionConfig, RerankFusionConfig
  4 +from search.rerank_client import coarse_resort_hits, fuse_scores_and_resort, run_lightweight_rerank
5 5
6 6
7 def test_fuse_scores_and_resort_aggregates_text_components_and_keeps_rerank_primary(): 7 def test_fuse_scores_and_resort_aggregates_text_components_and_keeps_rerank_primary():
@@ -172,6 +172,57 @@ def test_fuse_scores_and_resort_uses_max_of_text_and_image_knn_scores(): @@ -172,6 +172,57 @@ def test_fuse_scores_and_resort_uses_max_of_text_and_image_knn_scores():
172 assert isclose(debug[0]["image_knn_score"], 0.7, rel_tol=1e-9) 172 assert isclose(debug[0]["image_knn_score"], 0.7, rel_tol=1e-9)
173 173
174 174
  175 +def test_fuse_scores_and_resort_prefers_exact_knn_scores_over_ann_scores():
  176 + hits = [
  177 + {
  178 + "_id": "exact-mm-hit",
  179 + "_score": 1.0,
  180 + "matched_queries": {
  181 + "base_query": 1.5,
  182 + "knn_query": 0.2,
  183 + "image_knn_query": 0.7,
  184 + "exact_text_knn_query": 0.9,
  185 + "exact_image_knn_query": 0.1,
  186 + },
  187 + }
  188 + ]
  189 +
  190 + debug = fuse_scores_and_resort(hits, [0.8], debug=True)
  191 +
  192 + assert isclose(hits[0]["_knn_score"], 0.9, rel_tol=1e-9)
  193 + assert isclose(debug[0]["text_knn_score"], 0.9, rel_tol=1e-9)
  194 + assert isclose(debug[0]["image_knn_score"], 0.1, rel_tol=1e-9)
  195 + assert isclose(debug[0]["exact_text_knn_score"], 0.9, rel_tol=1e-9)
  196 + assert isclose(debug[0]["exact_image_knn_score"], 0.1, rel_tol=1e-9)
  197 + assert isclose(debug[0]["approx_text_knn_score"], 0.2, rel_tol=1e-9)
  198 + assert isclose(debug[0]["approx_image_knn_score"], 0.7, rel_tol=1e-9)
  199 + assert debug[0]["text_knn_source"] == "exact_text_knn_query"
  200 + assert debug[0]["image_knn_source"] == "exact_image_knn_query"
  201 +
  202 +
  203 +def test_fuse_scores_and_resort_falls_back_to_ann_when_exact_knn_missing():
  204 + hits = [
  205 + {
  206 + "_id": "ann-only-hit",
  207 + "_score": 1.0,
  208 + "matched_queries": {
  209 + "base_query": 1.5,
  210 + "knn_query": 0.4,
  211 + "image_knn_query": 0.5,
  212 + },
  213 + }
  214 + ]
  215 +
  216 + debug = fuse_scores_and_resort(hits, [0.8], debug=True)
  217 +
  218 + assert isclose(debug[0]["text_knn_score"], 0.4, rel_tol=1e-9)
  219 + assert isclose(debug[0]["image_knn_score"], 0.5, rel_tol=1e-9)
  220 + assert isclose(debug[0]["approx_text_knn_score"], 0.4, rel_tol=1e-9)
  221 + assert isclose(debug[0]["approx_image_knn_score"], 0.5, rel_tol=1e-9)
  222 + assert debug[0]["text_knn_source"] == "knn_query"
  223 + assert debug[0]["image_knn_source"] == "image_knn_query"
  224 +
  225 +
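The two fusion tests above encode one selection rule: prefer the exact rescore score when present, fall back to the ANN score otherwise, and record which named query won. A compact sketch of that rule (the `matched_queries` key names mirror the tests; the function itself is illustrative):

```python
def pick_knn_score(matched_queries, exact_name, ann_name):
    # Prefer the exact (script_score rescore) signal; fall back to the ANN one.
    if exact_name in matched_queries:
        return matched_queries[exact_name], exact_name
    return matched_queries.get(ann_name, 0.0), ann_name

mq = {"knn_query": 0.2, "exact_text_knn_query": 0.9, "image_knn_query": 0.7}

# Exact text score present: it wins over the ANN knn_query score.
score, source = pick_knn_score(mq, "exact_text_knn_query", "knn_query")
assert (score, source) == (0.9, "exact_text_knn_query")

# No exact image score: fall back to the ANN image_knn_query score.
score, source = pick_knn_score(mq, "exact_image_knn_query", "image_knn_query")
assert (score, source) == (0.7, "image_knn_query")
```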
175 def test_fuse_scores_and_resort_applies_knn_dismax_weights_and_tie_breaker(): 226 def test_fuse_scores_and_resort_applies_knn_dismax_weights_and_tie_breaker():
176 hits = [ 227 hits = [
177 { 228 {
@@ -206,6 +257,96 @@ def test_fuse_scores_and_resort_applies_knn_dismax_weights_and_tie_breaker(): @@ -206,6 +257,96 @@ def test_fuse_scores_and_resort_applies_knn_dismax_weights_and_tie_breaker():
206 assert isclose(debug[0]["knn_support_score"], 0.5, rel_tol=1e-9) 257 assert isclose(debug[0]["knn_support_score"], 0.5, rel_tol=1e-9)
207 258
208 259
  260 +def test_fuse_scores_and_resort_can_add_weighted_text_and_image_knn_factors():
  261 + hits = [
  262 + {
  263 + "_id": "a",
  264 + "_score": 1.0,
  265 + "matched_queries": {
  266 + "base_query": 2.0,
  267 + "knn_query": 0.4,
  268 + "image_knn_query": 0.5,
  269 + },
  270 + }
  271 + ]
  272 + fusion = RerankFusionConfig(
  273 + rerank_bias=0.0,
  274 + rerank_exponent=1.0,
  275 + text_bias=0.0,
  276 + text_exponent=1.0,
  277 + knn_text_weight=2.0,
  278 + knn_image_weight=1.0,
  279 + knn_tie_breaker=0.25,
  280 + knn_bias=0.1,
  281 + knn_exponent=1.0,
  282 + knn_text_exponent=2.0,
  283 + knn_image_exponent=3.0,
  284 + )
  285 +
  286 + debug = fuse_scores_and_resort(hits, [0.8], fusion=fusion, debug=True)
  287 +
  288 + weighted_text_knn = 0.8
  289 + weighted_image_knn = 0.5
  290 + expected_knn = weighted_text_knn + 0.25 * weighted_image_knn
  291 + expected_fused = (
  292 + 0.8
  293 + * 2.0
  294 + * (expected_knn + 0.1)
  295 + * ((weighted_text_knn + 0.1) ** 2.0)
  296 + * ((weighted_image_knn + 0.1) ** 3.0)
  297 + )
  298 +
  299 + assert isclose(hits[0]["_fused_score"], expected_fused, rel_tol=1e-9)
  300 + assert isclose(debug[0]["text_knn_factor"], (weighted_text_knn + 0.1) ** 2.0, rel_tol=1e-9)
  301 + assert isclose(debug[0]["image_knn_factor"], (weighted_image_knn + 0.1) ** 3.0, rel_tol=1e-9)
  302 + assert "weighted_text_knn_score=" in debug[0]["fusion_summary"]
  303 + assert "weighted_image_knn_score=" in debug[0]["fusion_summary"]
  304 +
  305 +
  306 +def test_coarse_resort_hits_can_add_weighted_text_and_image_knn_factors():
  307 + hits = [
  308 + {
  309 + "_id": "coarse-a",
  310 + "_score": 1.0,
  311 + "matched_queries": {
  312 + "base_query": 2.0,
  313 + "knn_query": 0.4,
  314 + "image_knn_query": 0.5,
  315 + },
  316 + }
  317 + ]
  318 + fusion = CoarseRankFusionConfig(
  319 + es_bias=0.0,
  320 + es_exponent=1.0,
  321 + text_bias=0.0,
  322 + text_exponent=1.0,
  323 + knn_text_weight=2.0,
  324 + knn_image_weight=1.0,
  325 + knn_tie_breaker=0.25,
  326 + knn_bias=0.1,
  327 + knn_exponent=1.0,
  328 + knn_text_exponent=2.0,
  329 + knn_image_exponent=3.0,
  330 + )
  331 +
  332 + debug = coarse_resort_hits(hits, fusion=fusion, debug=True)
  333 +
  334 + weighted_text_knn = 0.8
  335 + weighted_image_knn = 0.5
  336 + expected_knn = weighted_text_knn + 0.25 * weighted_image_knn
  337 + expected_coarse = (
  338 + 1.0
  339 + * 2.0
  340 + * (expected_knn + 0.1)
  341 + * ((weighted_text_knn + 0.1) ** 2.0)
  342 + * ((weighted_image_knn + 0.1) ** 3.0)
  343 + )
  344 +
  345 + assert isclose(hits[0]["_coarse_score"], expected_coarse, rel_tol=1e-9)
  346 + assert isclose(debug[0]["coarse_text_knn_factor"], (weighted_text_knn + 0.1) ** 2.0, rel_tol=1e-9)
  347 + assert isclose(debug[0]["coarse_image_knn_factor"], (weighted_image_knn + 0.1) ** 3.0, rel_tol=1e-9)
  348 +
  349 +
209 def test_run_lightweight_rerank_sorts_by_fused_stage_score(monkeypatch): 350 def test_run_lightweight_rerank_sorts_by_fused_stage_score(monkeypatch):
210 hits = [ 351 hits = [
211 { 352 {
tests/test_search_rerank_window.py
1 from __future__ import annotations 1 from __future__ import annotations
2 2
3 -from dataclasses import dataclass 3 +from dataclasses import dataclass, field
4 from pathlib import Path 4 from pathlib import Path
5 from types import SimpleNamespace 5 from types import SimpleNamespace
6 from typing import Any, Dict, List 6 from typing import Any, Dict, List
@@ -30,7 +30,10 @@ class _FakeParsedQuery: @@ -30,7 +30,10 @@ class _FakeParsedQuery:
30 rewritten_query: str 30 rewritten_query: str
31 detected_language: str = "en" 31 detected_language: str = "en"
32 translations: Dict[str, str] = None 32 translations: Dict[str, str] = None
  33 + keywords_queries: Dict[str, str] = field(default_factory=dict)
33 query_vector: Any = None 34 query_vector: Any = None
  35 + image_query_vector: Any = None
  36 + query_tokens: List[str] = field(default_factory=list)
34 style_intent_profile: Any = None 37 style_intent_profile: Any = None
35 38
36 def text_for_rerank(self) -> str: 39 def text_for_rerank(self) -> str:
@@ -89,6 +92,15 @@ class _FakeQueryParser: @@ -89,6 +92,15 @@ class _FakeQueryParser:
89 92
90 93
91 class _FakeQueryBuilder: 94 class _FakeQueryBuilder:
  95 + knn_text_k = 120
  96 + knn_text_k_long = 160
  97 + knn_text_num_candidates = 400
  98 + knn_text_num_candidates_long = 500
  99 + knn_text_boost = 20.0
  100 + knn_image_k = 120
  101 + knn_image_num_candidates = 400
  102 + knn_image_boost = 20.0
  103 +
92 def build_query(self, **kwargs): 104 def build_query(self, **kwargs):
93 return { 105 return {
94 "query": {"match_all": {}}, 106 "query": {"match_all": {}},
@@ -185,13 +197,24 @@ class _FakeESClient: @@ -185,13 +197,24 @@ class _FakeESClient:
185 } 197 }
186 198
187 199
188 -def _build_search_config(*, rerank_enabled: bool = True, rerank_window: int = 384): 200 +def _build_search_config(
  201 + *,
  202 + rerank_enabled: bool = True,
  203 + rerank_window: int = 384,
  204 + exact_knn_rescore_enabled: bool = False,
  205 + exact_knn_rescore_window: int = 0,
  206 +):
189 return SearchConfig( 207 return SearchConfig(
190 field_boosts={"title.en": 3.0}, 208 field_boosts={"title.en": 3.0},
191 indexes=[IndexConfig(name="default", label="default", fields=["title.en"])], 209 indexes=[IndexConfig(name="default", label="default", fields=["title.en"])],
192 query_config=QueryConfig(enable_text_embedding=False, enable_query_rewrite=False), 210 query_config=QueryConfig(enable_text_embedding=False, enable_query_rewrite=False),
193 function_score=FunctionScoreConfig(), 211 function_score=FunctionScoreConfig(),
194 - rerank=RerankConfig(enabled=rerank_enabled, rerank_window=rerank_window), 212 + rerank=RerankConfig(
  213 + enabled=rerank_enabled,
  214 + rerank_window=rerank_window,
  215 + exact_knn_rescore_enabled=exact_knn_rescore_enabled,
  216 + exact_knn_rescore_window=exact_knn_rescore_window,
  217 + ),
195 spu_config=SPUConfig(enabled=False), 218 spu_config=SPUConfig(enabled=False),
196 es_index_name="test_products", 219 es_index_name="test_products",
197 es_settings={}, 220 es_settings={},
@@ -289,7 +312,11 @@ def test_config_loader_rerank_enabled_defaults_true(tmp_path: Path): @@ -289,7 +312,11 @@ def test_config_loader_rerank_enabled_defaults_true(tmp_path: Path):
289 }, 312 },
290 "spu_config": {"enabled": False}, 313 "spu_config": {"enabled": False},
291 "function_score": {"score_mode": "sum", "boost_mode": "multiply", "functions": []}, 314 "function_score": {"score_mode": "sum", "boost_mode": "multiply", "functions": []},
292 - "rerank": {"rerank_window": 384}, 315 + "rerank": {
  316 + "rerank_window": 384,
  317 + "exact_knn_rescore_enabled": True,
  318 + "exact_knn_rescore_window": 160,
  319 + },
293 } 320 }
294 config_path = tmp_path / "config.yaml" 321 config_path = tmp_path / "config.yaml"
295 config_path.write_text(yaml.safe_dump(config_data), encoding="utf-8") 322 config_path.write_text(yaml.safe_dump(config_data), encoding="utf-8")
@@ -298,6 +325,8 @@ def test_config_loader_rerank_enabled_defaults_true(tmp_path: Path): @@ -298,6 +325,8 @@ def test_config_loader_rerank_enabled_defaults_true(tmp_path: Path):
298 loaded = loader.load_config(validate=False) 325 loaded = loader.load_config(validate=False)
299 326
300 assert loaded.rerank.enabled is True 327 assert loaded.rerank.enabled is True
  328 + assert loaded.rerank.exact_knn_rescore_enabled is True
  329 + assert loaded.rerank.exact_knn_rescore_window == 160
301 330
302 331
303 def test_config_loader_parses_named_rerank_instances(tmp_path: Path): 332 def test_config_loader_parses_named_rerank_instances(tmp_path: Path):
@@ -583,7 +612,7 @@ def test_searcher_rerank_prefetch_source_includes_sku_fields_when_style_intent_a @@ -583,7 +612,7 @@ def test_searcher_rerank_prefetch_source_includes_sku_fields_when_style_intent_a
583 } 612 }
584 613
585 614
586 -def test_searcher_skips_rerank_when_request_explicitly_false(monkeypatch): 615 +def test_searcher_keeps_previous_stage_order_when_request_explicitly_disables_rerank(monkeypatch):
587 es_client = _FakeESClient() 616 es_client = _FakeESClient()
588 searcher = _build_searcher(_build_search_config(rerank_enabled=True), es_client) 617 searcher = _build_searcher(_build_search_config(rerank_enabled=True), es_client)
589 context = create_request_context(reqid="t2", uid="u2") 618 context = create_request_context(reqid="t2", uid="u2")
@@ -593,28 +622,95 @@ def test_searcher_skips_rerank_when_request_explicitly_false(monkeypatch): @@ -593,28 +622,95 @@ def test_searcher_skips_rerank_when_request_explicitly_false(monkeypatch):
593 lambda: SimpleNamespace(get_tenant_config=lambda tenant_id: {"index_languages": ["en"]}), 622 lambda: SimpleNamespace(get_tenant_config=lambda tenant_id: {"index_languages": ["en"]}),
594 ) 623 )
595 624
596 - called: Dict[str, int] = {"count": 0} 625 + called: Dict[str, int] = {"count": 0, "fine": 0}
  626 +
  627 + def _fake_run_lightweight_rerank(**kwargs):
  628 + called["fine"] += 1
  629 + hits = kwargs["es_hits"]
  630 + for idx, hit in enumerate(hits):
  631 + hit["_fine_score"] = float(idx + 1)
  632 + hits.reverse()
  633 + return [hit["_fine_score"] for hit in hits], {"stage": "fine"}, []
597 634
598 def _fake_run_rerank(**kwargs): 635 def _fake_run_rerank(**kwargs):
599 called["count"] += 1 636 called["count"] += 1
600 return kwargs["es_response"], None, [] 637 return kwargs["es_response"], None, []
601 638
  639 + monkeypatch.setattr("search.rerank_client.run_lightweight_rerank", _fake_run_lightweight_rerank)
602 monkeypatch.setattr("search.rerank_client.run_rerank", _fake_run_rerank) 640 monkeypatch.setattr("search.rerank_client.run_rerank", _fake_run_rerank)
603 641
604 - searcher.search( 642 + result = searcher.search(
605 query="toy", 643 query="toy",
606 tenant_id="162", 644 tenant_id="162",
607 from_=20, 645 from_=20,
608 size=10, 646 size=10,
609 context=context, 647 context=context,
610 enable_rerank=False, 648 enable_rerank=False,
  649 + debug=True,
611 ) 650 )
612 651
613 assert called["count"] == 0 652 assert called["count"] == 0
614 - assert es_client.calls[0]["from_"] == 20  
615 - assert es_client.calls[0]["size"] == 10  
616 - assert es_client.calls[0]["include_named_queries_score"] is False  
617 - assert len(es_client.calls) == 1 653 + assert called["fine"] == 1
  654 + assert es_client.calls[0]["from_"] == 0
  655 + assert es_client.calls[0]["size"] == searcher.config.coarse_rank.input_window
  656 + assert es_client.calls[0]["include_named_queries_score"] is True
  657 + assert len(es_client.calls) == 3
  658 + assert es_client.calls[2]["body"]["query"]["ids"]["values"] == [str(i) for i in range(363, 353, -1)]
  659 + assert len(result.results) == 10
  660 + assert [item.spu_id for item in result.results[:3]] == ["363", "362", "361"]
  661 + assert result.debug_info["rerank"]["enabled"] is False
  662 + assert result.debug_info["rerank"]["applied"] is False
  663 + assert result.debug_info["rerank"]["skipped_reason"] == "disabled"
  664 + assert result.debug_info["per_result"][0]["ranking_funnel"]["rerank"]["rank"] == 21
  665 +
  666 +
  667 +def test_searcher_keeps_previous_stage_order_when_config_disables_rerank(monkeypatch):
  668 + es_client = _FakeESClient()
  669 + searcher = _build_searcher(_build_search_config(rerank_enabled=False), es_client)
  670 + context = create_request_context(reqid="t2b", uid="u2b")
  671 +
  672 + monkeypatch.setattr(
  673 + "search.searcher.get_tenant_config_loader",
  674 + lambda: SimpleNamespace(get_tenant_config=lambda tenant_id: {"index_languages": ["en"]}),
  675 + )
  676 +
  677 + called: Dict[str, int] = {"count": 0, "fine": 0}
  678 +
  679 + def _fake_run_lightweight_rerank(**kwargs):
  680 + called["fine"] += 1
  681 + hits = kwargs["es_hits"]
  682 + hits.reverse()
  683 + for idx, hit in enumerate(hits):
  684 + hit["_fine_score"] = float(len(hits) - idx)
  685 + return [hit["_fine_score"] for hit in hits], {"stage": "fine"}, []
  686 +
  687 + def _fake_run_rerank(**kwargs):
  688 + called["count"] += 1
  689 + return kwargs["es_response"], None, []
  690 +
  691 + monkeypatch.setattr("search.rerank_client.run_lightweight_rerank", _fake_run_lightweight_rerank)
  692 + monkeypatch.setattr("search.rerank_client.run_rerank", _fake_run_rerank)
  693 +
  694 + result = searcher.search(
  695 + query="toy",
  696 + tenant_id="162",
  697 + from_=0,
  698 + size=5,
  699 + context=context,
  700 + enable_rerank=None,
  701 + debug=True,
  702 + )
  703 +
  704 + assert called["count"] == 0
  705 + assert called["fine"] == 1
  706 + assert es_client.calls[0]["from_"] == 0
  707 + assert es_client.calls[0]["size"] == searcher.config.coarse_rank.input_window
  708 + assert es_client.calls[0]["include_named_queries_score"] is True
  709 + assert len(result.results) == 5
  710 + assert [item.spu_id for item in result.results] == ["383", "382", "381", "380", "379"]
  711 + assert result.debug_info["rerank"]["enabled"] is False
  712 + assert result.debug_info["rerank"]["applied"] is False
  713 + assert result.debug_info["rerank"]["skipped_reason"] == "disabled"
618 714
619 715
620 def test_searcher_skips_rerank_when_page_exceeds_window(monkeypatch): 716 def test_searcher_skips_rerank_when_page_exceeds_window(monkeypatch):
@@ -919,7 +1015,8 @@ def test_searcher_promotes_sku_by_embedding_when_query_has_no_direct_option_matc @@ -919,7 +1015,8 @@ def test_searcher_promotes_sku_by_embedding_when_query_has_no_direct_option_matc
919 1015
920 def test_searcher_debug_info_uses_initial_es_max_score_for_normalization(monkeypatch): 1016 def test_searcher_debug_info_uses_initial_es_max_score_for_normalization(monkeypatch):
921 es_client = _FakeESClient(total_hits=3) 1017 es_client = _FakeESClient(total_hits=3)
922 - searcher = _build_searcher(_build_search_config(rerank_enabled=False), es_client) 1018 + cfg = _build_search_config(rerank_enabled=False)
  1019 + searcher = _build_searcher(cfg, es_client)
923 context = create_request_context(reqid="dbg", uid="u-dbg") 1020 context = create_request_context(reqid="dbg", uid="u-dbg")
924 1021
925 monkeypatch.setattr( 1022 monkeypatch.setattr(
@@ -939,7 +1036,8 @@ def test_searcher_debug_info_uses_initial_es_max_score_for_normalization(monkeyp @@ -939,7 +1036,8 @@ def test_searcher_debug_info_uses_initial_es_max_score_for_normalization(monkeyp
939 1036
940 assert result.debug_info["query_analysis"]["index_languages"] == ["en", "zh"] 1037 assert result.debug_info["query_analysis"]["index_languages"] == ["en", "zh"]
941 assert result.debug_info["query_analysis"]["query_tokens"] == [] 1038 assert result.debug_info["query_analysis"]["query_tokens"] == []
942 - assert result.debug_info["es_query_context"]["es_fetch_size"] == 2 1039 + expected_es_fetch = max(cfg.rerank.rerank_window, cfg.coarse_rank.input_window)
  1040 + assert result.debug_info["es_query_context"]["es_fetch_size"] == expected_es_fetch
943 assert result.debug_info["es_response"]["es_score_normalization_factor"] == 3.0 1041 assert result.debug_info["es_response"]["es_score_normalization_factor"] == 3.0
944 assert result.debug_info["per_result"][0]["initial_rank"] == 1 1042 assert result.debug_info["per_result"][0]["initial_rank"] == 1
945 assert result.debug_info["per_result"][0]["final_rank"] == 1 1043 assert result.debug_info["per_result"][0]["final_rank"] == 1
@@ -947,6 +1045,166 @@ def test_searcher_debug_info_uses_initial_es_max_score_for_normalization(monkeyp @@ -947,6 +1045,166 @@ def test_searcher_debug_info_uses_initial_es_max_score_for_normalization(monkeyp
947 assert result.debug_info["per_result"][1]["es_score_normalized"] == 2.0 / 3.0 1045 assert result.debug_info["per_result"][1]["es_score_normalized"] == 2.0 / 3.0
948 1046
949 1047
  1048 +def test_searcher_attaches_exact_knn_rescore_for_rank_window(monkeypatch):
  1049 + class _VectorQueryParser:
  1050 + def parse(self, query: str, tenant_id: str, generate_vector: bool, context: Any, target_languages: Any = None):
  1051 + return _FakeParsedQuery(
  1052 + original_query=query,
  1053 + query_normalized=query,
  1054 + rewritten_query=query,
  1055 + translations={},
  1056 + query_vector=np.array([0.1, 0.2, 0.3], dtype=np.float32),
  1057 + image_query_vector=np.array([0.4, 0.5, 0.6], dtype=np.float32),
  1058 + query_tokens=["dress", "formal", "spring", "summer", "floral"],
  1059 + )
  1060 +
  1061 + es_client = _FakeESClient(total_hits=5)
  1062 + base = _build_search_config(
  1063 + rerank_enabled=True,
  1064 + rerank_window=5,
  1065 + exact_knn_rescore_enabled=True,
  1066 + exact_knn_rescore_window=3,
  1067 + )
  1068 + config = SearchConfig(
  1069 + field_boosts=base.field_boosts,
  1070 + indexes=base.indexes,
  1071 + query_config=QueryConfig(
  1072 + enable_text_embedding=True,
  1073 + enable_query_rewrite=False,
  1074 + text_embedding_field="title_embedding",
  1075 + image_embedding_field="image_embedding.vector",
  1076 + ),
  1077 + function_score=base.function_score,
  1078 + coarse_rank=base.coarse_rank,
  1079 + fine_rank=FineRankConfig(enabled=False, input_window=5, output_window=5),
  1080 + rerank=base.rerank,
  1081 + spu_config=base.spu_config,
  1082 + es_index_name=base.es_index_name,
  1083 + es_settings=base.es_settings,
  1084 + )
  1085 + searcher = Searcher(
  1086 + es_client=es_client,
  1087 + config=config,
  1088 + query_parser=_VectorQueryParser(),
  1089 + image_encoder=SimpleNamespace(),
  1090 + )
  1091 + context = create_request_context(reqid="exact-rescore", uid="u-exact")
  1092 +
  1093 + monkeypatch.setattr(
  1094 + "search.searcher.get_tenant_config_loader",
  1095 + lambda: SimpleNamespace(get_tenant_config=lambda tenant_id: {"index_languages": ["en"]}),
  1096 + )
  1097 +
  1098 + searcher.search(
  1099 + query="dress",
  1100 + tenant_id="162",
  1101 + from_=0,
  1102 + size=2,
  1103 + context=context,
  1104 + enable_rerank=False,
  1105 + debug=True,
  1106 + )
  1107 +
  1108 + body = es_client.calls[0]["body"]
  1109 + assert body["rescore"]["window_size"] == 3
  1110 + assert body["rescore"]["query"]["score_mode"] == "total"
  1111 + assert body["rescore"]["query"]["rescore_query_weight"] == 0.0
  1112 + should = body["rescore"]["query"]["rescore_query"]["bool"]["should"]
  1113 + names = []
  1114 + for clause in should:
  1115 + if "script_score" in clause:
  1116 + names.append(clause["script_score"]["_name"])
  1117 + elif "nested" in clause:
  1118 + names.append(clause["nested"]["_name"])
  1119 + assert names == ["exact_text_knn_query", "exact_image_knn_query"]
  1120 + recall_query = body["query"]
  1121 + if "bool" in recall_query and recall_query["bool"].get("must"):
  1122 + recall_query = recall_query["bool"]["must"][0]
  1123 + if "function_score" in recall_query:
  1124 + recall_query = recall_query["function_score"]["query"]
  1125 + recall_should = recall_query["bool"]["should"]
  1126 + text_knn_clause = next(
  1127 + clause["knn"]
  1128 + for clause in recall_should
  1129 + if clause.get("knn", {}).get("_name") == "knn_query"
  1130 + )
  1131 + image_knn_clause = next(
  1132 + clause["nested"]["query"]["knn"]
  1133 + for clause in recall_should
  1134 + if clause.get("nested", {}).get("_name") == "image_knn_query"
  1135 + )
  1136 + exact_text_clause = next(
  1137 + clause["script_score"]
  1138 + for clause in should
  1139 + if clause.get("script_score", {}).get("_name") == "exact_text_knn_query"
  1140 + )
  1141 + exact_image_clause = next(
  1142 + clause["nested"]["query"]["script_score"]
  1143 + for clause in should
  1144 + if clause.get("nested", {}).get("_name") == "exact_image_knn_query"
  1145 + )
  1146 + assert text_knn_clause["boost"] == 28.0
  1147 + assert exact_text_clause["script"]["params"]["boost"] == text_knn_clause["boost"]
  1148 + assert image_knn_clause["boost"] == 20.0
  1149 + assert exact_image_clause["script"]["params"]["boost"] == image_knn_clause["boost"]
  1150 +
  1151 +
  1152 +def test_searcher_skips_exact_knn_rescore_outside_rank_window(monkeypatch):
  1153 + class _VectorQueryParser:
  1154 + def parse(self, query: str, tenant_id: str, generate_vector: bool, context: Any, target_languages: Any = None):
  1155 + return _FakeParsedQuery(
  1156 + original_query=query,
  1157 + query_normalized=query,
  1158 + rewritten_query=query,
  1159 + translations={},
  1160 + query_vector=np.array([0.1, 0.2, 0.3], dtype=np.float32),
  1161 + )
  1162 +
  1163 + es_client = _FakeESClient(total_hits=20)
  1164 + base = _build_search_config(
  1165 + rerank_enabled=True,
  1166 + rerank_window=5,
  1167 + exact_knn_rescore_enabled=True,
  1168 + exact_knn_rescore_window=4,
  1169 + )
  1170 + config = SearchConfig(
  1171 + field_boosts=base.field_boosts,
  1172 + indexes=base.indexes,
  1173 + query_config=QueryConfig(
  1174 + enable_text_embedding=True,
  1175 + enable_query_rewrite=False,
  1176 + text_embedding_field="title_embedding",
  1177 + ),
  1178 + function_score=base.function_score,
  1179 + coarse_rank=base.coarse_rank,
  1180 + fine_rank=FineRankConfig(enabled=False, input_window=5, output_window=5),
  1181 + rerank=base.rerank,
  1182 + spu_config=base.spu_config,
  1183 + es_index_name=base.es_index_name,
  1184 + es_settings=base.es_settings,
  1185 + )
  1186 + searcher = _build_searcher(config, es_client)
  1187 + searcher.query_parser = _VectorQueryParser()
  1188 + context = create_request_context(reqid="exact-rescore-off", uid="u-exact-off")
  1189 +
  1190 + monkeypatch.setattr(
  1191 + "search.searcher.get_tenant_config_loader",
  1192 + lambda: SimpleNamespace(get_tenant_config=lambda tenant_id: {"index_languages": ["en"]}),
  1193 + )
  1194 +
  1195 + searcher.search(
  1196 + query="dress",
  1197 + tenant_id="162",
  1198 + from_=5,
  1199 + size=2,
  1200 + context=context,
  1201 + enable_rerank=False,
  1202 + )
  1203 +
  1204 + body = es_client.calls[0]["body"]
  1205 + assert "rescore" not in body
  1206 +
  1207 +
950 def test_searcher_rerank_rank_change_falls_back_to_coarse_rank_when_fine_disabled(monkeypatch): 1208 def test_searcher_rerank_rank_change_falls_back_to_coarse_rank_when_fine_disabled(monkeypatch):
951 es_client = _FakeESClient(total_hits=5) 1209 es_client = _FakeESClient(total_hits=5)
952 config = _build_search_config(rerank_enabled=True, rerank_window=5) 1210 config = _build_search_config(rerank_enabled=True, rerank_window=5)
@@ -970,6 +1228,12 @@ def test_searcher_rerank_rank_change_falls_back_to_coarse_rank_when_fine_disable @@ -970,6 +1228,12 @@ def test_searcher_rerank_rank_change_falls_back_to_coarse_rank_when_fine_disable
970 lambda: SimpleNamespace(get_tenant_config=lambda tenant_id: {"index_languages": ["en"]}), 1228 lambda: SimpleNamespace(get_tenant_config=lambda tenant_id: {"index_languages": ["en"]}),
971 ) 1229 )
972 1230
  1231 + fine_called: Dict[str, int] = {"count": 0}
  1232 +
  1233 + def _fake_run_lightweight_rerank(**kwargs):
  1234 + fine_called["count"] += 1
  1235 + return [], {"stage": "fine"}, []
  1236 +
973 def _fake_run_rerank(**kwargs): 1237 def _fake_run_rerank(**kwargs):
974 hits = kwargs["es_response"]["hits"]["hits"] 1238 hits = kwargs["es_response"]["hits"]["hits"]
975 hits.reverse() 1239 hits.reverse()
@@ -994,6 +1258,7 @@ def test_searcher_rerank_rank_change_falls_back_to_coarse_rank_when_fine_disable @@ -994,6 +1258,7 @@ def test_searcher_rerank_rank_change_falls_back_to_coarse_rank_when_fine_disable
994 ) 1258 )
995 return kwargs["es_response"], {"model": "final-reranker"}, fused_debug 1259 return kwargs["es_response"], {"model": "final-reranker"}, fused_debug
996 1260
  1261 + monkeypatch.setattr("search.rerank_client.run_lightweight_rerank", _fake_run_lightweight_rerank)
997 monkeypatch.setattr("search.rerank_client.run_rerank", _fake_run_rerank) 1262 monkeypatch.setattr("search.rerank_client.run_rerank", _fake_run_rerank)
998 1263
999 result = searcher.search( 1264 result = searcher.search(
@@ -1008,7 +1273,12 @@ def test_searcher_rerank_rank_change_falls_back_to_coarse_rank_when_fine_disable @@ -1008,7 +1273,12 @@ def test_searcher_rerank_rank_change_falls_back_to_coarse_rank_when_fine_disable
1008 1273
1009 per_result = {row["spu_id"]: row for row in result.debug_info["per_result"]} 1274 per_result = {row["spu_id"]: row for row in result.debug_info["per_result"]}
1010 moved = per_result["4"]["ranking_funnel"] 1275 moved = per_result["4"]["ranking_funnel"]
1011 - assert moved["fine_rank"]["rank"] is None 1276 + assert fine_called["count"] == 0
  1277 + assert result.debug_info["fine_rank"]["enabled"] is False
  1278 + assert result.debug_info["fine_rank"]["applied"] is False
  1279 + assert result.debug_info["fine_rank"]["skipped_reason"] == "disabled"
  1280 + assert moved["fine_rank"]["rank"] == 5
  1281 + assert moved["fine_rank"]["rank_change"] == 0
1012 assert moved["rerank"]["rank"] == 1 1282 assert moved["rerank"]["rank"] == 1
1013 assert moved["rerank"]["rank_change"] == 4 1283 assert moved["rerank"]["rank_change"] == 4
1014 assert moved["final_page"]["rank_change"] == 0 1284 assert moved["final_page"]["rank_change"] == 0
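The two exact-kNN rescore tests above pin down a specific request shape. A minimal sketch of a builder that produces that shape follows; the function name `build_exact_knn_rescore`, the `exists` filter queries, and the script sources are illustrative assumptions, not the real query builder:

```python
from typing import Any, Dict, List


def build_exact_knn_rescore(
    query_vector: List[float],
    image_vector: List[float],
    *,
    window_size: int,
    text_boost: float,
    image_boost: float,
) -> Dict[str, Any]:
    # score_mode "total" with rescore_query_weight 0.0 leaves the original
    # ranking contribution intact while the named exact-kNN clauses are still
    # evaluated over the top window_size hits.
    text_clause = {
        "script_score": {
            "_name": "exact_text_knn_query",
            "query": {"exists": {"field": "title_embedding"}},
            "script": {
                "source": "cosineSimilarity(params.vector, 'title_embedding') * params.boost",
                "params": {"vector": query_vector, "boost": text_boost},
            },
        }
    }
    image_clause = {
        "nested": {
            "_name": "exact_image_knn_query",
            "path": "image_embedding",
            "query": {
                "script_score": {
                    "query": {"exists": {"field": "image_embedding.vector"}},
                    "script": {
                        "source": "cosineSimilarity(params.vector, 'image_embedding.vector') * params.boost",
                        "params": {"vector": image_vector, "boost": image_boost},
                    },
                }
            },
        }
    }
    return {
        "window_size": window_size,
        "query": {
            "score_mode": "total",
            "query_weight": 1.0,
            "rescore_query_weight": 0.0,
            "rescore_query": {"bool": {"should": [text_clause, image_clause]}},
        },
    }
```

The assertions in `test_searcher_attaches_exact_knn_rescore_for_rank_window` walk exactly this structure: `window_size`, `score_mode`, `rescore_query_weight`, and the `_name` of each `should` clause.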
tests/test_translation_converter_resolution.py 0 → 100644
@@ -0,0 +1,85 @@ @@ -0,0 +1,85 @@
  1 +from __future__ import annotations
  2 +
  3 +import sys
  4 +import types
  5 +
  6 +import pytest
  7 +
  8 +import translation.ct2_conversion as ct2_conversion
  9 +
  10 +
  11 +class _FakeTransformersConverter:
  12 + def __init__(self, model_name_or_path):
  13 + self.model_name_or_path = model_name_or_path
  14 + self.load_calls = []
  15 +
  16 + def load_model(self, model_class, resolved_model_name_or_path, **kwargs):
  17 + self.load_calls.append(
  18 + {
  19 + "model_class": model_class,
  20 + "resolved_model_name_or_path": resolved_model_name_or_path,
  21 + "kwargs": dict(kwargs),
  22 + }
  23 + )
  24 + if "dtype" in kwargs or "torch_dtype" in kwargs:
  25 + raise TypeError("M2M100ForConditionalGeneration.__init__() got an unexpected keyword argument 'dtype'")
  26 + return {"loaded": True, "path": resolved_model_name_or_path}
  27 +
  28 + def convert(self, output_dir, quantization=None, force=False):
  29 + loaded = self.load_model("FakeModel", self.model_name_or_path, dtype="float32")
  30 + return {
  31 + "loaded": loaded,
  32 + "output_dir": output_dir,
  33 + "quantization": quantization,
  34 + "force": force,
  35 + "load_calls": list(self.load_calls),
  36 + }
  37 +
  38 +
  39 +def _install_fake_ctranslate2(monkeypatch, base_converter):
  40 + converters_module = types.ModuleType("ctranslate2.converters")
  41 + converters_module.TransformersConverter = base_converter
  42 + ctranslate2_module = types.ModuleType("ctranslate2")
  43 + ctranslate2_module.converters = converters_module
  44 +
  45 + monkeypatch.setitem(sys.modules, "ctranslate2", ctranslate2_module)
  46 + monkeypatch.setitem(sys.modules, "ctranslate2.converters", converters_module)
  47 +
  48 +
  49 +def test_convert_transformers_model_retries_without_torch_dtype(monkeypatch):
  50 + _install_fake_ctranslate2(monkeypatch, _FakeTransformersConverter)
  51 + fake_transformers = types.ModuleType("transformers")
  52 + fake_transformers.AutoConfig = types.SimpleNamespace(
  53 + from_pretrained=lambda path: types.SimpleNamespace(torch_dtype="float32", path=path)
  54 + )
  55 + monkeypatch.setitem(sys.modules, "transformers", fake_transformers)
  56 +
  57 + result = ct2_conversion.convert_transformers_model("fake-model", "/tmp/out", "float16")
  58 +
  59 + assert result["loaded"] == {"loaded": True, "path": "fake-model"}
  60 + assert result["output_dir"] == "/tmp/out"
  61 + assert result["quantization"] == "float16"
  62 + assert result["force"] is False
  63 + assert len(result["load_calls"]) == 2
  64 + assert result["load_calls"][0] == {
  65 + "model_class": "FakeModel",
  66 + "resolved_model_name_or_path": "fake-model",
  67 + "kwargs": {"dtype": "float32"},
  68 + }
  69 + assert result["load_calls"][1]["model_class"] == "FakeModel"
  70 + assert result["load_calls"][1]["resolved_model_name_or_path"] == "fake-model"
  71 + assert getattr(result["load_calls"][1]["kwargs"]["config"], "torch_dtype", "missing") is None
  72 +
  73 +
  74 +def test_convert_transformers_model_preserves_unrelated_type_errors(monkeypatch):
  75 + class _AlwaysFailingConverter(_FakeTransformersConverter):
  76 + def load_model(self, model_class, resolved_model_name_or_path, **kwargs):
  77 + raise TypeError("different constructor error")
  78 +
  79 + _install_fake_ctranslate2(monkeypatch, _AlwaysFailingConverter)
  80 + fake_transformers = types.ModuleType("transformers")
  81 + fake_transformers.AutoConfig = types.SimpleNamespace(from_pretrained=lambda path: types.SimpleNamespace(path=path))
  82 + monkeypatch.setitem(sys.modules, "transformers", fake_transformers)
  83 +
  84 + with pytest.raises(TypeError, match="different constructor error"):
  85 + ct2_conversion.convert_transformers_model("fake-model", "/tmp/out", "float16")
tests/test_translation_local_backends.py
@@ -201,6 +201,51 @@ def test_nllb_ctranslate2_accepts_finnish_short_code(monkeypatch): @@ -201,6 +201,51 @@ def test_nllb_ctranslate2_accepts_finnish_short_code(monkeypatch):
201 assert backend.translator.last_translate_batch_kwargs["target_prefix"] == [["zho_Hans"]] 201 assert backend.translator.last_translate_batch_kwargs["target_prefix"] == [["zho_Hans"]]
202 202
203 203
  204 +def test_nllb_ctranslate2_falls_back_to_model_id_when_local_dir_is_wrong_type(tmp_path, monkeypatch):
  205 + wrong_dir = tmp_path / "wrong-nllb"
  206 + wrong_dir.mkdir()
  207 + (wrong_dir / "config.json").write_text('{"model_type":"led"}', encoding="utf-8")
  208 +
  209 + monkeypatch.setattr(NLLBCTranslate2TranslationBackend, "_load_runtime", _stub_load_ct2_runtime)
  210 +
  211 + backend = NLLBCTranslate2TranslationBackend(
  212 + name="nllb-200-distilled-600m",
  213 + model_id="facebook/nllb-200-distilled-600M",
  214 + model_dir=str(wrong_dir),
  215 + device="cpu",
  216 + torch_dtype="float32",
  217 + batch_size=1,
  218 + max_input_length=16,
  219 + max_new_tokens=16,
  220 + num_beams=1,
  221 + )
  222 +
  223 + assert backend._model_source() == "facebook/nllb-200-distilled-600M"
  224 + assert backend._tokenizer_source() == "facebook/nllb-200-distilled-600M"
  225 +
  226 +
  227 +def test_nllb_ctranslate2_falls_back_to_model_id_when_local_dir_is_incomplete(tmp_path, monkeypatch):
  228 + incomplete_dir = tmp_path / "incomplete-nllb"
  229 + incomplete_dir.mkdir()
  230 + (incomplete_dir / "ctranslate2-float16").mkdir()
  231 +
  232 + monkeypatch.setattr(NLLBCTranslate2TranslationBackend, "_load_runtime", _stub_load_ct2_runtime)
  233 +
  234 + backend = NLLBCTranslate2TranslationBackend(
  235 + name="nllb-200-distilled-600m",
  236 + model_id="facebook/nllb-200-distilled-600M",
  237 + model_dir=str(incomplete_dir),
  238 + device="cpu",
  239 + torch_dtype="float32",
  240 + batch_size=1,
  241 + max_input_length=16,
  242 + max_new_tokens=16,
  243 + num_beams=1,
  244 + )
  245 +
  246 + assert backend._model_source() == "facebook/nllb-200-distilled-600M"
  247 +
  248 +
204 def test_nllb_resolves_flores_short_tags_and_iso_no(): 249 def test_nllb_resolves_flores_short_tags_and_iso_no():
205 cat = build_nllb_language_catalog(None) 250 cat = build_nllb_language_catalog(None)
206 assert resolve_nllb_language_code("ca", cat) == "cat_Latn" 251 assert resolve_nllb_language_code("ca", cat) == "cat_Latn"
tests/test_translator_failure_semantics.py
@@ -197,6 +197,73 @@ def test_translation_route_log_focuses_on_routing_decision(monkeypatch, caplog): @@ -197,6 +197,73 @@ def test_translation_route_log_focuses_on_routing_decision(monkeypatch, caplog):
197 ] 197 ]
198 198
199 199
  200 +def test_service_skips_failed_backend_but_keeps_healthy_capabilities(monkeypatch):
  201 + monkeypatch.setattr(TranslationCache, "_init_redis_client", staticmethod(lambda: None))
  202 +
  203 + def _fake_create_backend(self, *, name, backend_type, cfg):
  204 + del self, backend_type, cfg
  205 + if name == "broken-nllb":
  206 + raise RuntimeError("broken model dir")
  207 +
  208 + class _Backend:
  209 + model = name
  210 +
  211 + @property
  212 + def supports_batch(self):
  213 + return True
  214 +
  215 + def translate(self, text, target_lang, source_lang=None, scene=None):
  216 + del target_lang, source_lang, scene
  217 + return text
  218 +
  219 + return _Backend()
  220 +
  221 + monkeypatch.setattr(TranslationService, "_create_backend", _fake_create_backend)
  222 + service = TranslationService(
  223 + {
  224 + "service_url": "http://127.0.0.1:6006",
  225 + "timeout_sec": 10.0,
  226 + "default_model": "llm",
  227 + "default_scene": "general",
  228 + "capabilities": {
  229 + "llm": {
  230 + "enabled": True,
  231 + "backend": "llm",
  232 + "model": "dummy-llm",
  233 + "base_url": "https://example.com",
  234 + "timeout_sec": 10.0,
  235 + "use_cache": True,
  236 + },
  237 + "broken-nllb": {
  238 + "enabled": True,
  239 + "backend": "local_nllb",
  240 + "model_id": "dummy",
  241 + "model_dir": "dummy",
  242 + "device": "cpu",
  243 + "torch_dtype": "float32",
  244 + "batch_size": 8,
  245 + "max_input_length": 16,
  246 + "max_new_tokens": 16,
  247 + "num_beams": 1,
  248 + "use_cache": True,
  249 + },
  250 + },
  251 + "cache": {
  252 + "ttl_seconds": 60,
  253 + "sliding_expiration": True,
  254 + },
  255 + }
  256 + )
  257 +
  258 + assert service.available_models == ["llm", "broken-nllb"]
  259 + assert service.loaded_models == ["llm"]
  260 + assert service.failed_models == ["broken-nllb"]
  261 + assert service.backend_errors["broken-nllb"] == "broken model dir"
  262 +
  263 + with pytest.raises(RuntimeError, match="failed to initialize"):
  264 + service.get_backend("broken-nllb")
  265 +
  266 +
200 def test_translation_cache_probe_models_order(): 267 def test_translation_cache_probe_models_order():
201 cfg = {"cache": {"model_quality_tiers": {"low": 10, "high": 50, "mid": 30}}} 268 cfg = {"cache": {"model_quality_tiers": {"low": 10, "high": 50, "mid": 30}}}
202 assert translation_cache_probe_models(cfg, "low") == ["high", "mid", "low"] 269 assert translation_cache_probe_models(cfg, "low") == ["high", "mid", "low"]
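The failure-tolerant startup asserted in `test_service_skips_failed_backend_but_keeps_healthy_capabilities` can be sketched as a small registry. The class name `BackendRegistry` and its constructor shape are assumptions for illustration; the real `TranslationService` wires this into its config handling:

```python
from typing import Any, Callable, Dict


class BackendRegistry:
    """Initialize every enabled capability; record (rather than raise) per-
    backend failures so healthy capabilities stay usable."""

    def __init__(self, capabilities: Dict[str, Dict[str, Any]], create_backend: Callable[..., Any]):
        self.available_models = [n for n, c in capabilities.items() if c.get("enabled")]
        self.backends: Dict[str, Any] = {}
        self.backend_errors: Dict[str, str] = {}
        for name in self.available_models:
            try:
                self.backends[name] = create_backend(name=name, cfg=capabilities[name])
            except Exception as exc:  # one broken backend must not abort startup
                self.backend_errors[name] = str(exc)

    @property
    def loaded_models(self):
        return list(self.backends)

    @property
    def failed_models(self):
        return list(self.backend_errors)

    def get_backend(self, name: str):
        # Deferred failure: only requests that target the broken backend error out.
        if name in self.backend_errors:
            raise RuntimeError(f"backend {name!r} failed to initialize: {self.backend_errors[name]}")
        return self.backends[name]
```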
1 -Subproject commit 03410570d4398084f5ca5c88ad968248e0f3fc5d 1 +Subproject commit 4450c293368655449f14b5fc89e1d06e28d7f307
translation/README.md
@@ -11,9 +11,9 @@ @@ -11,9 +11,9 @@
11 Related scripts and reports: 11 Related scripts and reports:
12 - Startup script: [`scripts/start_translator.sh`](/data/saas-search/scripts/start_translator.sh) 12 - Startup script: [`scripts/start_translator.sh`](/data/saas-search/scripts/start_translator.sh)
13 - Virtual environment: [`scripts/setup_translator_venv.sh`](/data/saas-search/scripts/setup_translator_venv.sh) 13 - Virtual environment: [`scripts/setup_translator_venv.sh`](/data/saas-search/scripts/setup_translator_venv.sh)
14 -- Model download: [`scripts/download_translation_models.py`](/data/saas-search/scripts/download_translation_models.py)  
15 -- Local model benchmark: [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py)  
16 -- Focused benchmark script: [`scripts/benchmark_translation_local_models_focus.py`](/data/saas-search/scripts/benchmark_translation_local_models_focus.py) 14 +- Model download: [`scripts/translation/download_translation_models.py`](/data/saas-search/scripts/translation/download_translation_models.py)
  15 +- Local model benchmark: [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py)
  16 +- Focused benchmark script: [`benchmarks/translation/benchmark_translation_local_models_focus.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models_focus.py)
17 - Baseline performance report: [`perf_reports/20260318/translation_local_models/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models/README.md) 17 - Baseline performance report: [`perf_reports/20260318/translation_local_models/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models/README.md)
18 - CT2 extended report: [`perf_reports/20260318/translation_local_models_ct2/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2/README.md) 18 - CT2 extended report: [`perf_reports/20260318/translation_local_models_ct2/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2/README.md)
19 - CT2 focused tuning report: [`perf_reports/20260318/translation_local_models_ct2_focus/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2_focus/README.md) 19 - CT2 focused tuning report: [`perf_reports/20260318/translation_local_models_ct2_focus/README.md`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2_focus/README.md)
@@ -493,7 +493,7 @@ cd /data/saas-search
 下载全部本地模型:
 
 ```bash
-./.venv-translator/bin/python scripts/download_translation_models.py --all-local
+./.venv-translator/bin/python scripts/translation/download_translation_models.py --all-local
 ```
 
 下载完成后,默认目录应存在:
@@ -550,8 +550,8 @@ curl -X POST http://127.0.0.1:6006/translate \
 - 切换到 CTranslate2 后需要重新跑一轮基准,尤其关注 `nllb-200-distilled-600m` 的单条延迟、并发 tail latency 和 `opus-mt-*` 的 batch throughput。
 
 性能脚本:
-- [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py)
-- [`scripts/benchmark_translation_local_models_focus.py`](/data/saas-search/scripts/benchmark_translation_local_models_focus.py)
+- [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py)
+- [`benchmarks/translation/benchmark_translation_local_models_focus.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models_focus.py)
 
 数据集:
 - [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv)
@@ -601,14 +601,14 @@ curl -X POST http://127.0.0.1:6006/translate \
 
 ```bash
 cd /data/saas-search
-./.venv-translator/bin/python scripts/benchmark_translation_local_models.py
+./.venv-translator/bin/python benchmarks/translation/benchmark_translation_local_models.py
 ```
 
 本轮扩展压测复现命令:
 
 ```bash
 cd /data/saas-search
-./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
+./.venv-translator/bin/python benchmarks/translation/benchmark_translation_local_models.py \
   --suite extended \
   --disable-cache \
   --serial-items-per-case 256 \
@@ -620,7 +620,7 @@ cd /data/saas-search
 单模型扩展压测示例:
 
 ```bash
-./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
+./.venv-translator/bin/python benchmarks/translation/benchmark_translation_local_models.py \
   --single \
   --suite extended \
   --model opus-mt-zh-en \
@@ -639,7 +639,7 @@ cd /data/saas-search
 单条请求延迟复现:
 
 ```bash
-./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
+./.venv-translator/bin/python benchmarks/translation/benchmark_translation_local_models.py \
   --single \
   --suite extended \
   --model nllb-200-distilled-600m \
translation/backends/local_ctranslate2.py
@@ -4,9 +4,7 @@ from __future__ import annotations
 
 import logging
 import os
-import shutil
-import subprocess
-import sys
+import json
 import threading
 from pathlib import Path
 from typing import Dict, List, Optional, Sequence, Union
@@ -24,6 +22,7 @@ from translation.text_splitter import (
     join_translated_segments,
     split_text_for_translation,
 )
+from translation.ct2_conversion import convert_transformers_model
 
 logger = logging.getLogger(__name__)
 
@@ -76,17 +75,18 @@ def _derive_ct2_model_dir(model_dir: str, compute_type: str) -> str:
     return str(Path(model_dir).expanduser() / f"ctranslate2-{normalized}")
 
 
-def _resolve_converter_binary() -> str:
-    candidate = shutil.which("ct2-transformers-converter")
-    if candidate:
-        return candidate
-    venv_candidate = Path(sys.executable).absolute().parent / "ct2-transformers-converter"
-    if venv_candidate.exists():
-        return str(venv_candidate)
-    raise RuntimeError(
-        "ct2-transformers-converter was not found. "
-        "Ensure ctranslate2 is installed in the active translator environment."
-    )
+def _detect_local_model_type(model_dir: str) -> Optional[str]:
+    config_path = Path(model_dir).expanduser() / "config.json"
+    if not config_path.exists():
+        return None
+    try:
+        with open(config_path, "r", encoding="utf-8") as handle:
+            payload = json.load(handle) or {}
+    except Exception as exc:
+        logger.warning("Failed to inspect local translation config %s: %s", config_path, exc)
+        return None
+    model_type = str(payload.get("model_type") or "").strip().lower()
+    return model_type or None
 
 
 class LocalCTranslate2TranslationBackend:
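The new `_detect_local_model_type` helper reads the Hugging Face `config.json` in a local model directory and returns its `model_type`, tolerating missing or unreadable files. A minimal stdlib-only sketch of the same check (the helper name and demo payload here are illustrative, not taken from the repo):

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

def detect_local_model_type(model_dir: str) -> Optional[str]:
    """Return the lowercased `model_type` from config.json, or None if absent/unreadable."""
    config_path = Path(model_dir).expanduser() / "config.json"
    if not config_path.exists():
        return None
    try:
        payload = json.loads(config_path.read_text(encoding="utf-8")) or {}
    except (OSError, ValueError):
        # Corrupt or unreadable config: fall back to "unknown" rather than crash.
        return None
    model_type = str(payload.get("model_type") or "").strip().lower()
    return model_type or None

# Demo against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "config.json").write_text(json.dumps({"model_type": "Marian"}), encoding="utf-8")
    print(detect_local_model_type(tmp))             # -> marian
print(detect_local_model_type("/nonexistent-dir"))  # -> None
```

Returning `None` instead of raising is what lets the backend fall back to `model_id` when the local directory is incomplete.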
@@ -144,6 +144,7 @@ class LocalCTranslate2TranslationBackend:
         self.ct2_decoding_length_extra = int(ct2_decoding_length_extra)
         self.ct2_decoding_length_min = max(1, int(ct2_decoding_length_min))
         self._tokenizer_lock = threading.Lock()
+        self._local_model_source = self._resolve_local_model_source()
         self._load_runtime()
 
     @property
@@ -151,10 +152,44 @@ class LocalCTranslate2TranslationBackend:
         return True
 
     def _tokenizer_source(self) -> str:
-        return self.model_dir if os.path.exists(self.model_dir) else self.model_id
+        return self._local_model_source or self.model_id
 
     def _model_source(self) -> str:
-        return self.model_dir if os.path.exists(self.model_dir) else self.model_id
+        return self._local_model_source or self.model_id
+
+    def _expected_local_model_types(self) -> Optional[set[str]]:
+        return None
+
+    def _resolve_local_model_source(self) -> Optional[str]:
+        model_path = Path(self.model_dir).expanduser()
+        if not model_path.exists():
+            return None
+        if not (model_path / "config.json").exists():
+            logger.warning(
+                "Local translation model_dir is incomplete | model=%s model_dir=%s missing=config.json fallback=model_id",
+                self.model,
+                model_path,
+            )
+            return None
+
+        expected_types = self._expected_local_model_types()
+        if not expected_types:
+            return str(model_path)
+
+        detected_type = _detect_local_model_type(str(model_path))
+        if detected_type is None:
+            return str(model_path)
+        if detected_type in expected_types:
+            return str(model_path)
+
+        logger.warning(
+            "Local translation model_dir has unexpected model_type | model=%s model_dir=%s detected=%s expected=%s fallback=model_id",
+            self.model,
+            model_path,
+            detected_type,
+            sorted(expected_types),
+        )
+        return None
 
     def _tokenizer_kwargs(self) -> Dict[str, object]:
         return {}
@@ -204,7 +239,6 @@ class LocalCTranslate2TranslationBackend:
         )
 
         ct2_path.parent.mkdir(parents=True, exist_ok=True)
-        converter = _resolve_converter_binary()
        logger.info(
             "Converting translation model to CTranslate2 | name=%s source=%s output=%s quantization=%s",
             self.model,
@@ -213,25 +247,14 @@ class LocalCTranslate2TranslationBackend:
             self.ct2_conversion_quantization,
         )
         try:
-            subprocess.run(
-                [
-                    converter,
-                    "--model",
-                    model_source,
-                    "--output_dir",
-                    str(ct2_path),
-                    "--quantization",
-                    self.ct2_conversion_quantization,
-                ],
-                check=True,
-                stdout=subprocess.PIPE,
-                stderr=subprocess.PIPE,
-                text=True,
+            convert_transformers_model(
+                model_source,
+                str(ct2_path),
+                self.ct2_conversion_quantization,
             )
-        except subprocess.CalledProcessError as exc:
-            stderr = exc.stderr.strip()
+        except Exception as exc:
             raise RuntimeError(
-                f"Failed to convert model '{self.model}' to CTranslate2: {stderr or exc}"
+                f"Failed to convert model '{self.model}' to CTranslate2: {exc}"
             ) from exc
 
     def _normalize_texts(self, text: Union[str, Sequence[str]]) -> List[str]:
@@ -557,6 +580,9 @@ class MarianCTranslate2TranslationBackend(LocalCTranslate2TranslationBackend):
             f"Model '{self.model}' only supports target languages: {sorted(self.target_langs)}"
         )
 
+    def _expected_local_model_types(self) -> Optional[set[str]]:
+        return {"marian"}
+
 
 class NLLBCTranslate2TranslationBackend(LocalCTranslate2TranslationBackend):
     """Local backend for NLLB models on CTranslate2."""
@@ -619,6 +645,9 @@ class NLLBCTranslate2TranslationBackend(LocalCTranslate2TranslationBackend):
         if resolve_nllb_language_code(target_lang, self.language_codes) is None:
             raise ValueError(f"Unsupported NLLB target language: {target_lang}")
 
+    def _expected_local_model_types(self) -> Optional[set[str]]:
+        return {"m2m_100", "nllb_moe"}
+
     def _get_tokenizer_for_source(self, source_lang: str):
         src_code = resolve_nllb_language_code(source_lang, self.language_codes)
         if src_code is None:
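The `_expected_local_model_types` overrides above follow a template-method pattern: the base resolver asks each subclass which `model_type` values it may accept, and the base class answers `None` to mean "accept anything". A condensed sketch of that dispatch (class and method names here are illustrative stand-ins, not the repo's classes):

```python
from typing import Optional, Set

class BaseBackend:
    """Base resolver: subclasses narrow the accepted Hugging Face model types."""
    def _expected_local_model_types(self) -> Optional[Set[str]]:
        return None  # base class accepts any model_type

    def accepts(self, detected_type: Optional[str]) -> bool:
        expected = self._expected_local_model_types()
        if not expected or detected_type is None:
            return True  # nothing to validate against: keep the local directory
        return detected_type in expected

class MarianLike(BaseBackend):
    def _expected_local_model_types(self) -> Optional[Set[str]]:
        return {"marian"}

print(MarianLike().accepts("marian"))    # -> True
print(MarianLike().accepts("m2m_100"))   # -> False (falls back to model_id)
print(BaseBackend().accepts("m2m_100"))  # -> True
```

A rejected directory is only a fallback trigger, not an error, which matches the warning-then-`model_id` behavior in the diff.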
translation/ct2_conversion.py 0 → 100644
@@ -0,0 +1,52 @@
+"""Helpers for converting Hugging Face translation models to CTranslate2."""
+
+from __future__ import annotations
+
+import copy
+import logging
+
+logger = logging.getLogger(__name__)
+
+
+def convert_transformers_model(
+    model_name_or_path: str,
+    output_dir: str,
+    quantization: str,
+    *,
+    force: bool = False,
+) -> str:
+    from ctranslate2.converters import TransformersConverter
+    from transformers import AutoConfig
+
+    class _CompatibleTransformersConverter(TransformersConverter):
+        def load_model(self, model_class, resolved_model_name_or_path, **kwargs):
+            try:
+                return super().load_model(model_class, resolved_model_name_or_path, **kwargs)
+            except TypeError as exc:
+                if "unexpected keyword argument 'dtype'" not in str(exc):
+                    raise
+                if kwargs.get("dtype") is None and kwargs.get("torch_dtype") is None:
+                    raise
+
+                logger.warning(
+                    "Retrying CTranslate2 model load without dtype hints | model=%s class=%s",
+                    resolved_model_name_or_path,
+                    getattr(model_class, "__name__", model_class),
+                )
+                retry_kwargs = dict(kwargs)
+                retry_kwargs.pop("dtype", None)
+                retry_kwargs.pop("torch_dtype", None)
+                config = retry_kwargs.get("config")
+                if config is None:
+                    config = AutoConfig.from_pretrained(resolved_model_name_or_path)
+                else:
+                    config = copy.deepcopy(config)
+                if hasattr(config, "dtype"):
+                    config.dtype = None
+                if hasattr(config, "torch_dtype"):
+                    config.torch_dtype = None
+                retry_kwargs["config"] = config
+                return super().load_model(model_class, resolved_model_name_or_path, **retry_kwargs)
+
+    converter = _CompatibleTransformersConverter(model_name_or_path)
+    return converter.convert(output_dir=output_dir, quantization=quantization, force=force)
translation/service.py
@@ -31,7 +31,12 @@ class TranslationService:
         if not self._enabled_capabilities:
             raise ValueError("No enabled translation backends found in services.translation.capabilities")
         self._translation_cache = TranslationCache(self.config["cache"])
-        self._backends = self._initialize_backends()
+        self._backends: Dict[str, TranslationBackendProtocol] = {}
+        self._backend_errors: Dict[str, str] = {}
+        self._initialize_backends()
+        if not self._backends:
+            details = ", ".join(f"{name}: {err}" for name, err in sorted(self._backend_errors.items())) or "unknown error"
+            raise RuntimeError(f"No translation backends could be initialized: {details}")
 
     def _collect_enabled_capabilities(self) -> Dict[str, Dict[str, object]]:
         enabled: Dict[str, Dict[str, object]] = {}
@@ -62,24 +67,47 @@ class TranslationService:
             raise ValueError(f"Unsupported translation backend '{backend_type}' for capability '{name}'")
         return factory(name=name, cfg=cfg)
 
-    def _initialize_backends(self) -> Dict[str, TranslationBackendProtocol]:
-        backends: Dict[str, TranslationBackendProtocol] = {}
-        for name, capability_cfg in self._enabled_capabilities.items():
-            backend_type = str(capability_cfg["backend"])
-            logger.info("Initializing translation backend | model=%s backend=%s", name, backend_type)
-            backends[name] = self._create_backend(
+    def _load_backend(self, name: str) -> Optional[TranslationBackendProtocol]:
+        capability_cfg = self._enabled_capabilities.get(name)
+        if capability_cfg is None:
+            return None
+        if name in self._backends:
+            return self._backends[name]
+
+        backend_type = str(capability_cfg["backend"])
+        logger.info("Initializing translation backend | model=%s backend=%s", name, backend_type)
+        try:
+            backend = self._create_backend(
                 name=name,
                 backend_type=backend_type,
                 cfg=capability_cfg,
             )
-            logger.info(
-                "Translation backend initialized | model=%s backend=%s use_cache=%s backend_model=%s",
+        except Exception as exc:
+            error_text = str(exc).strip() or exc.__class__.__name__
+            self._backend_errors[name] = error_text
+            logger.error(
+                "Translation backend initialization failed | model=%s backend=%s error=%s",
                 name,
                 backend_type,
-                bool(capability_cfg.get("use_cache")),
-                getattr(backends[name], "model", name),
+                error_text,
+                exc_info=True,
             )
-        return backends
+            return None
+
+        self._backends[name] = backend
+        self._backend_errors.pop(name, None)
+        logger.info(
+            "Translation backend initialized | model=%s backend=%s use_cache=%s backend_model=%s",
+            name,
+            backend_type,
+            bool(capability_cfg.get("use_cache")),
+            getattr(backend, "model", name),
+        )
+        return backend
+
+    def _initialize_backends(self) -> None:
+        for name, capability_cfg in self._enabled_capabilities.items():
+            self._load_backend(name)
 
     def _create_qwen_mt_backend(self, *, name: str, cfg: Dict[str, object]) -> TranslationBackendProtocol:
         from translation.backends.qwen_mt import QwenMTTranslationBackend
@@ -178,13 +206,27 @@ class TranslationService: @@ -178,13 +206,27 @@ class TranslationService:
178 def loaded_models(self) -> List[str]: 206 def loaded_models(self) -> List[str]:
179 return list(self._backends.keys()) 207 return list(self._backends.keys())
180 208
  209 + @property
  210 + def failed_models(self) -> List[str]:
  211 + return list(self._backend_errors.keys())
  212 +
  213 + @property
  214 + def backend_errors(self) -> Dict[str, str]:
  215 + return dict(self._backend_errors)
  216 +
181 def get_backend(self, model: Optional[str] = None) -> TranslationBackendProtocol: 217 def get_backend(self, model: Optional[str] = None) -> TranslationBackendProtocol:
182 normalized = normalize_translation_model(self.config, model) 218 normalized = normalize_translation_model(self.config, model)
183 - backend = self._backends.get(normalized) 219 + backend = self._backends.get(normalized) or self._load_backend(normalized)
184 if backend is None: 220 if backend is None:
185 - raise ValueError(  
186 - f"Translation model '{normalized}' is not enabled. "  
187 - f"Available models: {', '.join(self.available_models) or 'none'}" 221 + if normalized not in self._enabled_capabilities:
  222 + raise ValueError(
  223 + f"Translation model '{normalized}' is not enabled. "
  224 + f"Available models: {', '.join(self.available_models) or 'none'}"
  225 + )
  226 + error_text = self._backend_errors.get(normalized) or "unknown initialization error"
  227 + raise RuntimeError(
  228 + f"Translation model '{normalized}' failed to initialize: {error_text}. "
  229 + f"Loaded models: {', '.join(self.loaded_models) or 'none'}"
188 ) 230 )
189 return backend 231 return backend
190 232
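The `service.py` changes replace eager all-or-nothing initialization with lazy loading: failures are recorded per backend and retried on the next `get_backend` lookup instead of crashing the whole service. A condensed, stdlib-only sketch of that lazy-load-with-error-cache pattern (the class and factory names below are illustrative, not the service's API):

```python
from typing import Callable, Dict, Optional

class LazyBackends:
    def __init__(self, factories: Dict[str, Callable[[], object]]):
        self._factories = factories
        self._backends: Dict[str, object] = {}  # successful initializations
        self._errors: Dict[str, str] = {}       # last failure message per backend

    def get(self, name: str) -> Optional[object]:
        if name in self._backends:
            return self._backends[name]
        factory = self._factories.get(name)
        if factory is None:
            return None  # not an enabled capability
        try:
            backend = factory()
        except Exception as exc:
            self._errors[name] = str(exc) or exc.__class__.__name__
            return None  # failure recorded; the next lookup retries
        self._backends[name] = backend
        self._errors.pop(name, None)  # clear stale error on success
        return backend

# A factory that fails once (e.g. an incomplete model download), then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RuntimeError("model download incomplete")
    return "backend-ready"

svc = LazyBackends({"nllb": flaky})
print(svc.get("nllb"))  # -> None (first attempt fails, error cached)
print(svc.get("nllb"))  # -> backend-ready (retried on the next lookup)
```

Keeping the error text around is what lets the real `get_backend` distinguish "not enabled" (`ValueError`) from "enabled but failed to initialize" (`RuntimeError` with the cached cause).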