Compare View

Commits (19)
  • Previously, both `b` and `k1` were set to `0.0`. The original intention
    was to avoid two common issues in e-commerce search relevance:
    
    1. Over-penalizing longer product titles
       In product search, a shorter title should not automatically rank
    higher just because BM25 favors shorter fields. For example, for a query
    like “遥控车”, a product whose title is simply “遥控车” is not
    necessarily a better candidate than a product with a slightly longer but
    more descriptive title. In practice, extremely short titles may even
    indicate lower-quality catalog data.
    
    2. Over-rewarding repeated occurrences of the same term
       For longer queries such as “遥控喷雾翻滚多功能车玩具车”, the default
    BM25 behavior may give too much weight to a term that appears multiple
    times (for example “遥控”), even when other important query terms such
    as “喷雾” or “翻滚” are missing. This can cause products with repeated
    partial matches to outrank products that actually cover more of the user
    intent.
    
    Setting both parameters to zero was an intentional way to suppress
    length normalization and term-frequency amplification. However, after
    introducing a `combined_fields` query, this configuration becomes too
    aggressive. Since `combined_fields` scores multiple fields as a unified
    relevance signal, completely disabling both effects may also remove
    useful ranking information, especially when we still want documents
    matching more query terms across fields to be distinguishable from
    weaker matches.
    
    This update therefore relaxes the previous setting and reintroduces a
    controlled amount of BM25 normalization/scoring behavior. The goal is to
    keep the original intent — avoiding short-title bias and excessive
    repeated-term gain — while allowing the combined query to better
    preserve meaningful relevance differences across candidates.
    
    Expected effect:
    - reduce the bias toward unnaturally short product titles
    - limit score inflation caused by repeated occurrences of the same term
    - improve ranking stability for `combined_fields` queries
    - better reward candidates that cover more of the overall query intent,
      instead of those that only repeat a subset of terms
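    The relaxed configuration can be sketched as an Elasticsearch
    index-settings fragment. This is a minimal illustration only: the
    similarity name `title_bm25` and the concrete `b`/`k1` values are
    assumptions, not the repository's actual numbers.

    ```python
    # Hypothetical index-settings fragment; "title_bm25" and the exact
    # b/k1 values are illustrative, not taken from the repository config.
    RELAXED_BM25_SETTINGS = {
        "index": {
            "similarity": {
                "title_bm25": {
                    "type": "BM25",
                    # Previously b=0.0 and k1=0.0: length normalization and
                    # term-frequency saturation were both fully disabled.
                    "b": 0.3,   # partial length normalization (ES default: 0.75)
                    "k1": 0.4,  # mild term-frequency gain (ES default: 1.2)
                }
            }
        }
    }

    def bm25_params(settings: dict, name: str):
        """Read (b, k1) for a named similarity from a settings fragment."""
        sim = settings["index"]["similarity"][name]
        return sim["b"], sim["k1"]
    ```

    Nonzero but below-default values keep a controlled amount of length
    normalization and term-frequency gain, which is the compromise the
    commit describes for `combined_fields` queries.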
    tangwang
     
  • Field generation
    
    - Added taxonomy attribute enrichment, following the same field
      structure and processing logic as enriched_attributes; only the
      prompt and the parsed dimensions differ
    - Introduced an AnalysisSchema abstraction so content enrichment and
      taxonomy enrichment share batching, caching, prompt construction,
      Markdown parsing, and normalization
    - Refactored the enrichment pipeline in product_enrich.py, extracting
      shared logic into functions such as _process_batch_for_schema and
      _parse_markdown_to_attributes to eliminate duplication
    - Added the taxonomy prompt template (TAXONOMY_ANALYSIS_PROMPT) and the
      Markdown header definitions (TAXONOMY_HEADERS) to
      product_enrich_prompts.py
    - Fixed the Markdown parser's handling of empty cells: the old
      implementation skipped them, shifting columns out of alignment;
      empty values are now preserved so sparse taxonomy attribute columns
      line up correctly
    - Updated build_index_content_fields in document_transformer.py to
      write enriched_taxonomy_attributes (zh/en) into the final index
      document
    - Adjusted the related unit tests (test_product_enrich_partial_mode.py
      etc.) to cover the new field paths; all pass (14 passed)
    
    Technical details:
    - AnalysisSchema carries metadata such as schema_name, prompt_template,
      headers, and field_name_prefix
    - Cache keys distinguish content from taxonomy:
      `enrich:{schema_name}:{product_id}`, avoiding cache pollution
    - Taxonomy parsing uses the same nested structure as
      enriched_attributes: `{"attribute_key": "value"}`, with multi-row
      table support
    - Batch size and retry logic match the existing content enrichment
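    The shared abstraction and its namespaced cache key can be sketched as
    follows. Attribute names follow the commit message; the real class in
    `product_enrich.py` may carry more metadata.

    ```python
    from dataclasses import dataclass
    from typing import List

    @dataclass(frozen=True)
    class AnalysisSchema:
        """Minimal sketch of the shared-schema abstraction (names from the
        commit message; illustrative, not the repository's exact class)."""
        schema_name: str
        prompt_template: str
        headers: List[str]
        field_name_prefix: str

        def cache_key(self, product_id: str) -> str:
            # Keys are namespaced per schema so content and taxonomy
            # results never pollute each other's cache entries.
            return f"enrich:{self.schema_name}:{product_id}"

    CONTENT = AnalysisSchema("content", "<content prompt>",
                             ["attribute", "value"], "enriched")
    TAXONOMY = AnalysisSchema("taxonomy", "<taxonomy prompt>",
                              ["attribute", "value"], "enriched_taxonomy")
    ```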
    tangwang
     
  • - The `/indexer/enrich-content` route now returns
      `enriched_taxonomy_attributes` alongside `enriched_attributes`
    - Added an optional request parameter `analysis_kinds` (default
      `["content", "taxonomy"]`) so callers can select analysis types on
      demand, leaving room for future extension and cost control
    - Reworked the cache strategy: the `content` and `taxonomy` caches are
      fully isolated, and the cache key includes the prompt template,
      headers, and output field definitions (the schema fingerprint), so
      changes to prompts or parsing rules invalidate entries automatically
    - Cache keys now depend only on the fields that actually feed the LLM
      (`title`, `brief`, `description`); `image_url`, `tenant_id`, and
      `spu_id` no longer pollute the key, improving hit rates
    - Updated the API docs
      (`docs/搜索API对接指南-05-索引接口(Indexer).md`) to describe the new
      parameter and response fields
    
    Technical details:
    - Routing layer: the enrich-content endpoint in
      `api/routes/indexer.py` now explicitly adds the
      `enriched_taxonomy_attributes` field returned by
      `product_enrich.enrich_products_batch` to the HTTP response body
    - `analysis_kinds` is passed through to the underlying
      `enrich_products_batch`, so one analysis type can be skipped on
      demand (e.g. fewer LLM calls when only taxonomy is needed)
    - Cache fingerprints are computed in `_get_cache_key` in
      `product_enrich.py`, independently per `AnalysisSchema`; versioning
      is implicit via `schema.version` or a hash of the prompt content
    - Test coverage: added `analysis_kinds` combination scenarios and
      cache-isolation tests
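    The fingerprinted key can be sketched like this. It is a hypothetical
    reconstruction of `_get_cache_key`: the schema fingerprint covers the
    prompt and headers, and only the fields that reach the LLM enter the
    payload hash; the hashing scheme itself is an assumption.

    ```python
    import hashlib
    import json

    def _get_cache_key(schema_name: str, prompt_template: str,
                       headers: list, item: dict) -> str:
        """Sketch: schema fingerprint + hash of LLM-visible fields only."""
        schema_fp = hashlib.sha256(
            json.dumps({"prompt": prompt_template, "headers": headers},
                       sort_keys=True).encode()
        ).hexdigest()[:12]
        payload_fp = hashlib.sha256(
            json.dumps(
                {k: item.get(k, "") for k in ("title", "brief", "description")},
                sort_keys=True, ensure_ascii=False,
            ).encode()
        ).hexdigest()[:16]
        # tenant_id / spu_id / image_url are deliberately absent from the key.
        return f"enrich:{schema_name}:{schema_fp}:{payload_fp}"
    ```

    With this shape, identical titles across tenants share a cache entry,
    while any prompt or header change rotates the schema fingerprint and
    invalidates old entries.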
    tangwang
     
  • category_taxonomy_profile
    
    - The old analysis_kinds mixed "enrichment type" (content/taxonomy)
      with "category-specific configuration", which made it hard to extend
      taxonomy analysis to other categories (3C, home goods, etc.)
    - Added an enrichment_scopes parameter: generic (general enrichment,
      producing qanchors/enriched_tags/enriched_attributes) and
      category_taxonomy (category enrichment, producing
      enriched_taxonomy_attributes)
    - Added a category_taxonomy_profile parameter selecting which profile
      category enrichment uses (currently apparel is built in); each
      profile has its own prompt, output column definitions, parsing
      rules, and cache version
    - Kept analysis_kinds as a compatibility alias so existing callers
      keep working
    - Refactored the internal taxonomy analysis into a profile-registry
      pattern: a new _get_taxonomy_schema(profile_name) function returns
      the AnalysisSchema for a given profile
    - Cache keys are now isolated by "analysis type + profile + schema
      fingerprint + input-field hash", so different categories and prompt
      versions invalidate independently
    - Updated the API docs and microservice interface docs with the new
      parameter semantics and usage examples
    
    Technical details:
    - Entry point: the enrich-content endpoint in api/routes/indexer.py
      parses the new parameters and passes them down
    - Core logic: enrich_products_batch in indexer/product_enrich.py gains
      a profile parameter; _process_batch_for_schema fetches the schema
      dynamically from scope and profile
    - Compatibility layer: if a request also provides analysis_kinds, it
      is mapped to enrichment_scopes (content→generic,
      taxonomy→category_taxonomy), with category_taxonomy_profile
      defaulting to "apparel"
    - Test coverage: added enrichment_scopes combinations, profile
      switching, and compatibility-mode tests
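    The compatibility layer reduces to a small resolution function. A
    minimal sketch, assuming the mapping described above (the function
    name is illustrative):

    ```python
    def resolve_enrichment_scopes(enrichment_scopes=None, analysis_kinds=None):
        """Explicit enrichment_scopes wins; legacy analysis_kinds values
        are mapped (content -> generic, taxonomy -> category_taxonomy);
        the default runs both scopes."""
        if enrichment_scopes:
            return list(enrichment_scopes)
        if analysis_kinds:
            legacy_map = {"content": "generic", "taxonomy": "category_taxonomy"}
            return [legacy_map[kind] for kind in analysis_kinds]
        return ["generic", "category_taxonomy"]
    ```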
    tangwang
     
  • This iteration substantially refactors the content-enrichment module
    of the retrieval system, extending the previously hard-coded
    apparel-only category support to every category defined in
    taxonomy.md, while restructuring the code to lower the cost of adding
    new categories. The core design is a profile registry: products are
    batched per category profile, with an explicit split between bilingual
    (zh+en) and English-only (en) output strategies.
    
    Changes:
    
    1. Expanded category support
       - Newly supported categories: 3c, bags, pet_supplies, electronics,
         outdoor, home_appliances, home_living, wigs, beauty, accessories,
         toys, shoes, sports, others
       - All new categories return only the en field in the taxonomy
         output, avoiding multilingual field bloat
       - The apparel category keeps bilingual output (zh + en) for
         backward compatibility
    
    2. Core refactoring
       - `indexer/product_enrich.py`
         - Added a `TAXONOMY_PROFILES` registry that defines each
           category's output languages, prompt mapping, and taxonomy field
           set in a data-driven way
         - Rewrote `_enrich_taxonomy_batch` to call the LLM in per-profile
           batches, avoiding a dedicated branch per category
         - Introduced `_infer_profile_from_category()`, which infers the
           profile from an SPU's category field (used on the internal
           indexing path, fixing the mixed-catalog default fallback to
           apparel)
       - `indexer/product_enrich_prompts.py`
         - Refactored the single apparel prompt into a `PROMPT_TEMPLATES`
           dict keyed by profile
         - All non-apparel categories share one trimmed prompt template
           that only asks for en fields
       - `indexer/document_transformer.py`
         - Passes category information when building enrichment requests,
           so downstream code can route by profile
         - Adjusted `_build_enrich_batch` so batched requests support
           mixed categories and group correctly
       - `indexer/indexer.py` (API layer)
         - The `/indexer/enrich-content` request model gains an optional
           `category_profile` field for explicit category selection; when
           unset, the server infers it
         - Updated parameter validation and error handling, adding support
           for fallback categories such as `others`
    
    3. Documentation updates
       - `docs/搜索API对接指南-05-索引接口(Indexer).md`: documents the
         category profile parameter and notes that non-apparel categories
         return en-only taxonomy fields
       - `docs/搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md`:
         updates the enrichment microservice call examples to reflect
         per-category batched processing
       - `taxonomy.md`: adds each category's field list and states that en
         is the only output for non-apparel categories
    
    Technical details:
    
    - **Registry design**:
      ```python
      TAXONOMY_PROFILES = {
          "apparel": {"lang": ["zh", "en"], "prompt_key": "apparel",
      "fields": [...]},
          "3c": {"lang": ["en"], "prompt_key": "default", "fields": [...]},
          # ...
      }
      ```
      Adding a category only requires one registry entry plus a matching
      prompt_key in `PROMPT_TEMPLATES`; no control-flow changes.
    
    - **Per-profile batched processing**:
      - Old behavior: all products shared the single apparel prompt, so
        non-apparel products were filled incorrectly.
      - After the refactor: `_enrich_taxonomy_batch` first groups products
        by profile, builds one LLM request per group, and merges responses
        back in the original order. The grouping granularity is
        configurable to avoid excessive request overhead from tiny groups.
    
    - **Automatic category inference**:
      - For internal indexing (when the enrichment API is not called
        explicitly), `_infer_profile_from_category` parses the SPU's
        `category_l1/l2/l3` fields and maps them to the best-matching
        profile. Mapping is keyword-based (e.g. "手机" -> "3c",
        "狗粮" -> "pet_supplies"); unmatched categories fall back to
        `apparel` for a smooth transition.
    
    - **Output field trimming**:
      - Because the Elasticsearch mapping stores a single value in
        `enriched_taxonomy_attributes.value` (no per-language split),
        non-apparel LLM output is written straight into that field;
        apparel uses the dynamic `value.zh` and `value.en` templates. A
        single `_apply_lang_output` function handles both cases.
    
    - **Code size and maintainability**:
      - Total lines grew slightly (~+180) due to the new category
        definitions, but conditional branches dropped from 5 to 1 (the
        profile lookup). Adding a category averages 3 registry lines plus
        a 10-line prompt template, with no changes to the core enrichment
        loop.
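    The keyword-based inference can be sketched as follows. Only the two
    mappings mentioned above ("手机" -> 3c, "狗粮" -> pet_supplies) come
    from the commit message; the rest of the table and the signature are
    illustrative.

    ```python
    # Illustrative keyword table; the real mapping rules live in
    # indexer/product_enrich.py and are more extensive.
    _CATEGORY_KEYWORDS = {
        "3c": ("手机",),
        "pet_supplies": ("狗粮",),
    }

    def _infer_profile_from_category(*category_levels: str) -> str:
        """Scan category_l1/l2/l3 for known keywords; fall back to apparel."""
        joined = "/".join(filter(None, category_levels))
        for profile, keywords in _CATEGORY_KEYWORDS.items():
            if any(kw in joined for kw in keywords):
                return profile
        return "apparel"  # unmatched catalogs keep the legacy apparel behavior
    ```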
    
    Affected files:
    - `indexer/product_enrich.py`
    - `indexer/product_enrich_prompts.py`
    - `indexer/document_transformer.py`
    - `indexer/indexer.py`
    - `docs/搜索API对接指南-05-索引接口(Indexer).md`
    - `docs/搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md`
    - `taxonomy.md`
    - `tests/test_product_enrich_partial_mode.py` (adapted for
      multi-profile test cases)
    - `tests/test_llm_enrichment_batch_fill.py`
    - `tests/test_process_products_batching.py`
    
    Testing:
    - Ran unit and integration tests: `pytest
      tests/test_product_enrich_partial_mode.py
      tests/test_llm_enrichment_batch_fill.py
      tests/test_process_products_batching.py
      tests/ci/test_service_api_contracts.py`; all passed (52 passed)
    - Manually verified the mixed-catalog scenario: submitting apparel and
      3c products together, apparel returns bilingual output, 3c returns
      en only, and the taxonomy fields are filled correctly.
    - Compile check: `py_compile` reports no syntax errors in the modified
      modules.
    
    Notes:
    - This refactor does not change existing apparel behavior; the API is
      backward compatible (an unspecified profile still means apparel).
    - To add bilingual support for another category later, only the `lang`
      list in the registry and a prompt template need changing; no other
      logic is touched.
    tangwang
     
  • 2. Removed the automatic taxonomy-profile inference logic;
       build_index_content_fields() now only accepts an explicitly passed
       category_taxonomy_profile
    3. All taxonomy profiles now output zh/en; the per-industry
       language-switching logic is removed
    tangwang
     
  • Background:
    - The scripts/ directory mixed service startup, data conversion,
      performance/load testing, one-off scripts, and historical backup
      directories
    - It carried a lot of leftover intermediate-iteration material, which
      hurts maintainability and onboarding
    - Service orchestration has stabilized around the service_ctl up all
      set: tei / cnclip / embedding / embedding-image / translator /
      reranker / backend / indexer / frontend / eval-web, with no default
      slot for reranker-fine anymore
    
    Changes:
    1. The root scripts/ is narrowed to runtime, operations, environment,
       and data-processing scripts, with a new scripts/README.md
    2. Performance/load/tuning scripts moved wholesale to benchmarks/,
       with benchmarks/README.md updated accordingly
    3. Manual smoke scripts moved to tests/manual/, with
       tests/manual/README.md updated accordingly
    4. Clearly obsolete content deleted:
       - scripts/indexer__old_2025_11/
       - scripts/start.sh
       - scripts/install_server_deps.sh
    5. Paths and stale descriptions corrected in:
       - the root README.md
       - performance-report documents
       - the reranker/translation module docs
    
    Technical details:
    - Why performance tests don't live in the regular tests/: they depend
      on real services, GPUs, models, and environmental noise, so they are
      unsuitable as a stable regression gate; benchmarks/ fits their
      purpose better
    - tests/manual/ holds only interface smoke scripts that need manually
      started dependencies and human inspection of results
    - All moved Python scripts pass py_compile syntax checks
    - All moved shell scripts pass bash -n syntax checks
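    The py_compile check over the moved scripts can be sketched as a small
    helper (a sketch, not the script actually used in the repo):

    ```python
    import pathlib
    import py_compile

    def check_python_syntax(root: str) -> list:
        """Compile every .py file under root, as the migration check above
        does; return the paths that fail to parse."""
        failures = []
        for path in sorted(pathlib.Path(root).rglob("*.py")):
            try:
                py_compile.compile(str(path), doraise=True)
            except py_compile.PyCompileError:
                failures.append(str(path))
        return failures
    ```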
    
    Validation results:
    - py_compile: pass
    - bash -n: pass
    tangwang
     
  •   - Data conversion moved to scripts/data_import/README.md
      - Diagnostics/inspection moved to scripts/inspect/README.md
      - Operations helpers moved to scripts/ops/README.md
      - The frontend helper server moved to scripts/frontend/frontend_server.py
      - Translation model download moved to
        scripts/translation/download_translation_models.py
      - The one-off image-embedding backfill script consolidated into
        scripts/maintenance/embed_tenant_image_urls.py
      - The Redis monitoring script merged under redis/, now
        scripts/redis/monitor_eviction.py
    
      All real call sites were updated to the new locations:
    
      - scripts/start_frontend.sh
      - scripts/start_cnclip_service.sh
      - scripts/service_ctl.sh
      - scripts/setup_translator_venv.sh
      - scripts/README.md
    
      Documentation paths referencing these scripts were fixed as well,
      mainly docs/QUICKSTART.md and translation/README.md.
    tangwang
     
  • Added service_enabled_by_config(): if reranker, reranker-fine, or
    translator is disabled in the config, run.sh all no longer starts
    that service
    tangwang
     
  • Background and problem
    - The current coarse/rerank stages rely on `knn_query` and `image_knn_query` scores, but these come from ANN recall: not every document that enters the rerank_window (160) was recalled by both the text and image vector paths, so some documents get a score of 0, destabilizing the fusion formula.
    - Simply enlarging the ANN k cannot guarantee that documents from lexical recall also carry both vector scores; a second query, or pulling vectors back for local computation, adds cost and implementation complexity.
    
     Solution
    Use the ES rescore mechanism: within the first search's `window_size`, run an exact vector script_score per document and attach the scores as named queries in `matched_queries`, for the coarse/rerank stages to consume preferentially.
    
    **Design decisions**:
    - **Fill scores only, don't reorder**: rescore uses `score_mode: total` with `rescore_query_weight: 0.0`, so the original `_score` is untouched; this avoids disturbing the existing ranking logic and minimizes risk.
    - **Named exact scores**: `exact_text_knn_query` and `exact_image_knn_query`, easy for clients to identify and fall back from.
    - **Configurable**: an `exact_knn_rescore_enabled` switch and `exact_knn_rescore_window` control the window size, default 160.
    
     Technical implementation details
    
     1. Configuration (`config/config.yaml`, `config/loader.py`)
    ```yaml
    exact_knn_rescore_enabled: true
    exact_knn_rescore_window: 160
    ```
    New settings, injected into `RerankConfig`.
    
     2. Searcher builds the rescore query (`search/searcher.py`)
    - In `_build_es_search_request`, when `enable_rerank=True` and the switch is on, build the rescore object:
      - `window_size` = `exact_knn_rescore_window`
      - `query` is a `bool` query embedding two `script_score` subqueries that compute the exact dot-product similarity for the text and image vectors:
        ```painless
        // exact_text_knn_query
        (dotProduct(params.query_vector, 'title_embedding') + 1.0) / 2.0
        // exact_image_knn_query
        (dotProduct(params.image_query_vector, 'image_embedding.vector') + 1.0) / 2.0
        ```
      - Each `script_score` sets `_name` to the corresponding named query.
    - Note: the current scripts **do not yet multiply by knn_text_boost / knn_image_boost**; aligning their scale with the raw ANN scores is a follow-up item.
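    The rescore clause described above can be sketched as a request
    fragment. This is a minimal sketch: the real builder in
    `search/searcher.py` may shape it differently, and the `_name`
    placement inside `script_score` is an assumption.

    ```python
    def build_exact_knn_rescore(window_size, query_vector, image_query_vector):
        """Fill-only rescore: score_mode 'total' with rescore_query_weight
        0.0 leaves the original _score untouched, while the named
        script_score clauses surface exact dot products in matched_queries."""
        def exact_clause(name, param, field, vector):
            return {
                "script_score": {
                    "query": {"exists": {"field": field}},
                    "script": {
                        "source": f"(dotProduct(params.{param}, '{field}') + 1.0) / 2.0",
                        "params": {param: vector},
                    },
                    "_name": name,  # assumed placement of the named query
                }
            }

        return {
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "bool": {
                        "should": [
                            exact_clause("exact_text_knn_query", "query_vector",
                                         "title_embedding", query_vector),
                            exact_clause("exact_image_knn_query", "image_query_vector",
                                         "image_embedding.vector", image_query_vector),
                        ]
                    }
                },
                "query_weight": 1.0,
                "rescore_query_weight": 0.0,  # scores attached, never mixed in
                "score_mode": "total",
            },
        }
    ```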
    
     3. RerankClient prefers exact scores (`search/rerank_client.py`)
    - `_extract_coarse_signals` reads the `exact_text_knn_query` and `exact_image_knn_query` scores from each document's `matched_queries`.
    - When present and valid, they are used as `text_knn_score` / `image_knn_score`, with `text_knn_source='exact_text_knn_query'` set as a marker.
    - Otherwise it falls back to the original `knn_query` / `image_knn_query` (ANN scores).
    - The raw ANN scores are preserved as `approx_text_knn_score` / `approx_image_knn_score` for debugging comparisons.
    
     4. Debug output
    - `debug_info.per_result[*].ranking_funnel.coarse_rank.signals` reports the exact scores, fallback scores, and source markers, making coverage and score distribution observable in production.
    
     Validation
    - Unit tests `tests/test_rerank_client.py` and `tests/test_search_rerank_window.py` pass, covering exact-score precedence, config parsing, and ES request-body structure.
    - Sampling real production queries (6 queries, top160) shows:
      - **Exact coverage reaches 100%** (both text and image scores present), fixing the missing-ANN-score problem.
      - But exact and raw ANN scores differ in magnitude (median ANN/exact ratio ≈ 4.1×), because the exact scripts do not multiply by the boost factors.
    - Current ranking impact: coarse top-10 overlap drops as low as 1/10, with maximum rank drift above 100.
    
     Follow-ups
    1. Align the exact and ANN score scales: multiply `knn_text_boost` / `knn_image_boost` into the script_score, plus an extra 1.4× for long queries.
    2. Re-evaluate top-10 overlap and drift; if they converge, the coarse fusion formula can be moved wholly into the ES rescore stage.
    3. The current version keeps the safe "fill scores only, don't reorder" policy and already fixes the core missing-score problem.
    
     Files
    - `config/config.yaml`
    - `config/loader.py`
    - `search/searcher.py`
    - `search/rerank_client.py`
    - `tests/test_rerank_client.py`
    - `tests/test_search_rerank_window.py`
    tangwang
     
  • Changes
    
    1. **New config** (`config/config.yaml`)
       - `exact_knn_rescore_enabled`: enables exact-vector rescoring, default true
       - `exact_knn_rescore_window`: rescore window size, default 160 (decoupled from rerank_window, independently configurable)
    
    2. **ES query layer** (`search/searcher.py`)
       - The first ES search now injects a rescore stage for documents within window_size, when enabled
       - The rescore_query contains two named script_score clauses:
         - `exact_text_knn_query`: exact dot product on the text vector
         - `exact_image_knn_query`: exact dot product on the image vector
       - With `score_mode=total` and `rescore_query_weight=0.0`, this **fills scores without reordering**; exact scores appear only in `matched_queries`
    
    3. **Unified vector-score boost logic** (`search/es_query_builder.py`)
       - New `_get_knn_plan()` method centralizes the boost rules for text/image KNN
       - Long queries (token count above the threshold) get an extra 1.4× text boost
       - Exact rescore and ANN recall **share the same boost rules**, keeping score scales consistent
       - The existing ANN query construction was migrated into this unified entry point
    
    4. **Fusion-stage score priority** (`search/rerank_client.py`)
       - `_build_hit_signal_bundle()` centralizes vector-score reading
       - It prefers `exact_text_knn_query` / `exact_image_knn_query` from `matched_queries`
       - When absent, it falls back to the original `knn_query` / `image_knn_query` (ANN scores)
       - This covers the coarse_rank, fine_rank, and rerank stages, avoiding duplicated patches
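    The exact-first fallback can be sketched as a small extraction helper.
    The function and output key names here are illustrative, not the exact
    ones in `rerank_client.py`; `matched_queries` is assumed to map
    named-query name to score (available because the main query sets
    `include_named_queries_score: true`).

    ```python
    def pick_knn_scores(matched_queries: dict) -> dict:
        """Exact score wins when present; otherwise fall back to the ANN
        score; the raw ANN score is always kept for debugging."""
        out = {}
        for kind, exact_name, ann_name in (
            ("text", "exact_text_knn_query", "knn_query"),
            ("image", "exact_image_knn_query", "image_knn_query"),
        ):
            exact = matched_queries.get(exact_name)
            if exact is not None:
                out[f"{kind}_knn_score"] = exact
                out[f"{kind}_knn_source"] = exact_name
            else:  # exact score missing: fall back to the ANN score
                out[f"{kind}_knn_score"] = matched_queries.get(ann_name, 0.0)
                out[f"{kind}_knn_source"] = ann_name
            # preserve the approximate ANN score for side-by-side debugging
            out[f"approx_{kind}_knn_score"] = matched_queries.get(ann_name)
        return out
    ```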
    
    5. **Test coverage**
       - `tests/test_es_query_builder.py`: verifies that ANN and exact share the boost rules
       - `tests/test_search_rerank_window.py`: verifies the rescore window and correct named-query injection
       - `tests/test_rerank_client.py`: verifies the exact-first, ANN-fallback logic
    
     Technical details
    
    - **Exact vector-scoring script** (Painless)
      ```painless
      // text: (dotProduct + 1.0) / 2.0
      (dotProduct(params.query_vector, 'title_embedding') + 1.0) / 2.0
      // image is analogous, over 'image_embedding.vector'
      ```
      Multiplied by the unified boost (from the `knn_text_boost` / `knn_image_boost` settings plus the long-query factor).
    
    - **Named-query retention**
      - The main query already sets `include_named_queries_score: true`
      - Scores from named rescore scripts are merged into each hit's `matched_queries`
      - `_extract_named_score()` extracts them by name, identical to how the raw ANN scores are accessed
    
    - **Performance impact** (top160, 6 real queries, average of 3 rounds after warm-up)
      - `elasticsearch_search_primary` latency: 124.71ms → 136.60ms (+11.89ms, +9.53%)
      - `total_search` fluctuates with other components, so it is not the primary reference
      - The overhead is acceptable; no timeouts or resource bottlenecks observed
    
     Config example
    
    ```yaml
    search:
      exact_knn_rescore_enabled: true
      exact_knn_rescore_window: 160
      knn_text_boost: 4.0
      knn_image_boost: 4.0
      long_query_token_threshold: 8
      long_query_text_boost_factor: 1.4
    ```
    
     Known issues and next steps
    
    - Tuning experiments show that with exact rescore enabled, some queries (strong type constraints plus many similar styles/colors) lose about 0.031 on the primary metric versus the baseline (exact=false): 0.6009 → 0.5697
    - Root cause: exact scoring turns KNN from a sparse auxiliary signal into a dense ranking factor, changing coarse-stage ranking semantics; adjusting the existing `knn_bias/exponent` alone cannot fully recover the loss
    - Direction for the next iteration: **do not force exact in the coarse stage**; prefer exact only in fine/rerank, or let coarse use "ANN first, exact only fills gaps", then re-evaluate
    
     Related files
    
    - `config/config.yaml`
    - `search/searcher.py`
    - `search/es_query_builder.py`
    - `search/rerank_client.py`
    - `tests/test_es_query_builder.py`
    - `tests/test_search_rerank_window.py`
    - `tests/test_rerank_client.py`
    - `scripts/evaluation/exact_rescore_coarse_tuning_round2.json` (tuning experiment log)
    tangwang
     
Showing 137 changed files

Too many changes.

To preserve performance only 100 of 137 files are displayed.

... ... @@ -4,6 +4,7 @@
4 4 ES_HOST=http://localhost:9200
5 5 ES_USERNAME=saas
6 6 ES_PASSWORD=4hOaLaf41y2VuI8y
  7 +ES_AUTH="${ES_USERNAME}:${ES_PASSWORD}"
7 8  
8 9 # Redis Configuration (Optional) - AI 生产 10.200.16.14:6479
9 10 REDIS_HOST=10.200.16.14
... ...
.env.example
... ... @@ -8,6 +8,7 @@
8 8 ES_HOST=http://localhost:9200
9 9 ES_USERNAME=saas
10 10 ES_PASSWORD=
  11 +ES_AUTH="${ES_USERNAME}:${ES_PASSWORD}"
11 12  
12 13 # Redis (生产默认 10.200.16.14:6479,密码见 docs/QUICKSTART.md §1.6)
13 14 REDIS_HOST=10.200.16.14
... ...
CLAUDE.md
... ... @@ -77,9 +77,11 @@ source activate.sh
77 77 # Generate test data (Tenant1 Mock + Tenant2 CSV)
78 78 ./scripts/mock_data.sh
79 79  
80   -# Ingest data to Elasticsearch
81   -./scripts/ingest.sh <tenant_id> [recreate] # e.g., ./scripts/ingest.sh 1 true
82   -python main.py ingest data.csv --limit 1000 --batch-size 50
  80 +# Create tenant index structure
  81 +./scripts/create_tenant_index.sh <tenant_id>
  82 +
  83 +# Build / refresh suggestion index
  84 +./scripts/build_suggestions.sh <tenant_id> --mode incremental
83 85 ```
84 86  
85 87 ### Running Services
... ... @@ -100,10 +102,10 @@ python main.py serve --host 0.0.0.0 --port 6002 --reload
100 102 # Run all tests
101 103 pytest tests/
102 104  
103   -# Run specific test types
104   -pytest tests/unit/ # Unit tests
105   -pytest tests/integration/ # Integration tests
106   -pytest -m "api" # API tests only
  105 +# Run focused regression sets
  106 +python -m pytest tests/ci -q
  107 +pytest tests/test_rerank_client.py
  108 +pytest tests/test_query_parser_mixed_language.py
107 109  
108 110 # Test search from command line
109 111 python main.py search "query" --tenant-id 1 --size 10
... ... @@ -114,12 +116,8 @@ python main.py search &quot;query&quot; --tenant-id 1 --size 10
114 116 # Stop all services
115 117 ./scripts/stop.sh
116 118  
117   -# Test environment (for CI/development)
118   -./scripts/start_test_environment.sh
119   -./scripts/stop_test_environment.sh
120   -
121   -# Install server dependencies
122   -./scripts/install_server_deps.sh
  119 +# Run CI contract tests
  120 +./scripts/run_ci_tests.sh
123 121 ```
124 122  
125 123 ## Architecture Overview
... ... @@ -585,7 +583,7 @@ GET /admin/stats # Index statistics
585 583 ./scripts/start_frontend.sh # Frontend UI (port 6003)
586 584  
587 585 # Data Operations
588   -./scripts/ingest.sh <tenant_id> [recreate] # Index data
  586 +./scripts/create_tenant_index.sh <tenant_id> # Create tenant index
589 587 ./scripts/mock_data.sh # Generate test data
590 588  
591 589 # Testing
... ...
api/models.py
... ... @@ -154,7 +154,8 @@ class SearchRequest(BaseModel):
154 154 enable_rerank: Optional[bool] = Field(
155 155 None,
156 156 description=(
157   - "是否开启重排(调用外部重排服务对 ES 结果进行二次排序)。"
  157 + "是否开启最终重排(调用外部 rerank 服务改写上一阶段顺序)。"
  158 + "关闭时仍保留 coarse/fine 流程,仅在 rerank 阶段保序透传。"
158 159 "不传则使用服务端配置 rerank.enabled(默认开启)。"
159 160 )
160 161 )
... ...
api/routes/indexer.py
... ... @@ -7,7 +7,7 @@
7 7 import asyncio
8 8 import re
9 9 from fastapi import APIRouter, HTTPException
10   -from typing import Any, Dict, List, Optional
  10 +from typing import Any, Dict, List, Literal, Optional
11 11 from pydantic import BaseModel, Field
12 12 import logging
13 13 from sqlalchemy import text
... ... @@ -19,6 +19,11 @@ logger = logging.getLogger(__name__)
19 19  
20 20 router = APIRouter(prefix="/indexer", tags=["indexer"])
21 21  
  22 +SUPPORTED_CATEGORY_TAXONOMY_PROFILES = (
  23 + "apparel, 3c, bags, pet_supplies, electronics, outdoor, "
  24 + "home_appliances, home_living, wigs, beauty, accessories, toys, shoes, sports, others"
  25 +)
  26 +
22 27  
23 28 class ReindexRequest(BaseModel):
24 29 """全量重建索引请求"""
... ... @@ -88,11 +93,42 @@ class EnrichContentItem(BaseModel):
88 93  
89 94 class EnrichContentRequest(BaseModel):
90 95 """
91   - 内容理解字段生成请求:根据商品标题批量生成 qanchors、enriched_attributes、tags
  96 + 内容理解字段生成请求:根据商品标题批量生成通用增强字段与品类 taxonomy 字段
92 97 供外部 indexer 在自行组织 doc 时调用,与翻译、向量化等微服务并列。
93 98 """
94 99 tenant_id: str = Field(..., description="租户 ID,用于请求路由与结果归属,不参与缓存键")
95 100 items: List[EnrichContentItem] = Field(..., description="待分析的 SPU 列表(spu_id + title,可附带 brief/description/image_url)")
  101 + enrichment_scopes: Optional[List[Literal["generic", "category_taxonomy"]]] = Field(
  102 + default=None,
  103 + description=(
  104 + "要执行的增强范围。"
  105 + "`generic` 返回 qanchors/enriched_tags/enriched_attributes;"
  106 + "`category_taxonomy` 返回 enriched_taxonomy_attributes。"
  107 + "默认两者都执行。"
  108 + ),
  109 + )
  110 + category_taxonomy_profile: str = Field(
  111 + "apparel",
  112 + description=(
  113 + "品类 taxonomy profile。默认 `apparel`。"
  114 + f"当前支持:{SUPPORTED_CATEGORY_TAXONOMY_PROFILES}。"
  115 + "其中除 `apparel` 外,其余 profile 的 taxonomy 输出仅返回 `en`。"
  116 + ),
  117 + )
  118 + analysis_kinds: Optional[List[Literal["content", "taxonomy"]]] = Field(
  119 + default=None,
  120 + description="Deprecated alias of enrichment_scopes. `content` -> `generic`, `taxonomy` -> `category_taxonomy`.",
  121 + )
  122 +
  123 + def resolved_enrichment_scopes(self) -> List[str]:
  124 + if self.enrichment_scopes:
  125 + return list(self.enrichment_scopes)
  126 + if self.analysis_kinds:
  127 + mapped = []
  128 + for item in self.analysis_kinds:
  129 + mapped.append("generic" if item == "content" else "category_taxonomy")
  130 + return mapped
  131 + return ["generic", "category_taxonomy"]
96 132  
97 133  
98 134 @router.post("/reindex")
... ... @@ -440,20 +476,31 @@ async def build_docs_from_db(request: BuildDocsFromDbRequest):
440 476 raise HTTPException(status_code=500, detail=f"Internal server error: {str(e)}")
441 477  
442 478  
443   -def _run_enrich_content(tenant_id: str, items: List[Dict[str, str]]) -> List[Dict[str, Any]]:
  479 +def _run_enrich_content(
  480 + tenant_id: str,
  481 + items: List[Dict[str, str]],
  482 + enrichment_scopes: Optional[List[str]] = None,
  483 + category_taxonomy_profile: str = "apparel",
  484 +) -> List[Dict[str, Any]]:
444 485 """
445 486 同步执行内容理解,返回与 ES mapping 对齐的字段结构。
446 487 语言策略由 product_enrich 内部统一决定,路由层不参与。
447 488 """
448 489 from indexer.product_enrich import build_index_content_fields
449 490  
450   - results = build_index_content_fields(items=items, tenant_id=tenant_id)
  491 + results = build_index_content_fields(
  492 + items=items,
  493 + tenant_id=tenant_id,
  494 + enrichment_scopes=enrichment_scopes,
  495 + category_taxonomy_profile=category_taxonomy_profile,
  496 + )
451 497 return [
452 498 {
453 499 "spu_id": item["id"],
454 500 "qanchors": item["qanchors"],
455 501 "enriched_attributes": item["enriched_attributes"],
456 502 "enriched_tags": item["enriched_tags"],
  503 + "enriched_taxonomy_attributes": item["enriched_taxonomy_attributes"],
457 504 **({"error": item["error"]} if item.get("error") else {}),
458 505 }
459 506 for item in results
... ... @@ -463,15 +510,15 @@ def _run_enrich_content(tenant_id: str, items: List[Dict[str, str]]) -&gt; List[Dic
463 510 @router.post("/enrich-content")
464 511 async def enrich_content(request: EnrichContentRequest):
465 512 """
466   - 内容理解字段生成接口:根据商品标题批量生成 qanchors、enriched_attributes、tags
  513 + 内容理解字段生成接口:根据商品标题批量生成通用增强字段与品类 taxonomy 字段
467 514  
468 515 使用场景:
469 516 - 外部 indexer 采用「微服务组合」方式自己组织 doc 时,可调用本接口获取 LLM 生成的
470 517 锚文本与语义属性,再与翻译、向量化结果合并写入 ES。
471 518 - 与 /indexer/build-docs 解耦,避免 build-docs 因 LLM 耗时过长而阻塞;调用方可
472   - 先拿不含 qanchors/enriched_tags 的 doc,再异步或离线补齐本接口结果后更新 ES。
  519 + 先拿不含 qanchors/enriched_tags/taxonomy attributes 的 doc,再异步或离线补齐本接口结果后更新 ES。
473 520  
474   - 实现逻辑与 indexer.product_enrich.analyze_products 一致,支持多语言与 Redis 缓存。
  521 + 实现逻辑与 indexer.product_enrich.build_index_content_fields 一致,支持多语言与 Redis 缓存。
475 522 """
476 523 try:
477 524 if not request.items:
... ... @@ -493,15 +540,20 @@ async def enrich_content(request: EnrichContentRequest):
493 540 for it in request.items
494 541 ]
495 542 loop = asyncio.get_event_loop()
  543 + enrichment_scopes = request.resolved_enrichment_scopes()
496 544 result = await loop.run_in_executor(
497 545 None,
498 546 lambda: _run_enrich_content(
499 547 tenant_id=request.tenant_id,
500   - items=items_payload
  548 + items=items_payload,
  549 + enrichment_scopes=enrichment_scopes,
  550 + category_taxonomy_profile=request.category_taxonomy_profile,
501 551 ),
502 552 )
503 553 return {
504 554 "tenant_id": request.tenant_id,
  555 + "enrichment_scopes": enrichment_scopes,
  556 + "category_taxonomy_profile": request.category_taxonomy_profile,
505 557 "results": result,
506 558 "total": len(result),
507 559 }
... ...
api/translator_app.py
... ... @@ -271,16 +271,20 @@ async def lifespan(_: FastAPI):
271 271 """Initialize all enabled translation backends on process startup."""
272 272 logger.info("Starting Translation Service API")
273 273 service = get_translation_service()
  274 + failed_models = list(getattr(service, "failed_models", []))
  275 + backend_errors = dict(getattr(service, "backend_errors", {}))
274 276 logger.info(
275   - "Translation service ready | default_model=%s default_scene=%s available_models=%s loaded_models=%s",
  277 + "Translation service ready | default_model=%s default_scene=%s available_models=%s loaded_models=%s failed_models=%s",
276 278 service.config["default_model"],
277 279 service.config["default_scene"],
278 280 service.available_models,
279 281 service.loaded_models,
  282 + failed_models,
280 283 )
281 284 logger.info(
282   - "Translation backends initialized on startup | models=%s",
  285 + "Translation backends initialized on startup | loaded=%s failed=%s",
283 286 service.loaded_models,
  287 + backend_errors,
284 288 )
285 289 verbose_logger.info(
286 290 "Translation startup detail | capabilities=%s cache_ttl_seconds=%s cache_sliding_expiration=%s",
... ... @@ -316,11 +320,14 @@ async def health_check():
316 320 """Health check endpoint."""
317 321 try:
318 322 service = get_translation_service()
  323 + failed_models = list(getattr(service, "failed_models", []))
  324 + backend_errors = dict(getattr(service, "backend_errors", {}))
319 325 logger.info(
320   - "Health check | default_model=%s default_scene=%s loaded_models=%s",
  326 + "Health check | default_model=%s default_scene=%s loaded_models=%s failed_models=%s",
321 327 service.config["default_model"],
322 328 service.config["default_scene"],
323 329 service.loaded_models,
  330 + failed_models,
324 331 )
325 332 return {
326 333 "status": "healthy",
... ... @@ -330,6 +337,8 @@ async def health_check():
330 337 "available_models": service.available_models,
331 338 "enabled_capabilities": get_enabled_translation_models(service.config),
332 339 "loaded_models": service.loaded_models,
  340 + "failed_models": failed_models,
  341 + "backend_errors": backend_errors,
333 342 }
334 343 except Exception as e:
335 344 logger.error(f"Health check failed: {e}")
... ... @@ -463,6 +472,10 @@ async def translate(request: TranslationRequest, http_request: Request):
463 472 latency_ms = (time.perf_counter() - request_started) * 1000
464 473 logger.warning("Translation validation error | error=%s latency_ms=%.2f", e, latency_ms)
465 474 raise HTTPException(status_code=400, detail=str(e)) from e
  475 + except RuntimeError as e:
  476 + latency_ms = (time.perf_counter() - request_started) * 1000
  477 + logger.warning("Translation backend unavailable | error=%s latency_ms=%.2f", e, latency_ms)
  478 + raise HTTPException(status_code=503, detail=str(e)) from e
466 479 except Exception as e:
467 480 latency_ms = (time.perf_counter() - request_started) * 1000
468 481 logger.error("Translation error | error=%s latency_ms=%.2f", e, latency_ms, exc_info=True)
... ...
benchmarks/README.md 0 → 100644
... ... @@ -0,0 +1,17 @@
  1 +# Benchmarks
  2 +
  3 +基准压测脚本统一放在 `benchmarks/`,不再和 `scripts/` 里的服务启动/运维脚本混放。
  4 +
  5 +目录约定:
  6 +
  7 +- `benchmarks/perf_api_benchmark.py`:通用 HTTP 接口压测入口
  8 +- `benchmarks/reranker/`:reranker 定向 benchmark、smoke、手工对比脚本
  9 +- `benchmarks/translation/`:translation 本地模型 benchmark
  10 +
  11 +这些脚本默认不是 CI 测试的一部分,因为它们通常具备以下特征:
  12 +
  13 +- 依赖真实服务、GPU、模型或特定数据集
  14 +- 结果受机器配置和运行时负载影响,不适合作为稳定回归门禁
  15 +- 更多用于容量评估、调参和问题复现,而不是功能正确性判定
  16 +
  17 +如果某个性能场景需要进入自动化回归,应新增到 `tests/` 下并明确收敛输入、环境和判定阈值,而不是直接复用这里的基准脚本。
... ...
scripts/perf_api_benchmark.py renamed to benchmarks/perf_api_benchmark.py
... ... @@ -11,13 +11,13 @@ Default scenarios (aligned with docs/搜索API对接指南 分册,如 -01 / -0
11 11 - rerank POST /rerank
12 12  
13 13 Examples:
14   - python scripts/perf_api_benchmark.py --scenario backend_search --duration 30 --concurrency 20 --tenant-id 162
15   - python scripts/perf_api_benchmark.py --scenario backend_suggest --duration 30 --concurrency 50 --tenant-id 162
16   - python scripts/perf_api_benchmark.py --scenario all --duration 60 --concurrency 80 --tenant-id 162
17   - python scripts/perf_api_benchmark.py --scenario all --cases-file scripts/perf_cases.json.example --output perf_result.json
  14 + python benchmarks/perf_api_benchmark.py --scenario backend_search --duration 30 --concurrency 20 --tenant-id 162
  15 + python benchmarks/perf_api_benchmark.py --scenario backend_suggest --duration 30 --concurrency 50 --tenant-id 162
  16 + python benchmarks/perf_api_benchmark.py --scenario all --duration 60 --concurrency 80 --tenant-id 162
  17 + python benchmarks/perf_api_benchmark.py --scenario all --cases-file benchmarks/perf_cases.json.example --output perf_result.json
18 18 # Embedding admission / priority (query param `priority`; same semantics as embedding service):
19   - python scripts/perf_api_benchmark.py --scenario embed_text --embed-text-priority 1 --duration 30 --concurrency 20
20   - python scripts/perf_api_benchmark.py --scenario embed_image --embed-image-priority 1 --duration 30 --concurrency 10
  19 + python benchmarks/perf_api_benchmark.py --scenario embed_text --embed-text-priority 1 --duration 30 --concurrency 20
  20 + python benchmarks/perf_api_benchmark.py --scenario embed_image --embed-image-priority 1 --duration 30 --concurrency 10
21 21 """
22 22  
23 23 from __future__ import annotations
... ... @@ -229,7 +229,7 @@ def apply_embed_priority_params(
229 229 ) -> None:
230 230 """
231 231 Merge default `priority` query param into embed templates when absent.
232   - `scripts/perf_cases.json` may set per-request `params.priority` to override.
  232 + `benchmarks/perf_cases.json` may set per-request `params.priority` to override.
233 233 """
234 234 mapping = {
235 235 "embed_text": max(0, int(embed_text_priority)),
... ...
scripts/perf_cases.json.example renamed to benchmarks/perf_cases.json.example
scripts/benchmark_reranker_1000docs.sh renamed to benchmarks/reranker/benchmark_reranker_1000docs.sh
... ... @@ -8,7 +8,7 @@
8 8 # Outputs JSON reports under perf_reports/<date>/reranker_1000docs/
9 9 #
10 10 # Usage:
11   -# ./scripts/benchmark_reranker_1000docs.sh
  11 +# ./benchmarks/reranker/benchmark_reranker_1000docs.sh
12 12 # Optional env:
13 13 # BATCH_SIZES="24 32 48 64"
14 14 # C1_REQUESTS=4
... ... @@ -85,7 +85,7 @@ run_bench() {
85 85 local c="$2"
86 86 local req="$3"
87 87 local out="${OUT_DIR}/rerank_bs${bs}_c${c}_r${req}.json"
88   - .venv/bin/python scripts/perf_api_benchmark.py \
  88 + .venv/bin/python benchmarks/perf_api_benchmark.py \
89 89 --scenario rerank \
90 90 --tenant-id "${TENANT_ID}" \
91 91 --reranker-base "${RERANK_BASE}" \
... ...
scripts/benchmark_reranker_gguf_local.py renamed to benchmarks/reranker/benchmark_reranker_gguf_local.py
... ... @@ -8,8 +8,8 @@ Runs the backend directly in a fresh process per config to measure:
8 8 - single-request rerank latency
9 9  
10 10 Example:
11   - ./.venv-reranker-gguf/bin/python scripts/benchmark_reranker_gguf_local.py
12   - ./.venv-reranker-gguf-06b/bin/python scripts/benchmark_reranker_gguf_local.py --backend-name qwen3_gguf_06b --docs 400
  11 + ./.venv-reranker-gguf/bin/python benchmarks/reranker/benchmark_reranker_gguf_local.py
  12 + ./.venv-reranker-gguf-06b/bin/python benchmarks/reranker/benchmark_reranker_gguf_local.py --backend-name qwen3_gguf_06b --docs 400
13 13 """
14 14  
15 15 from __future__ import annotations
... ...
scripts/benchmark_reranker_random_titles.py renamed to benchmarks/reranker/benchmark_reranker_random_titles.py
... ... @@ -10,10 +10,10 @@ Each invocation runs 3 warmup requests with n=400 first; those are not timed for
10 10  
11 11 Example:
12 12 source activate.sh
13   - python scripts/benchmark_reranker_random_titles.py 386
14   - python scripts/benchmark_reranker_random_titles.py 40,80,100
15   - python scripts/benchmark_reranker_random_titles.py 40,80,100 --repeat 3 --seed 42
16   - RERANK_BASE=http://127.0.0.1:6007 python scripts/benchmark_reranker_random_titles.py 200
  13 + python benchmarks/reranker/benchmark_reranker_random_titles.py 386
  14 + python benchmarks/reranker/benchmark_reranker_random_titles.py 40,80,100
  15 + python benchmarks/reranker/benchmark_reranker_random_titles.py 40,80,100 --repeat 3 --seed 42
  16 + RERANK_BASE=http://127.0.0.1:6007 python benchmarks/reranker/benchmark_reranker_random_titles.py 200
17 17 """
18 18  
19 19 from __future__ import annotations
... ...
tests/reranker_performance/curl1.sh renamed to benchmarks/reranker/manual/curl1.sh
tests/reranker_performance/curl1_simple.sh renamed to benchmarks/reranker/manual/curl1_simple.sh
tests/reranker_performance/curl2.sh renamed to benchmarks/reranker/manual/curl2.sh
tests/reranker_performance/rerank_performance_compare.sh renamed to benchmarks/reranker/manual/rerank_performance_compare.sh
scripts/patch_rerank_vllm_benchmark_config.py renamed to benchmarks/reranker/patch_rerank_vllm_benchmark_config.py
... ... @@ -73,7 +73,7 @@ def main() -> int:
73 73 p.add_argument(
74 74 "--config",
75 75 type=Path,
76   - default=Path(__file__).resolve().parent.parent / "config" / "config.yaml",
  76 + default=Path(__file__).resolve().parents[2] / "config" / "config.yaml",
77 77 )
78 78 p.add_argument("--backend", choices=("qwen3_vllm", "qwen3_vllm_score"), required=True)
79 79 p.add_argument(
... ...
scripts/run_reranker_vllm_instruction_benchmark.sh renamed to benchmarks/reranker/run_reranker_vllm_instruction_benchmark.sh
... ... @@ -55,13 +55,13 @@ run_one() {
55 55 local jf="${OUT_DIR}/${backend}_${fmt}.json"
56 56  
57 57 echo "========== ${tag} =========="
58   - "$PYTHON" "${ROOT}/scripts/patch_rerank_vllm_benchmark_config.py" \
  58 + "$PYTHON" "${ROOT}/benchmarks/reranker/patch_rerank_vllm_benchmark_config.py" \
59 59 --backend "$backend" --instruction-format "$fmt"
60 60  
61 61 "${ROOT}/restart.sh" reranker
62 62 wait_health "$backend" "$fmt"
63 63  
64   - if ! "$PYTHON" "${ROOT}/scripts/benchmark_reranker_random_titles.py" \
  64 + if ! "$PYTHON" "${ROOT}/benchmarks/reranker/benchmark_reranker_random_titles.py" \
65 65 100,200,400,600,800,1000 \
66 66 --repeat 5 \
67 67 --seed 42 \
... ... @@ -82,7 +82,7 @@ run_one qwen3_vllm_score compact
82 82 run_one qwen3_vllm_score standard
83 83  
84 84 # Restore repo-default-style rerank settings (score + compact).
85   -"$PYTHON" "${ROOT}/scripts/patch_rerank_vllm_benchmark_config.py" \
  85 +"$PYTHON" "${ROOT}/benchmarks/reranker/patch_rerank_vllm_benchmark_config.py" \
86 86 --backend qwen3_vllm_score --instruction-format compact
87 87 "${ROOT}/restart.sh" reranker
88 88 wait_health qwen3_vllm_score compact
... ...
scripts/smoke_qwen3_vllm_score_backend.py renamed to benchmarks/reranker/smoke_qwen3_vllm_score_backend.py
... ... @@ -3,7 +3,7 @@
3 3 Smoke test: load Qwen3VLLMScoreRerankerBackend (must run as a file, not stdin — vLLM spawn).
4 4  
5 5 Usage (from repo root, score venv):
6   - PYTHONPATH=. ./.venv-reranker-score/bin/python scripts/smoke_qwen3_vllm_score_backend.py
  6 + PYTHONPATH=. ./.venv-reranker-score/bin/python benchmarks/reranker/smoke_qwen3_vllm_score_backend.py
7 7  
8 8 Same as production: vLLM child processes need the venv's ``bin`` on PATH (for pip's ``ninja`` when
9 9 vLLM auto-selects FLASHINFER on T4/Turing). ``start_reranker.sh`` exports that; this script prepends
... ... @@ -20,8 +20,8 @@ import sys
20 20 import sysconfig
21 21 from pathlib import Path
22 22  
23   -# Repo root on sys.path when run as scripts/smoke_*.py
24   -_ROOT = Path(__file__).resolve().parents[1]
  23 +# Repo root on sys.path when run from benchmarks/reranker/.
  24 +_ROOT = Path(__file__).resolve().parents[2]
25 25 if str(_ROOT) not in sys.path:
26 26 sys.path.insert(0, str(_ROOT))
27 27  
... ...
scripts/benchmark_nllb_t4_tuning.py renamed to benchmarks/translation/benchmark_nllb_t4_tuning.py
... ... @@ -11,12 +11,12 @@ from datetime import datetime
11 11 from pathlib import Path
12 12 from typing import Any, Dict, List, Tuple
13 13  
14   -PROJECT_ROOT = Path(__file__).resolve().parent.parent
  14 +PROJECT_ROOT = Path(__file__).resolve().parents[2]
15 15 if str(PROJECT_ROOT) not in sys.path:
16 16 sys.path.insert(0, str(PROJECT_ROOT))
17 17  
18 18 from config.services_config import get_translation_config
19   -from scripts.benchmark_translation_local_models import (
  19 +from benchmarks.translation.benchmark_translation_local_models import (
20 20 benchmark_concurrency_case,
21 21 benchmark_serial_case,
22 22 build_environment_info,
... ...
scripts/benchmark_translation_local_models.py renamed to benchmarks/translation/benchmark_translation_local_models.py
... ... @@ -22,7 +22,7 @@ from typing import Any, Dict, Iterable, List, Sequence
22 22 import torch
23 23 import transformers
24 24  
25   -PROJECT_ROOT = Path(__file__).resolve().parent.parent
  25 +PROJECT_ROOT = Path(__file__).resolve().parents[2]
26 26 if str(PROJECT_ROOT) not in sys.path:
27 27 sys.path.insert(0, str(PROJECT_ROOT))
28 28  
... ...
scripts/benchmark_translation_local_models_focus.py renamed to benchmarks/translation/benchmark_translation_local_models_focus.py
... ... @@ -11,12 +11,12 @@ from datetime import datetime
11 11 from pathlib import Path
12 12 from typing import Any, Dict, List
13 13  
14   -PROJECT_ROOT = Path(__file__).resolve().parent.parent
  14 +PROJECT_ROOT = Path(__file__).resolve().parents[2]
15 15 if str(PROJECT_ROOT) not in sys.path:
16 16 sys.path.insert(0, str(PROJECT_ROOT))
17 17  
18 18 from config.services_config import get_translation_config
19   -from scripts.benchmark_translation_local_models import (
  19 +from benchmarks.translation.benchmark_translation_local_models import (
20 20 SCENARIOS,
21 21 benchmark_concurrency_case,
22 22 benchmark_serial_case,
... ...
scripts/benchmark_translation_longtext_single.py renamed to benchmarks/translation/benchmark_translation_longtext_single.py
... ... @@ -13,7 +13,7 @@ from pathlib import Path
13 13  
14 14 import torch
15 15  
16   -PROJECT_ROOT = Path(__file__).resolve().parent.parent
  16 +PROJECT_ROOT = Path(__file__).resolve().parents[2]
17 17  
18 18 import sys
19 19  
... ...
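The renamed benchmark scripts move one directory deeper (`scripts/` → `benchmarks/translation/`), which is why each `Path(__file__).resolve().parent.parent` becomes `parents[2]`. A minimal sketch of the difference, using a hypothetical absolute path:

```python
from pathlib import Path

# Hypothetical location after the rename:
#   <repo>/benchmarks/translation/benchmark_translation_local_models.py
script = Path("/repo/benchmarks/translation/benchmark_translation_local_models.py")

# The old layout was <repo>/scripts/<file>, so the repo root was two levels up:
old_root = script.parent.parent   # now points at /repo/benchmarks (wrong after the move)

# parents[0] is the containing directory, parents[1] its parent, and so on,
# so three levels up is parents[2]:
new_root = script.parents[2]      # /repo (correct for the new layout)

print(old_root)  # /repo/benchmarks
print(new_root)  # /repo
```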
config/config.yaml
1   -# Unified Configuration for Multi-Tenant Search Engine
2   -# 统一配置文件,所有租户共用一套配置
3   -# 注意:索引结构由 mappings/search_products.json 定义,此文件只配置搜索行为
4   -#
5   -# 约定:下列键为必填;进程环境变量可覆盖 infrastructure / runtime 中同名语义项
6   -#(如 ES_HOST、API_PORT 等),未设置环境变量时使用本文件中的值。
7   -
8   -# Process / bind addresses (环境变量 APP_ENV、RUNTIME_ENV、ES_INDEX_NAMESPACE 可覆盖前两者的语义)
9 1 runtime:
10 2 environment: prod
11 3 index_namespace: ''
... ... @@ -21,8 +13,6 @@ runtime:
21 13 translator_port: 6006
22 14 reranker_host: 0.0.0.0
23 15 reranker_port: 6007
24   -
25   -# 基础设施连接(敏感项优先读环境变量:ES_*、REDIS_*、DB_*、DASHSCOPE_API_KEY、DEEPL_AUTH_KEY)
26 16 infrastructure:
27 17 elasticsearch:
28 18 host: http://localhost:9200
... ... @@ -49,23 +39,12 @@ infrastructure:
49 39 secrets:
50 40 dashscope_api_key: null
51 41 deepl_auth_key: null
52   -
53   -# Elasticsearch Index
54 42 es_index_name: search_products
55   -
56   -# 检索域 / 索引列表(可为空列表;每项字段均需显式给出)
57 43 indexes: []
58   -
59   -# Config assets
60 44 assets:
61 45 query_rewrite_dictionary_path: config/dictionaries/query_rewrite.dict
62   -
63   -# Product content understanding (LLM enrich-content) configuration
64 46 product_enrich:
65 47 max_workers: 40
66   -
67   -# 离线 / Web 相关性评估(scripts/evaluation、eval-web)
68   -# CLI 未显式传参时使用此处默认值;search_base_url 未配置时自动为 http://127.0.0.1:{runtime.api_port}
69 48 search_evaluation:
70 49 artifact_root: artifacts/search_evaluation
71 50 queries_file: scripts/evaluation/queries/queries.txt
... ... @@ -74,10 +53,10 @@ search_evaluation:
74 53 search_base_url: ''
75 54 web_host: 0.0.0.0
76 55 web_port: 6010
77   - judge_model: qwen3.5-plus
  56 + judge_model: qwen3.6-plus
78 57 judge_enable_thinking: false
79 58 judge_dashscope_batch: false
80   - intent_model: qwen3-max
  59 + intent_model: qwen3.6-plus
81 60 intent_enable_thinking: true
82 61 judge_batch_completion_window: 24h
83 62 judge_batch_poll_interval_sec: 10.0
... ... @@ -98,20 +77,17 @@ search_evaluation:
98 77 rebuild_irrelevant_stop_ratio: 0.799
99 78 rebuild_irrel_low_combined_stop_ratio: 0.959
100 79 rebuild_irrelevant_stop_streak: 3
101   -
102   -# ES Index Settings (基础设置)
103 80 es_settings:
104 81 number_of_shards: 1
105 82 number_of_replicas: 0
106 83 refresh_interval: 30s
107 84  
108   -# 字段权重配置(用于搜索时的字段boost)
109   -# 统一按“字段基名”配置;查询时按实际检索语言动态拼接 .{lang}。
110   -# 若需要按某个语言单独调权,也可以加显式 key(例如 title.de: 3.2)。
  85 +# Configured by field base name; at query time the actual search-language suffix .{lang} is appended dynamically
111 86 field_boosts:
112 87 title: 3.0
113   - qanchors: 1.8
114   - enriched_tags: 1.8
  88 + # qanchors and enriched_tags also appear in enriched_attributes.value, so their effective weight is their own weight plus the enriched_attributes.value weight
  89 + qanchors: 1.0
  90 + enriched_tags: 1.0
115 91 enriched_attributes.value: 1.5
116 92 category_name_text: 2.0
117 93 category_path: 2.0
... ... @@ -124,38 +100,25 @@ field_boosts:
124 100 description: 1.0
125 101 vendor: 1.0
126 102  
127   -# Query Configuration(查询配置)
128 103 query_config:
129   - # 支持的语言
130 104 supported_languages:
131 105 - zh
132 106 - en
133 107 default_language: en
134   -
135   - # 功能开关(翻译开关由tenant_config控制)
136 108 enable_text_embedding: true
137 109 enable_query_rewrite: true
138 110  
139   - # 查询翻译模型(须与 services.translation.capabilities 中某项一致)
140   - # 源语种在租户 index_languages 内:主召回可打在源语种字段,用下面三项。
141   - zh_to_en_model: nllb-200-distilled-600m # "opus-mt-zh-en"
142   - en_to_zh_model: nllb-200-distilled-600m # "opus-mt-en-zh"
143   - default_translation_model: nllb-200-distilled-600m
144   - # zh_to_en_model: deepl
145   - # en_to_zh_model: deepl
146   - # default_translation_model: deepl
147   - # 源语种不在 index_languages:翻译对可检索文本更关键,可单独指定(缺省则与上一组相同)
148   - zh_to_en_model__source_not_in_index: nllb-200-distilled-600m
149   - en_to_zh_model__source_not_in_index: nllb-200-distilled-600m
150   - default_translation_model__source_not_in_index: nllb-200-distilled-600m
151   - # zh_to_en_model__source_not_in_index: deepl
152   - # en_to_zh_model__source_not_in_index: deepl
153   - # default_translation_model__source_not_in_index: deepl
  111 + zh_to_en_model: deepl # nllb-200-distilled-600m
  112 + en_to_zh_model: deepl
  113 + default_translation_model: deepl
  114 + # When the source language is not in index_languages, translation quality matters more, so these are configured separately
  115 + zh_to_en_model__source_not_in_index: deepl
  116 + en_to_zh_model__source_not_in_index: deepl
  117 + default_translation_model__source_not_in_index: deepl
154 118  
155   - # 查询解析阶段:翻译与 query 向量并发执行,共用同一等待预算(毫秒)。
156   - # 检测语言已在租户 index_languages 内:较短;不在索引语言内:较长(翻译对召回更关键)。
157   - translation_embedding_wait_budget_ms_source_in_index: 300 # 80
158   - translation_embedding_wait_budget_ms_source_not_in_index: 400 # 200
  119 + # Query-parsing stage: translation and query embedding run concurrently under one shared wait budget (ms)
  120 + translation_embedding_wait_budget_ms_source_in_index: 300
  121 + translation_embedding_wait_budget_ms_source_not_in_index: 400
159 122 style_intent:
160 123 enabled: true
161 124 selected_sku_boost: 1.2
... ... @@ -182,17 +145,15 @@ query_config:
182 145 product_title_exclusion:
183 146 enabled: true
184 147 dictionary_path: config/dictionaries/product_title_exclusion.tsv
185   -
186   - # 动态多语言检索字段配置
187   - # multilingual_fields 会被拼成 title.{lang}/brief.{lang}/... 形式;
188   - # shared_fields 为无语言后缀字段。
189 148 search_fields:
  149 + # Configured by field base name; at query time the actual search-language suffix .{lang} is appended dynamically
190 150 multilingual_fields:
191 151 - title
192 152 - keywords
193 153 - qanchors
194 154 - enriched_tags
195 155 - enriched_attributes.value
  156 + # - enriched_taxonomy_attributes.value
196 157 - option1_values
197 158 - option2_values
198 159 - option3_values
... ... @@ -202,13 +163,14 @@ query_config:
202 163 # - description
203 164 # - vendor
204 165 # shared_fields: 无语言后缀字段;示例: tags, option1_values, option2_values, option3_values
  166 +
205 167 shared_fields: null
206 168 core_multilingual_fields:
207 169 - title
208 170 - qanchors
209 171 - category_name_text
210 172  
211   - # 统一文本召回策略(主查询 + 翻译查询)
  173 + # Text recall (base query + translated query)
212 174 text_query_strategy:
213 175 base_minimum_should_match: 60%
214 176 translation_minimum_should_match: 60%
... ... @@ -223,14 +185,10 @@ query_config:
223 185 title: 5.0
224 186 qanchors: 4.0
225 187 phrase_match_boost: 3.0
226   -
227   - # Embedding字段名称
228 188 text_embedding_field: title_embedding
229 189 image_embedding_field: image_embedding.vector
230 190  
231   - # 返回字段配置(_source includes)
232   - # null表示返回所有字段,[]表示不返回任何字段,列表表示只返回指定字段
233   - # 下列字段与 api/result_formatter.py(SpuResult 填充)及 search/searcher.py(SKU 排序/主图替换)一致
  191 + # null returns all fields; [] returns no fields
234 192 source_fields:
235 193 - spu_id
236 194 - handle
... ... @@ -251,6 +209,8 @@ query_config:
251 209 # - qanchors
252 210 # - enriched_tags
253 211 # - enriched_attributes
  212 + # - enriched_taxonomy_attributes.value
  213 +
254 214 - min_price
255 215 - compare_at_price
256 216 - image_url
... ... @@ -270,26 +230,21 @@ query_config:
270 230 # KNN:文本向量与多模态(图片)向量各自 boost 与召回(k / num_candidates)
271 231 knn_text_boost: 4
272 232 knn_image_boost: 4
273   -
274   - # knn_text_num_candidates = k * 3.4
275 233 knn_text_k: 160
276   - knn_text_num_candidates: 560
  234 + knn_text_num_candidates: 560 # = knn_text_k * 3.5
277 235 knn_text_k_long: 400
278 236 knn_text_num_candidates_long: 1200
279 237 knn_image_k: 400
280 238 knn_image_num_candidates: 1200
281 239  
282   -# Function Score配置(ES层打分规则)
283 240 function_score:
284 241 score_mode: sum
285 242 boost_mode: multiply
286 243 functions: []
287   -
288   -# 粗排配置(仅融合 ES 文本/向量信号,不调用模型)
289 244 coarse_rank:
290 245 enabled: true
291   - input_window: 700
292   - output_window: 240
  246 + input_window: 480
  247 + output_window: 160
293 248 fusion:
294 249 es_bias: 10.0
295 250 es_exponent: 0.05
... ... @@ -301,30 +256,29 @@ coarse_rank:
301 256 knn_text_weight: 1.0
302 257 knn_image_weight: 2.0
303 258 knn_tie_breaker: 0.3
304   - knn_bias: 0.6
305   - knn_exponent: 0.4
306   -
307   -# 精排配置(轻量 reranker)
  259 + knn_bias: 0.0
  260 + knn_exponent: 5.6
  261 + knn_text_exponent: 0.0
  262 + knn_image_exponent: 0.0
308 263 fine_rank:
309   - enabled: false
  264 + enabled: false # when false, results pass through in original order
310 265 input_window: 160
311 266 output_window: 80
312 267 timeout_sec: 10.0
313 268 rerank_query_template: '{query}'
314 269 rerank_doc_template: '{title}'
315 270 service_profile: fine
316   -
317   -# 重排配置(provider/URL 在 services.rerank)
318 271 rerank:
319   - enabled: true
  272 + enabled: false # when false, results pass through in original order
320 273 rerank_window: 160
  274 + exact_knn_rescore_enabled: true
  275 + exact_knn_rescore_window: 160
321 276 timeout_sec: 15.0
322 277 weight_es: 0.4
323 278 weight_ai: 0.6
324 279 rerank_query_template: '{query}'
325 280 rerank_doc_template: '{title}'
326 281 service_profile: default
327   -
328 282 # 乘法融合:fused = Π (max(score,0) + bias) ** exponent(es / rerank / fine / text / knn)
329 283 # 其中 knn_score 先做一层 dis_max:
330 284 # max(knn_text_weight * text_knn, knn_image_weight * image_knn)
... ... @@ -337,30 +291,28 @@ rerank:
337 291 fine_bias: 0.1
338 292 fine_exponent: 1.0
339 293 text_bias: 0.1
340   - text_exponent: 0.25
341 294 # base_query_trans_* 相对 base_query 的权重(见 search/rerank_client 中文本 dismax 融合)
  295 + text_exponent: 0.25
342 296 text_translation_weight: 0.8
343 297 knn_text_weight: 1.0
344 298 knn_image_weight: 2.0
345 299 knn_tie_breaker: 0.3
346   - knn_bias: 0.6
347   - knn_exponent: 0.4
  300 + knn_bias: 0.0
  301 + knn_exponent: 5.6
348 302  
349   -# 可扩展服务/provider 注册表(单一配置源)
350 303 services:
351 304 translation:
352 305 service_url: http://127.0.0.1:6006
353   - # default_model: nllb-200-distilled-600m
354 306 default_model: nllb-200-distilled-600m
355 307 default_scene: general
356 308 timeout_sec: 10.0
357 309 cache:
358 310 ttl_seconds: 62208000
359 311 sliding_expiration: true
360   - # When false, cache keys are exact-match per request model only (ignores model_quality_tiers for lookups).
361   - enable_model_quality_tier_cache: true
  312 + # When false, cache keys are exact-match per request model only (ignores model_quality_tiers for lookups)
362 313 # Higher tier = better quality. Multiple models may share one tier (同级).
363 314 # A request may reuse Redis keys from models with tier > A or tier == A (not from lower tiers).
  315 + enable_model_quality_tier_cache: true
364 316 model_quality_tiers:
365 317 deepl: 30
366 318 qwen-mt: 30
... ... @@ -454,13 +406,12 @@ services:
454 406 num_beams: 1
455 407 use_cache: true
456 408 embedding:
457   - provider: http # http
  409 + provider: http
458 410 providers:
459 411 http:
460 412 text_base_url: http://127.0.0.1:6005
461 413 image_base_url: http://127.0.0.1:6008
462   - # 服务内文本后端(embedding 进程启动时读取)
463   - backend: tei # tei | local_st
  414 + backend: tei
464 415 backends:
465 416 tei:
466 417 base_url: http://127.0.0.1:8080
... ... @@ -500,13 +451,13 @@ services:
500 451 request:
501 452 max_docs: 1000
502 453 normalize: true
503   - default_instance: default
504 454 # 命名实例:同一套 reranker 代码按实例名读取不同端口 / 后端 / runtime 目录。
  455 + default_instance: default
505 456 instances:
506 457 default:
507 458 host: 0.0.0.0
508 459 port: 6007
509   - backend: qwen3_vllm_score
  460 + backend: bge
510 461 runtime_dir: ./.runtime/reranker/default
511 462 fine:
512 463 host: 0.0.0.0
... ... @@ -543,6 +494,7 @@ services:
543 494 enforce_eager: false
544 495 infer_batch_size: 100
545 496 sort_by_doc_length: true
  497 +
546 498 # standard=_format_instruction__standard(固定 yes/no system);compact=_format_instruction(instruction 作 system 且 user 内重复 Instruct)
547 499 instruction_format: standard # compact standard
548 500 # instruction: "Given a query, score the product for relevance"
... ... @@ -556,6 +508,7 @@ services:
556 508 # instruction: "Rank products by query with category & style match prioritized"
557 509 # instruction: "Given a fashion shopping query, retrieve relevant products that answer the query"
558 510 instruction: rank products by given query
  511 +
559 512 # vLLM LLM.score()(跨编码打分)。独立高性能环境 .venv-reranker-score(vllm 0.18 固定版):./scripts/setup_reranker_venv.sh qwen3_vllm_score
560 513 # 与 qwen3_vllm 可共用同一 model_name / HF 缓存;venv 分离以便升级 vLLM 而不影响 generate 后端。
561 514 qwen3_vllm_score:
... ... @@ -583,15 +536,10 @@ services:
583 536 qwen3_transformers:
584 537 model_name: Qwen/Qwen3-Reranker-0.6B
585 538 instruction: rank products by given query
586   - # instruction: "Score the product’s relevance to the given query"
587 539 max_length: 8192
588 540 batch_size: 64
589 541 use_fp16: true
590   - # sdpa:默认无需 flash-attn;若已安装 flash_attn 可改为 flash_attention_2
591 542 attn_implementation: sdpa
592   - # Packed Transformers backend: shared query prefix + custom position_ids/attention_mask.
593   - # For 1 query + many short docs (for example 400 product titles), this usually reduces
594   - # repeated prefix work and padding waste compared with pairwise batching.
595 543 qwen3_transformers_packed:
596 544 model_name: Qwen/Qwen3-Reranker-0.6B
597 545 instruction: Rank products by query with category & style match prioritized
... ... @@ -600,8 +548,6 @@ services:
600 548 max_docs_per_pack: 0
601 549 use_fp16: true
602 550 sort_by_doc_length: true
603   - # Packed mode relies on a custom 4D attention mask. "eager" is the safest default.
604   - # If your torch/transformers stack validates it, you can benchmark "sdpa".
605 551 attn_implementation: eager
606 552 qwen3_gguf:
607 553 repo_id: DevQuasar/Qwen.Qwen3-Reranker-4B-GGUF
... ... @@ -609,7 +555,6 @@ services:
609 555 cache_dir: ./model_cache
610 556 local_dir: ./models/reranker/qwen3-reranker-4b-gguf
611 557 instruction: Rank products by query with category & style match prioritized
612   - # T4 16GB / 性能优先配置:全量层 offload,实测比保守配置明显更快
613 558 n_ctx: 512
614 559 n_batch: 512
615 560 n_ubatch: 512
... ... @@ -632,8 +577,6 @@ services:
632 577 cache_dir: ./model_cache
633 578 local_dir: ./models/reranker/qwen3-reranker-0.6b-q8_0-gguf
634 579 instruction: Rank products by query with category & style match prioritized
635   - # 0.6B GGUF / online rerank baseline:
636   - # 实测 400 titles 单请求约 265s,因此它更适合作为低显存功能后备,不适合在线低延迟主路由。
637 580 n_ctx: 256
638 581 n_batch: 256
639 582 n_ubatch: 256
... ... @@ -653,20 +596,15 @@ services:
653 596 verbose: false
654 597 dashscope_rerank:
655 598 model_name: qwen3-rerank
656   - # 按地域选择 endpoint:
657   - # 中国: https://dashscope.aliyuncs.com/compatible-api/v1/reranks
658   - # 新加坡: https://dashscope-intl.aliyuncs.com/compatible-api/v1/reranks
659   - # 美国: https://dashscope-us.aliyuncs.com/compatible-api/v1/reranks
660 599 endpoint: https://dashscope.aliyuncs.com/compatible-api/v1/reranks
661 600 api_key_env: RERANK_DASHSCOPE_API_KEY_CN
662 601 timeout_sec: 10.0
663   - top_n_cap: 0 # 0 表示 top_n=当前请求文档数;>0 则限制 top_n 上限
664   - batchsize: 64 # 0 关闭;>0 启用并发小包调度(top_n/top_n_cap 仍生效,分包后全局截断)
  602 + top_n_cap: 0 # 0 means top_n = number of docs in the current request
  603 + batchsize: 64 # 0 disables; >0 enables concurrent small-batch dispatch (top_n/top_n_cap still apply; global truncation after batching)
665 604 instruct: Given a shopping query, rank product titles by relevance
666 605 max_retries: 2
667 606 retry_backoff_sec: 0.2
668 607  
669   -# SPU配置(已启用,使用嵌套skus)
670 608 spu_config:
671 609 enabled: true
672 610 spu_field: spu_id
... ... @@ -678,7 +616,6 @@ spu_config:
678 616 - option2
679 617 - option3
680 618  
681   -# 租户配置(Tenant Configuration)
682 619 # 每个租户可配置主语言 primary_language 与索引语言 index_languages(主市场语言,商家可勾选)
683 620 # 默认 index_languages: [en, zh],可配置为任意 SOURCE_LANG_CODE_MAP.keys() 的子集
684 621 tenant_config:
... ...
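The fusion comment in this config describes a multiplicative combination, `fused = Π (max(score, 0) + bias) ** exponent`, with the KNN signal first reduced by a dis_max over the weighted text/image scores. A simplified sketch of that formula; parameter names mirror the YAML keys, but the production code in `search/rerank_client` also handles the tie breaker, translation weighting, and the new per-modality bias/exponent fields, all omitted here:

```python
def fused_score(scores: dict, params: dict) -> float:
    # dis_max over the weighted KNN scores, as described in the config comment:
    # max(knn_text_weight * text_knn, knn_image_weight * image_knn)
    knn = max(
        params["knn_text_weight"] * scores.get("text_knn", 0.0),
        params["knn_image_weight"] * scores.get("image_knn", 0.0),
    )
    fused = 1.0
    for name in ("es", "text", "knn"):
        score = knn if name == "knn" else scores.get(name, 0.0)
        # fused = prod over terms of (max(score, 0) + bias) ** exponent
        fused *= (max(score, 0.0) + params[f"{name}_bias"]) ** params[f"{name}_exponent"]
    return fused

# Values drawn from the fusion blocks in this diff (es_* from coarse_rank.fusion).
params = {
    "es_bias": 10.0, "es_exponent": 0.05,
    "text_bias": 0.1, "text_exponent": 0.25,
    "knn_bias": 0.0, "knn_exponent": 5.6,
    "knn_text_weight": 1.0, "knn_image_weight": 2.0,
}
print(fused_score({"es": 25.0, "text": 8.0, "text_knn": 0.82, "image_knn": 0.37}, params))
```

Note that with `knn_bias: 0.0`, a document with zero KNN score multiplies the fused score down to zero, which is a much harder gate than the previous `knn_bias: 0.6` / `knn_exponent: 0.4` setting.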
config/loader.py
... ... @@ -587,6 +587,14 @@ class AppConfigLoader:
587 587 knn_tie_breaker=float(coarse_fusion_raw.get("knn_tie_breaker", 0.0)),
588 588 knn_bias=float(coarse_fusion_raw.get("knn_bias", 0.6)),
589 589 knn_exponent=float(coarse_fusion_raw.get("knn_exponent", 0.2)),
  590 + knn_text_bias=float(
  591 + coarse_fusion_raw.get("knn_text_bias", coarse_fusion_raw.get("knn_bias", 0.6))
  592 + ),
  593 + knn_text_exponent=float(coarse_fusion_raw.get("knn_text_exponent", 0.0)),
  594 + knn_image_bias=float(
  595 + coarse_fusion_raw.get("knn_image_bias", coarse_fusion_raw.get("knn_bias", 0.6))
  596 + ),
  597 + knn_image_exponent=float(coarse_fusion_raw.get("knn_image_exponent", 0.0)),
590 598 text_translation_weight=float(
591 599 coarse_fusion_raw.get("text_translation_weight", 0.8)
592 600 ),
... ... @@ -608,6 +616,12 @@ class AppConfigLoader:
608 616 rerank=RerankConfig(
609 617 enabled=bool(rerank_cfg.get("enabled", True)),
610 618 rerank_window=int(rerank_cfg.get("rerank_window", 384)),
  619 + exact_knn_rescore_enabled=bool(
  620 + rerank_cfg.get("exact_knn_rescore_enabled", False)
  621 + ),
  622 + exact_knn_rescore_window=int(
  623 + rerank_cfg.get("exact_knn_rescore_window", 0)
  624 + ),
611 625 timeout_sec=float(rerank_cfg.get("timeout_sec", 15.0)),
612 626 weight_es=float(rerank_cfg.get("weight_es", 0.4)),
613 627 weight_ai=float(rerank_cfg.get("weight_ai", 0.6)),
... ... @@ -630,6 +644,14 @@ class AppConfigLoader:
630 644 knn_tie_breaker=float(fusion_raw.get("knn_tie_breaker", 0.0)),
631 645 knn_bias=float(fusion_raw.get("knn_bias", 0.6)),
632 646 knn_exponent=float(fusion_raw.get("knn_exponent", 0.2)),
  647 + knn_text_bias=float(
  648 + fusion_raw.get("knn_text_bias", fusion_raw.get("knn_bias", 0.6))
  649 + ),
  650 + knn_text_exponent=float(fusion_raw.get("knn_text_exponent", 0.0)),
  651 + knn_image_bias=float(
  652 + fusion_raw.get("knn_image_bias", fusion_raw.get("knn_bias", 0.6))
  653 + ),
  654 + knn_image_exponent=float(fusion_raw.get("knn_image_exponent", 0.0)),
633 655 fine_bias=float(fusion_raw.get("fine_bias", 0.00001)),
634 656 fine_exponent=float(fusion_raw.get("fine_exponent", 1.0)),
635 657 text_translation_weight=float(
... ... @@ -655,6 +677,14 @@ class AppConfigLoader:
655 677  
656 678 translation_raw = raw.get("translation") if isinstance(raw.get("translation"), dict) else {}
657 679 normalized_translation = build_translation_config(translation_raw)
  680 + local_translation_backends = {"local_nllb", "local_marian"}
  681 + for capability_name, capability_cfg in normalized_translation["capabilities"].items():
  682 + backend_name = str(capability_cfg.get("backend") or "").strip().lower()
  683 + if backend_name not in local_translation_backends:
  684 + continue
  685 + for path_key in ("model_dir", "ct2_model_dir"):
  686 + if capability_cfg.get(path_key) not in (None, ""):
  687 + capability_cfg[path_key] = str(self._resolve_project_path_value(capability_cfg[path_key]).resolve())
658 688 translation_config = TranslationServiceConfig(
659 689 endpoint=str(normalized_translation["service_url"]).rstrip("/"),
660 690 timeout_sec=float(normalized_translation["timeout_sec"]),
... ... @@ -749,7 +779,7 @@ class AppConfigLoader:
749 779 port=port,
750 780 backend=backend_name,
751 781 runtime_dir=(
752   - str(v)
  782 + str(self._resolve_project_path_value(v).resolve())
753 783 if (v := instance_raw.get("runtime_dir")) not in (None, "")
754 784 else None
755 785 ),
... ... @@ -787,6 +817,12 @@ class AppConfigLoader:
787 817 rerank=rerank_config,
788 818 )
789 819  
  820 + def _resolve_project_path_value(self, value: Any) -> Path:
  821 + candidate = Path(str(value)).expanduser()
  822 + if candidate.is_absolute():
  823 + return candidate
  824 + return self.project_root / candidate
  825 +
790 826 def _build_tenants_config(self, raw: Dict[str, Any]) -> TenantCatalogConfig:
791 827 if not isinstance(raw, dict):
792 828 raise ConfigurationError("tenant_config must be a mapping")
... ...
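The new `_resolve_project_path_value` helper anchors relative config paths (for example `runtime_dir` and the local translation `model_dir` / `ct2_model_dir` values) at the project root instead of the process working directory. A standalone sketch of its behavior; the loader additionally calls `.resolve()` on the result at the call sites:

```python
from pathlib import Path

def resolve_project_path(value, project_root: Path) -> Path:
    # Expand ~, keep absolute paths as-is, anchor relative paths at the repo root.
    candidate = Path(str(value)).expanduser()
    if candidate.is_absolute():
        return candidate
    return project_root / candidate

root = Path("/srv/saas-search")
print(resolve_project_path("./.runtime/reranker/default", root))  # /srv/saas-search/.runtime/reranker/default
print(resolve_project_path("/var/cache/models", root))            # /var/cache/models
```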
config/schema.py
... ... @@ -119,6 +119,18 @@ class RerankFusionConfig:
119 119 knn_tie_breaker: float = 0.0
120 120 knn_bias: float = 0.6
121 121 knn_exponent: float = 0.2
  122 + #: Optional additive floor for the weighted text KNN term.
  123 + #: Falls back to knn_bias when omitted in config loading.
  124 + knn_text_bias: float = 0.6
  125 + #: Optional extra multiplicative term on weighted text KNN.
  126 + #: Uses knn_text_bias as the additive floor.
  127 + knn_text_exponent: float = 0.0
  128 + #: Optional additive floor for the weighted image KNN term.
  129 + #: Falls back to knn_bias when omitted in config loading.
  130 + knn_image_bias: float = 0.6
  131 + #: Optional extra multiplicative term on weighted image KNN.
  132 + #: Uses knn_image_bias as the additive floor.
  133 + knn_image_exponent: float = 0.0
122 134 fine_bias: float = 0.00001
123 135 fine_exponent: float = 1.0
124 136 #: 翻译子句 named query 分数相对原文 base_query 的权重(加权后再与原文做 dismax 融合)
... ... @@ -143,6 +155,18 @@ class CoarseRankFusionConfig:
143 155 knn_tie_breaker: float = 0.0
144 156 knn_bias: float = 0.6
145 157 knn_exponent: float = 0.2
  158 + #: Optional additive floor for the weighted text KNN term.
  159 + #: Falls back to knn_bias when omitted in config loading.
  160 + knn_text_bias: float = 0.6
  161 + #: Optional extra multiplicative term on weighted text KNN.
  162 + #: Uses knn_text_bias as the additive floor.
  163 + knn_text_exponent: float = 0.0
  164 + #: Optional additive floor for the weighted image KNN term.
  165 + #: Falls back to knn_bias when omitted in config loading.
  166 + knn_image_bias: float = 0.6
  167 + #: Optional extra multiplicative term on weighted image KNN.
  168 + #: Uses knn_image_bias as the additive floor.
  169 + knn_image_exponent: float = 0.0
146 170 #: 翻译子句 named query 分数相对原文 base_query 的权重(加权后再与原文做 dismax 融合)
147 171 text_translation_weight: float = 0.8
148 172  
... ... @@ -176,6 +200,9 @@ class RerankConfig:
176 200  
177 201 enabled: bool = True
178 202 rerank_window: int = 384
  203 + exact_knn_rescore_enabled: bool = False
  204 + #: topN exact vector scoring window; <=0 means "follow rerank_window"
  205 + exact_knn_rescore_window: int = 0
179 206 timeout_sec: float = 15.0
180 207 weight_es: float = 0.4
181 208 weight_ai: float = 0.6
... ...
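The schema comment states that a non-positive `exact_knn_rescore_window` means "follow rerank_window". A tiny sketch of that fallback, assuming consumers apply it when sizing the exact vector-scoring pass:

```python
def effective_rescore_window(rerank_window: int, exact_knn_rescore_window: int) -> int:
    # <=0 means "follow rerank_window", per the schema docstring.
    if exact_knn_rescore_window > 0:
        return exact_knn_rescore_window
    return rerank_window

print(effective_rescore_window(160, 0))   # 160
print(effective_rescore_window(160, 64))  # 64
```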
docs/DEVELOPER_GUIDE.md
... ... @@ -389,7 +389,7 @@ services:
389 389 - **位置**:`tests/`,可按 `unit/`、`integration/` 或按模块划分子目录;公共 fixture 在 `conftest.py`。
390 390 - **标记**:使用 `@pytest.mark.unit`、`@pytest.mark.integration`、`@pytest.mark.api` 等区分用例类型,便于按需运行。
391 391 - **依赖**:单元测试通过 mock(如 `mock_es_client`、`sample_search_config`)不依赖真实 ES/DB;集成测试需在说明中注明依赖服务。
392   -- **运行**:`python -m pytest tests/`;仅单元:`python -m pytest tests/unit/` 或 `-m unit`
  392 +- **运行**:`python -m pytest tests/`;推荐最小回归:`python -m pytest tests/ci -q`;按模块聚焦可直接指定具体测试文件
393 393 - **原则**:新增逻辑应有对应测试;修改协议或配置契约时更新相关测试与 fixture。
394 394  
395 395 ### 8.3 配置与环境
... ...
docs/QUICKSTART.md
... ... @@ -69,7 +69,7 @@ source activate.sh
69 69 ./run.sh all
70 70 # 仅为薄封装:等价于 ./scripts/service_ctl.sh up all
71 71 # 说明:
72   -# - all = tei cnclip embedding embedding-image translator reranker reranker-fine backend indexer frontend eval-web
  72 +# - all = tei cnclip embedding embedding-image translator reranker backend indexer frontend eval-web
73 73 # - up 会同时启动 monitor daemon(运行期连续失败自动重启)
74 74 # - reranker 为 GPU 强制模式(资源不足会直接启动失败)
75 75 # - TEI 默认使用 GPU;当 TEI_DEVICE=cuda 且 GPU 不可用时会直接失败(不会自动降级到 CPU)
... ... @@ -166,7 +166,7 @@ curl -X POST http://localhost:6008/embed/image \
166 166  
167 167 ```bash
168 168 ./scripts/setup_translator_venv.sh
169   -./.venv-translator/bin/python scripts/download_translation_models.py --all-local # 如需本地模型
  169 +./.venv-translator/bin/python scripts/translation/download_translation_models.py --all-local # 如需本地模型
170 170 ./scripts/start_translator.sh
171 171  
172 172 curl -X POST http://localhost:6006/translate \
... ...
docs/Usage-Guide.md
... ... @@ -126,7 +126,7 @@ cd /data/saas-search
126 126  
127 127 这个脚本会自动:
128 128 1. 创建日志目录
129   -2. 按目标启动服务(`all`:`tei cnclip embedding embedding-image translator reranker reranker-fine backend indexer frontend eval-web`)
  129 +2. 按目标启动服务(`all`:`tei cnclip embedding embedding-image translator reranker backend indexer frontend eval-web`)
130 130 3. 写入 PID 到 `logs/*.pid`
131 131 4. 执行健康检查
132 132 5. 启动 monitor daemon(运行期连续失败自动重启)
... ... @@ -202,7 +202,7 @@ python -m pytest -q tests/test_rerank_client.py tests/test_es_query_builder.py t
202 202 ./scripts/service_ctl.sh restart backend
203 203 sleep 3
204 204 ./scripts/service_ctl.sh status backend
205   -./scripts/evaluation/start_eval.sh.sh batch
  205 +./scripts/evaluation/start_eval.sh batch
206 206 ```
207 207  
208 208 离线批量评估会把标注与报表写到 `artifacts/search_evaluation/`(SQLite、`batch_reports/` 下的 JSON/Markdown 等)。说明与命令见 [scripts/evaluation/README.md](../scripts/evaluation/README.md)。
... ...
docs/caches-inventory.md 0 → 100644
... ... @@ -0,0 +1,133 @@
  1 +# 本项目缓存一览
  2 +
  3 +本文档梳理仓库内**与业务相关的各类缓存**:说明用途、键与过期策略,并汇总运维脚本。按「分布式(Redis)→ 进程内 → 磁盘/模型 → 第三方」组织。
  4 +
  5 +---
  6 +
  7 +## 一、Redis 集中式缓存(生产主路径)
  8 +
  9 +所有下列缓存默认连接 **`infrastructure.redis`**(`config/config.yaml` 与 `REDIS_*` 环境变量),**数据库编号一般为 `db=0`**(脚本可通过参数覆盖)。`snapshot_db` 仅在配置中存在,供快照/运维场景选用,应用代码未按该字段切换业务缓存的 DB。
  10 +
  11 +### 1. 文本 / 图像向量缓存(Embedding)
  12 +
  13 +- **作用**:缓存 BGE/TEI 文本向量与 CN-CLIP 图像向量、CLIP 文本塔向量,避免重复推理。
  14 +- **实现**:`embeddings/redis_embedding_cache.py` 的 `RedisEmbeddingCache`;键构造见 `embeddings/cache_keys.py`。
  15 +- **Key 形态**(最终 Redis 键 = `前缀` + `可选 namespace` + `逻辑键`):
  16 + - **前缀**:`infrastructure.redis.embedding_cache_prefix`(默认 `embedding`,可用 `REDIS_EMBEDDING_CACHE_PREFIX` 覆盖)。
  17 + - **命名空间**:`embeddings/server.py` 与客户端中分为:
  18 + - 文本:`namespace=""` → `{prefix}:{embed:norm0|1:...}`
  19 + - 图像:`namespace="image"` → `{prefix}:image:{embed:模型名:txt:norm0|1:...}`
  20 + - CLIP 文本:`namespace="clip_text"` → `{prefix}:clip_text:{embed:模型名:img:norm0|1:...}`
  21 + - 逻辑键段含 `embed:`、`norm0/1`、模型名(多模态)、过长文本/URL 时用 `h:sha256:...` 摘要(见 `cache_keys.py` 注释)。
  22 +- **值格式**:BF16 压缩后的字节(`embeddings/bf16.py`),非 JSON。
  23 +- **TTL**:`infrastructure.redis.cache_expire_days`(默认 **720 天**,`REDIS_CACHE_EXPIRE_DAYS`)。写入用 `SETEX`;**命中时滑动续期**(`EXPIRE` 刷新为同一时长)。
  24 +- **Redis 客户端**:`decode_responses=False`(二进制)。
  25 +
  26 +**主要代码**:`embeddings/server.py`、`embeddings/text_encoder.py`、`embeddings/image_encoder.py`。
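上面「写入用 `SETEX`、命中时滑动续期」的语义,可以用一个内存版的小示意帮助理解(仅为示意,非 `RedisEmbeddingCache` 实现本身;真实实现见 `embeddings/redis_embedding_cache.py`):

```python
import time

class SlidingTTLCache:
    """内存版示意:写入即设定 TTL(对应 Redis SETEX),命中时把 TTL 刷新为同一时长(对应 EXPIRE)。"""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expire_at)

    def set(self, key: str, value: bytes) -> None:
        # 对应 SETEX:写入同时设定过期时间
        self._store[key] = (value, time.monotonic() + self.ttl)

    def get(self, key: str):
        item = self._store.get(key)
        if item is None:
            return None
        value, expire_at = item
        if time.monotonic() >= expire_at:
            del self._store[key]
            return None
        # 对应命中时 EXPIRE:滑动续期
        self._store[key] = (value, time.monotonic() + self.ttl)
        return value
```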
  27 +
  28 +---
  29 +
  30 +### 2. 翻译结果缓存(Translation)
  31 +
  32 +- **作用**:按「翻译模型 + 目标语言 + 原文」缓存译文;支持**模型质量分层探测**(高 tier 模型写入的缓存可被同 tier 或更高 tier 的请求命中,见 `translation/settings.py` 中 `translation_cache_probe_models`)。
  33 +- **Key 形态**:`trans:{model}:{target_lang}:{text前4字符}{sha256全文}`(`translation/cache.py` 的 `build_key`)。
  34 +- **值格式**:UTF-8 译文字符串。
  35 +- **TTL**:`services.translation.cache.ttl_seconds`(默认 **62208000 秒 = 720 天**)。若 `sliding_expiration: true`,命中时刷新 TTL。
  36 +- **能力级开关**:各 `capabilities.*.use_cache` 为 `false` 时该后端不落 Redis。
  37 +- **Redis 客户端**:`decode_responses=True`。
  38 +
  39 +**主要代码**:`translation/cache.py`、`translation/service.py`;翻译 HTTP 服务:`api/translator_app.py`(`get_translation_service()` 使用 `lru_cache` 单例,见下文进程内缓存)。
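按上述 Key 形态,键的构造方式可以还原为如下草图(与本节描述一致的示意,实际以 `translation/cache.py` 中 `build_key` 为准):

```python
import hashlib

def build_translation_key(model: str, target_lang: str, text: str) -> str:
    """示意:trans:{model}:{target_lang}:{text前4字符}{sha256全文}。
    仅为与文档描述一致的草图,实际实现以 translation/cache.py 的 build_key 为准。"""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"trans:{model}:{target_lang}:{text[:4]}{digest}"
```

前 4 个字符保留可读性(便于 `redis-cli` 按前缀排查),sha256 保证全文唯一。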
  40 +
  41 +---
  42 +
  43 +### 3. 商品内容理解 / Anchors 与语义分析缓存(Indexer)
  44 +
  45 +- **作用**:缓存 LLM 对商品标题等拼出的 **prompt 输入** 所做的分析结果(anchors、语义属性等),避免重复调用大模型。键与 `analysis_kind`、`prompt` 契约版本、`target_lang` 及输入摘要相关。
  46 +- **Key 形态**:`{anchor_cache_prefix}:{analysis_kind}:{prompt_contract_hash[:12]}:{target_lang}:{prompt_input[:4]}{md5}`(`indexer/product_enrich.py` 中 `_make_analysis_cache_key`)。
  47 +- **前缀**:`infrastructure.redis.anchor_cache_prefix`(默认 `product_anchors`,`REDIS_ANCHOR_CACHE_PREFIX`)。
  48 +- **值格式**:JSON 字符串(规范化后的分析结果)。
  49 +- **TTL**:`anchor_cache_expire_days`(默认 **30 天**),以秒写入 `SETEX`(**非滑动**,与向量/翻译不同)。
  50 +- **读逻辑**:无 TTL 刷新;仅校验内容是否「有意义」再返回。
  51 +
  52 +**主要代码**:`indexer/product_enrich.py`;与 HTTP 侧对齐说明见 `api/routes/indexer.py` 注释。
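同样可以把本节的 Key 形态写成草图。注意:契约 hash 的具体算法此处是假设(示意用 sha256 截断),实际以 `indexer/product_enrich.py` 中 `_make_analysis_cache_key` 为准:

```python
import hashlib

def make_analysis_cache_key(prefix, analysis_kind, prompt_contract, target_lang, prompt_input):
    """示意:{prefix}:{kind}:{contract_hash[:12]}:{lang}:{input前4字符}{md5}。
    contract_hash 的算法为假设;实际以 _make_analysis_cache_key 为准。"""
    contract_hash = hashlib.sha256(prompt_contract.encode("utf-8")).hexdigest()[:12]
    input_digest = hashlib.md5(prompt_input.encode("utf-8")).hexdigest()
    return f"{prefix}:{analysis_kind}:{contract_hash}:{target_lang}:{prompt_input[:4]}{input_digest}"
```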
  53 +
  54 +---
  55 +
  56 +## 二、进程内缓存(非共享、随进程重启失效)
  57 +
  58 +| 名称 | 用途 | 范围/生命周期 |
  59 +|------|------|----------------|
  60 +| **`get_app_config()`** | 解析并缓存全局 `AppConfig` | `config/loader.py`:`@lru_cache(maxsize=1)`;`reload_app_config()` 可 `cache_clear()` |
  61 +| **`TranslationService` 单例** | 翻译服务进程内复用后端与 Redis 客户端 | `api/translator_app.py`:`get_translation_service()` |
  62 +| **`_nllb_tokenizer_code_by_normalized_key`** | NLLB tokenizer 语言码映射 | `translation/languages.py`:`@lru_cache(maxsize=1)` |
  63 +| **`QueryTextAnalysisCache`** | 单次查询解析内复用分词、tokenizer 结果 | `query/tokenization.py`,随 `QueryParser` 一次 parse |
  64 +| **`_SelectionContext`(SKU 意图)** | 归一化文本、分词、匹配布尔等小字典 | `search/sku_intent_selector.py`,单次选择流程 |
  65 +| **`incremental_service` transformer 缓存** | 按 `tenant_id` 缓存文档转换器 | `indexer/incremental_service.py`,**无界**、多租户进程长期存活时需注意内存 |
  66 +| **NLLB batch 内 `token_count_cache`** | 同一 batch 内避免重复计 token | `translation/backends/local_ctranslate2.py` |
  67 +| **CLIP 分词器 `@lru_cache`**(第三方) | 简单 tokenizer 缓存 | `third-party/clip-as-service/.../simple_tokenizer.py` |
  68 +
  69 +**说明**:`utils/cache.py` 中的 **`DictCache`**(文件 JSON:默认 `.cache/dict_cache.json`)已导出,但仓库内**无直接 `DictCache(` 调用**,视为可复用工具/预留,非当前主路径。
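表中 `get_app_config()` 的「`lru_cache` 单例 + `cache_clear()` 重载」是本项目进程内缓存的典型模式,可用最小示意说明(`AppConfig` 此处用 dict 代替,非 `config/loader.py` 原实现):

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_app_config() -> dict:
    # 示意:真实实现位于 config/loader.py,解析并返回 AppConfig;此处用 dict 代替
    return {"loaded": True, "redis": {"db": 0}}

def reload_app_config() -> dict:
    # 对应文档中 reload_app_config() 通过 cache_clear() 使单例失效的模式
    get_app_config.cache_clear()
    return get_app_config()
```

重复调用返回同一对象;`reload_app_config()` 之后得到内容相同但重新构建的新对象。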
  70 +
  71 +---
  72 +
  73 +## 三、磁盘与模型相关「缓存」(非 Redis)
  74 +
  75 +| 名称 | 用途 | 配置/位置 |
  76 +|------|------|-----------|
  77 +| **Hugging Face / 本地模型目录** | 重排器、翻译本地模型等权重下载与缓存 | `services.rerank.backends.*.cache_dir` 等,常见默认 **`./model_cache`**(`config/config.yaml`) |
  78 +| **vLLM `enable_prefix_caching`** | 重排服务内 **Prefix KV 缓存**(加速同前缀批推理) | `services.rerank.backends.qwen3_vllm*`、`reranker/backends/qwen3_vllm*.py` |
  79 +| **运行时目录** | 重排服务状态/引擎文件 | `services.rerank.instances.*.runtime_dir`(如 `./.runtime/reranker/...`) |
  80 +
  81 +翻译能力里的 **`use_cache: true`**(如 NLLB、Marian)在多数后端指 **推理时的 KV cache(Transformer)**,与 Redis 译文缓存是不同层次;Redis 译文缓存仍由 `TranslationCache` 控制。
  82 +
  83 +---
  84 +
  85 +## 四、Elasticsearch 内部缓存
  86 +
  87 +索引设置中的 `refresh_interval` 等影响近实时可见性,但**不属于应用层键值缓存**。若需调优 ES 查询缓存、节点堆等,见运维文档与集群配置,此处不展开。
  88 +
  89 +---
  90 +
  91 +## 五、运维与巡检脚本(Redis)
  92 +
  93 +| 脚本 | 作用 |
  94 +|------|------|
  95 +| `scripts/redis/redis_cache_health_check.py` | 按 **embedding / translation / anchors** 三类前缀巡检:key 数量估算、TTL 采样、`IDLETIME` 等 |
  96 +| `scripts/redis/redis_cache_prefix_stats.py` | 按前缀统计 key 数量与 **MEMORY USAGE**(可多 DB) |
  97 +| `scripts/redis/redis_memory_heavy_keys.py` | 扫描占用内存最大的 key,辅助排查「统计与总内存不一致」 |
  98 +| `scripts/redis/monitor_eviction.py` | 实时监控 **eviction** 相关事件,用于容量与驱逐策略排查 |
  99 +
  100 +使用前需加载项目配置(如 `source activate.sh`)以保证 `REDIS_CONFIG` 与生产一致。脚本注释中给出了 **`redis-cli` 手工统计**示例(按前缀 `wc -l`、`MEMORY STATS` 等)。
  101 +
  102 +---
  103 +
  104 +## 六、总表(Redis 与各层缓存)
  105 +
  106 +| 缓存名称 | 业务模块 | 存储 | Key 前缀 / 命名模式 | 过期时间 | 过期策略 | 值摘要 | 配置键 / 环境变量 |
  107 +|----------|----------|------|---------------------|----------|----------|--------|-------------------|
  108 +| 文本向量 | 检索 / 索引 / Embedding 服务 | Redis db≈0 | `{embedding_cache_prefix}:*`(逻辑键以 `embed:norm…` 开头) | `cache_expire_days`(默认 720 天) | 写入 TTL + 命中滑动续期 | BF16 字节向量 | `infrastructure.redis.*`;`REDIS_EMBEDDING_CACHE_PREFIX`、`REDIS_CACHE_EXPIRE_DAYS` |
  109 +| 图像向量(CLIP 图) | 图搜 / 多模态 | 同上 | `{prefix}:image:*` | 同上 | 同上 | BF16 字节 | 同上 |
  110 +| CLIP 文本塔向量 | 图搜文本侧 | 同上 | `{prefix}:clip_text:*` | 同上 | 同上 | BF16 字节 | 同上 |
  111 +| 翻译译文 | 查询翻译、翻译服务 | 同上 | `trans:{model}:{lang}:*` | `services.translation.cache.ttl_seconds`(默认 720 天) | 可配置滑动(`sliding_expiration`) | UTF-8 字符串 | `services.translation.cache.*`;各能力 `use_cache` |
  112 +| 商品分析 / Anchors | 索引富化、LLM 内容理解 | 同上 | `{anchor_cache_prefix}:{kind}:{hash}:{lang}:*` | `anchor_cache_expire_days`(默认 30 天) | 固定 TTL,不滑动 | JSON 字符串 | `anchor_cache_prefix`、`anchor_cache_expire_days`;`REDIS_ANCHOR_*` |
  113 +| 应用配置 | 全栈 | 进程内存 | N/A(单例) | 进程生命周期 | `reload_app_config` 清除 | `AppConfig` 对象 | `config/loader.py` |
  114 +| 翻译服务实例 | 翻译 API | 进程内存 | N/A | 进程生命周期 | 单例 | `TranslationService` | `api/translator_app.py` |
  115 +| 查询分词缓存 | 查询解析 | 单次请求内 | N/A | 单次 parse | — | 分词与中间结果 | `query/tokenization.py` |
  116 +| SKU 意图辅助字典 | 搜索排序辅助 | 单次请求内 | N/A | 单次选择 | — | 小 dict | `search/sku_intent_selector.py` |
  117 +| 增量索引 Transformer | 索引管道 | 进程内存 | `tenant_id` 字符串键 | 长期(无界) | 无自动淘汰 | Transformer 元组 | `indexer/incremental_service.py` |
  118 +| 重排 / 翻译模型权重 | 推理服务 | 本地磁盘 | 目录路径 | 无自动删除(人工清理) | — | 模型文件 | `cache_dir: ./model_cache` 等 |
  119 +| vLLM Prefix 缓存 | 重排(Qwen3 等) | GPU/引擎内 | 引擎内部 | 引擎管理 | — | KV Cache | `enable_prefix_caching` |
  120 +| 文件 Dict 缓存(可选) | 通用 | `.cache/dict_cache.json` | 分类 + 自定义 key | 持久直至删除 | — | JSON 可序列化值 | `utils/cache.py`(当前无调用方) |
  121 +
  122 +---
  123 +
  124 +## 七、维护建议(简要)
  125 +
  126 +1. **容量**:三类 Redis 缓存(embedding / trans / anchors)可共用同一实例;大租户或图搜多时 **embedding** 与 **trans** 往往占主要内存,可用 `redis_cache_prefix_stats.py` 分前缀观察。
  127 +2. **键迁移**:变更 `embedding_cache_prefix`、CLIP `model_name` 或 prompt 契约会自然**隔离新键空间**;旧键依赖 TTL 或人工批量删除。
  128 +3. **一致性**:向量缓存对异常向量会 **delete key**(`RedisEmbeddingCache.get`);anchors 依赖 `cache_version` 与契约 hash 防止错误复用。
  129 +4. **监控**:除脚本外,Embedding HTTP 服务健康检查会报告各 lane 的 **`cache_enabled`**(`embeddings/server.py`)。
  130 +
  131 +---
  132 +
  133 +*文档随代码扫描生成;若新增 Redis 用途,请同步更新本文件与 `scripts/redis/redis_cache_health_check.py` 中的 `_load_known_cache_types()`。*
... ...
docs/issues/issue-2026-04-08-eval框架主指标ERR的问题以及bm25调参-done-0408.md 0 → 100644
... ... @@ -0,0 +1,120 @@
  1 +1. 目前检索系统评测的主要指标是这几个
  2 + "NDCG@20, NDCG@50, ERR@10, Strong_Precision@10, Strong_Precision@20, "
  3 +参考_err_at_k,计算逻辑好像没问题
  4 +现在的问题是,ERR 指标跟其他几个指标好像经常有相反的趋势。请再分析它是否适合作为主指标之一,目前有什么问题。
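作为参考,ERR 的标准级联模型可以解释它为什么会与 NDCG 走势相反:ERR 由排在最前面的高相关结果主导,一旦第 1~2 位命中强相关,后面位置的变化几乎不影响得分;而 NDCG@20/@50 对整个窗口内的排序都敏感。下面是一个按常见定义(R_i = (2^g − 1) / 2^max_grade)写的参考实现,仓库里 `_err_at_k` 的具体口径可能不同,仅供对照:

```python
def err_at_k(grades, k, max_grade):
    """Expected Reciprocal Rank(级联模型)参考实现。
    grades: 按排序位置给出的相关性等级列表;映射 R_i = (2^g_i - 1) / 2^max_grade。
    注意:这只是通用定义的示意,仓库中 _err_at_k 的实现口径可能不同。"""
    err = 0.0
    p_not_stopped = 1.0  # 用户没有在更靠前的结果处停下的概率
    for rank, g in enumerate(grades[:k], start=1):
        r = (2 ** g - 1) / (2 ** max_grade)
        err += p_not_stopped * r / rank
        p_not_stopped *= (1 - r)
    return err
```

可以看到,强相关结果从第 1 位移到第 3 位会显著拉低 ERR,但同样的交换对 NDCG@50 的影响要小得多;这正是两类指标可能出现相反趋势的来源。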
  5 +
  6 +2. 目前bm25参数是:
  7 +"b": 0.1,
  8 +"k1": 0.3
  9 +对应的基线是 /data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260408T055948Z_00b6a8aa3d.md(Primary_Metric_Score: 0.604555)
  12 +
  13 +(比之前 b 和 k1 都设置为 0 好了很多;都设置为 0 时的报告:/data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260407T150946Z_00b6a8aa3d.md,Primary_Metric_Score: 0.602598)
  17 +
  18 +这两个参数从0改为0.1/0.3的背景是:
  19 +This change adjusts the BM25 parameters used by the combined query.
  20 +
  21 +Previously, both `b` and `k1` were set to `0.0`. The original intention was to avoid two common issues in e-commerce search relevance:
  22 +
  23 +1. Over-penalizing longer product titles
  24 + In product search, a shorter title should not automatically rank higher just because BM25 favors shorter fields. For example, for a query like “遥控车”, a product whose title is simply “遥控车” is not necessarily a better candidate than a product with a slightly longer but more descriptive title. In practice, extremely short titles may even indicate lower-quality catalog data.
  25 +
  26 +2. Over-rewarding repeated occurrences of the same term
  27 + For longer queries such as “遥控喷雾翻滚多功能车玩具车”, the default BM25 behavior may give too much weight to a term that appears multiple times (for example “遥控”), even when other important query terms such as “喷雾” or “翻滚” are missing. This can cause products with repeated partial matches to outrank products that actually cover more of the user intent.
  28 +
  29 +Setting both parameters to zero was an intentional way to suppress length normalization and term-frequency amplification. However, after introducing a `combined_fields` query, this configuration becomes too aggressive. Since `combined_fields` scores multiple fields as a unified relevance signal, completely disabling both effects may also remove useful ranking information, especially when we still want documents matching more query terms across fields to be distinguishable from weaker matches.
  30 +
  31 +This update therefore relaxes the previous setting and reintroduces a controlled amount of BM25 normalization/scoring behavior. The goal is to keep the original intent — avoiding short-title bias and excessive repeated-term gain — while allowing the combined query to better preserve meaningful relevance differences across candidates.
  32 +
  33 +Expected effect:
  34 +- reduce the bias toward unnaturally short product titles
  35 +- limit score inflation caused by repeated occurrences of the same term
  36 +- improve ranking stability for `combined_fields` queries
  37 +- better reward candidates that cover more of the overall query intent, instead of those that only repeat a subset of terms
  38 +
  39 +
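为了直观看出 b 与 k1 分别控制什么,可以只看 Lucene BM25 中与词频、文档长度相关的那一项(不含 idf):tf·(k1+1) / (tf + k1·(1 − b + b·dl/avgdl))。下面这个小函数仅用于说明参数作用,并非检索系统内的实现:

```python
def bm25_tf_weight(tf, k1, b, doc_len, avg_doc_len):
    """Lucene BM25 的 TF/长度归一部分(不含 idf)。
    k1=0 时词频贡献退化为常数(重复出现不加分);
    b=0 时文档长度完全不参与归一(长短标题同权)。"""
    norm = 1 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + k1 * norm)
```

由此可见:k1=0、b=0 正是上文「完全抑制词频放大与长度归一」的极端设定;把两者放宽到 0.5/1.0,既保留对重复词的饱和抑制,也让覆盖更多 query 词的候选重新拉开区分度。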
  40 +因为实验有效,因此帮我继续进行实验
  41 +
  42 +请帮我再进行这四轮实验,对比效果,优化bm25参数:
  43 +{ "b": 0.10, "k1": 0.30 }
  44 +{ "b": 0.20, "k1": 0.60 }
  45 +{ "b": 0.50, "k1": 1.0 }
  46 +{ "b": 0.10, "k1": 0.75 }
  47 +
  48 +参考修改索引级设置的方法:( BM25 `similarity.default`)
  49 +
  50 +`mappings/search_products.json` 里的 `settings.similarity` 只在**创建新索引**时生效;**已有索引**需先关闭索引,再 `PUT _settings`,最后重新打开。
  51 +
  52 +**适用场景**:调整默认 BM25 的 `b`、`k1`(例如与仓库映射对齐:`b: 0.1`、`k1: 0.3`)。
  53 +
  54 +```bash
  55 +# 按需替换:索引名、账号密码、ES 地址
  56 +INDEX="search_products_tenant_163"
  57 +AUTH='saas:4hOaLaf41y2VuI8y'
  58 +ES="http://localhost:9200"
  59 +
  60 +# 1) 关闭索引(写入类请求会失败,注意维护窗口)
  61 +curl -s -u "$AUTH" -X POST "$ES/${INDEX}/_close"
  62 +
  63 +# 2) 更新设置(仅示例:与 mappings 中 default 一致时可照抄)
  64 +curl -s -u "$AUTH" -X PUT "$ES/${INDEX}/_settings" \
  65 + -H 'Content-Type: application/json' \
  66 + -d '{
  67 + "index": {
  68 + "similarity": {
  69 + "default": {
  70 + "type": "BM25",
  71 + "b": 0.1,
  72 + "k1": 0.3
  73 + }
  74 + }
  75 + }
  76 +}'
  77 +
  78 +# 3) 重新打开索引
  79 +curl -s -u "$AUTH" -X POST "$ES/${INDEX}/_open"
  80 +```
  81 +
  82 +**检查是否生效**:
  83 +
  84 +```bash
  85 +curl -s -u "$AUTH" -X GET "$ES/${INDEX}/_settings?filter_path=**.similarity&pretty"
  86 +```
  87 +
  88 +期望在响应中看到 `similarity.default` 的 `type`、`b`、`k1`(API 可能将数值以字符串形式返回,属正常)。
  89 +
  90 +**多租户批量**:先列出索引,再对每个 `search_products_tenant_*` 重复上述 close → settings → open。
  91 +
  92 +```bash
  93 +curl -s -u "$AUTH" -X GET "$ES/_cat/indices/search_products_tenant_*?h=index&v"
  94 +```
  95 +
  96 +
  97 +
  98 +实验结果:
  99 +已完成四轮实验,并把 `search_products_tenant_163` 当前索引参数留在了最优组:`b=0.5, k1=1.0`。刚刚已校验生效。
  100 +
  101 +对比结果如下:
  102 +
  103 +- 基线当前线上组 `b=0.10, k1=0.30`:`0.604555`
  104 + 报告:[batch_20260408T055948Z_00b6a8aa3d.md](/data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260408T055948Z_00b6a8aa3d.md)
  105 +- 复跑 `b=0.10, k1=0.30`:`0.604298`
  106 + 报告:[batch_20260408T064011Z_00b6a8aa3d.json](/data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260408T064011Z_00b6a8aa3d.json)
  107 +- 实验 `b=0.20, k1=0.60`:`0.604274`
  108 + 报告:[batch_20260408T064123Z_00b6a8aa3d.json](/data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260408T064123Z_00b6a8aa3d.json)
  109 +- 实验 `b=0.50, k1=1.0`:`0.606041`
  110 + 报告:[batch_20260408T064216Z_00b6a8aa3d.json](/data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260408T064216Z_00b6a8aa3d.json)
  111 +- 实验 `b=0.10, k1=0.75`:`0.603438`
  112 + 报告:[batch_20260408T064304Z_00b6a8aa3d.json](/data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260408T064304Z_00b6a8aa3d.json)
  113 +- 老基线 `b=0.0, k1=0.0`:`0.602598`
  114 + 报告:[batch_20260407T150946Z_00b6a8aa3d.md](/data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260407T150946Z_00b6a8aa3d.md)
  115 +
  116 +结论是这四组里 `b=0.50, k1=1.0` 最好,相比你给的当前基线 `0.604555` 提升了 `+0.001486`,相比 `0/0` 提升了 `+0.003443`。`0.20/0.60` 基本和 `0.10/0.30` 持平,`0.10/0.75` 反而更差,说明这轮里单独加大 `k1` 不如同时把 `b` 和 `k1` 一起放宽。
  117 +
  118 +补充一点:我只改了现有索引 `search_products_tenant_163` 的 index-level setting,没有改仓库里的默认映射。[search_products.json](/data/saas-search/mappings/search_products.json) 现在仍是 `0.1/0.3`。如果你要,我下一步可以直接帮你把映射默认值也改成 `0.5/1.0`,并顺手批量同步其他 `search_products_tenant_*` 索引。
  119 +
  120 +
... ...
docs/issues/issue-2026-04-12-test-env.md 0 → 100644
... ... @@ -0,0 +1,43 @@
  1 +120.76.41.98 端口22 用户名和密码:
  2 +tw twtw@123 (有sudo权限)
  3 +这台机器上的目录/home/tw/saas-search 已经部署了本项目
  4 +请帮我运行项目
  5 +1. 帮我checkout一个test环境的分支;在这个分支把重排、翻译模型都关闭掉,因为这台机器gpu显存较小(embedding模型可以保留)
  6 +2. 在这个分支,把服务都启动起来
  7 +3. 使用docker,安装一个ES,参考本项目的文档 ES9*.md。因为这台机器已经有一个系统的elasticsearch,为了不相互干扰,将本项目依赖的es9安装到docker,并且在测试环境配置的es地址做适配的工作
  8 +
  9 +
  10 +1. 不是要禁用6005,而是6005端口已经有对应的文本服务了,直接用就行
  11 +2. 6005其实是本项目一个历史早期版本启动的服务,在另外一个目录:/home/tw/SearchEngine,请看它的启动配置
  12 +nohup bash scripts/start_embedding_service.sh > log.start_embedding_service.0412 2>&1 &
  13 +是这样启动起来的
  14 +看它配的文本服务用的是哪套方案、哪个模型,跟它对齐(我指的是当前的测试分支)
  15 +
  16 +
  17 +
  18 +
  19 +
  20 +
  21 +
  22 +我在这个机器上部署了一个测试环境:
  23 +120.76.41.98 端口22 用户名和密码:
  24 +tw twtw@123 (有sudo权限)
  25 +cd /home/tw/saas-search
  26 +$ git branch
  27 +  master
  28 +* test/small-gpu-es9
  29 +
  30 +我希望差异只是:
  31 +1. es配置不同(测试环境要连接到那台机器上一个docker里的es,端口19200)、redis配置不同
  32 +2. reranker关闭、不要启动reranker服务
  33 +
  34 +其余没什么不同。
  35 +
  36 +但是启动有问题,现在翻译报错。
  37 +这体现了当前项目移植性比较差。我希望你检查一下失败原因,先在本地(本机,即当前目录的master分支)优化好、提升移植性,之后同步到那边,保持测试分支跟master只有少量的、配置层面的不同;然后到测试机器把翻译启动起来,最后把整个服务都启动起来。
  38 +
  39 +
  40 +
  41 +
  42 +
  43 +
... ...
docs/issues/issue-2026-04-14-粗排流程放入ES-TODO-env 0 → 100644
... ... @@ -0,0 +1,25 @@
  1 +需求:
  2 +目前160条结果(rerank_window: 160)会进入重排,重排中 文本和图片向量的相关性,都会作为融合公式的因子之一(粗排和reranker都有):
  3 +knn_score
  4 +text_knn
  5 +image_knn
  6 +text_factor
  7 +knn_factor
  8 +但是文本向量召回和图片向量召回,是使用 KNN 索引召回的方式,并不是所有结果都有这两个得分,这两项得分都有为0的。
  9 +为了解决这个问题,可选的思路包括:对最终进入重排的 160 条,找出其中缺失文本/图片向量召回得分的,再通过某种方式让 ES 补算;或者从 ES 把向量拉回来自己算;或者在召回请求 ES 时通过某种设定,确保前面的若干条都带有这两个分数。不知道还有哪些方法,我感觉这些都不太好,请你思考一下。
  10 +
  11 +考虑的一个方案:
  12 +想在“第一次 ES 搜索”里,只对 topN 补向量精算,考虑 rescore 或 retriever.rescorer的方案(官方明确支持多段 rescore/支持 score_mode: multiply,甚至示例里就有 function_score/script_score 放进 rescore 的写法。)
  13 +这意味着你完全可以:
  14 +初检仍然用现在的 lexical + text knn + image knn 召回候选
  15 +对 window_size=160 做 rescore
  16 +用 exact script_score 给 top160 补 text/image vector 分
  17 +顺手把你现在本地 coarse 融合迁回 ES
  18 +
  19 +export ES_AUTH="saas:4hOaLaf41y2VuI8y"
  20 +export ES="http://127.0.0.1:9200"
  21 +"index":"search_products_tenant_163"
  22 +
  23 +有个细节暴露出来了:dotProduct() 这类向量函数在 script_score 评分上下文能用,但在 script_fields 取字段上下文里不认。所以如果我们要把 exact 分顺手回传给 rerank,用 script_fields 的话得自己写数组循环,不能直接调向量内建函数。
  24 +
  25 +重排打分公式需要的base_query base_query_trans_zh knn_query image_knn_query还能不能拿到?请你考虑,尽量想想如何得到这些打分,如果实在拿不到去想替代的办法比如简化打分公式。
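按上面讨论的思路,一个示意性的 rescore 请求体如下。注意:`title`、`text_vector` 等字段名与查询向量均为假设,需按实际 mapping 与召回查询替换;`dotProduct()` 在 `script_score` 评分上下文可用(与上面提到的 `script_fields` 限制不同):

```json
{
  "query": { "match": { "title": "遥控车" } },
  "rescore": {
    "window_size": 160,
    "query": {
      "score_mode": "multiply",
      "rescore_query": {
        "script_score": {
          "query": { "match_all": {} },
          "script": {
            "source": "1.0 + dotProduct(params.qv, 'text_vector')",
            "params": { "qv": [0.1, 0.2, 0.3] }
          }
        }
      }
    }
  }
}
```

图片向量可以再加一段 rescore(官方支持多段),或在同一 script 里合并两个 dotProduct;粗排融合公式如何迁入、以及 base_query_trans_zh 等分量能否拿到,仍是上面待讨论的问题。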
... ...
docs/工作总结-微服务性能优化与架构.md
... ... @@ -98,7 +98,7 @@ instruction: "Given a shopping query, rank product titles by relevance"
98 98 **能力**:支持根据商品标题批量生成 **qanchors**(锚文本)、**enriched_attributes**、**tags**,供索引与 suggest 使用。
99 99  
100 100 **具体内容**:
101   -- **接口**:`POST /indexer/enrich-content`(Indexer 服务端口 **6004**)。请求体为 `items` 数组,每项含 `spu_id`、`title`(必填)及可选多语言标题等;单次请求最多 **50 条**,建议批量调用。响应 `results` 与 `items` 一一对应,每项含 `spu_id`、`qanchors`(按语言键,如 `qanchors.zh`、`qanchors.en`,逗号分隔短语)、`enriched_attributes`、`tags`。
  101 +- **接口**:`POST /indexer/enrich-content`(FacetAwareMatching 服务端口 **6001**)。请求体为 `items` 数组,每项含 `spu_id`、`title`(必填)及可选多语言标题等;单次请求最多 **50 条**,建议批量调用。响应 `results` 与 `items` 一一对应,每项含 `spu_id`、`qanchors`(按语言键,如 `qanchors.zh`、`qanchors.en`,逗号分隔短语)、`enriched_attributes`、`tags`。
102 102 - **索引侧**:微服务组合方式下,调用方先拿不含 qanchors/tags 的 doc,再调用本接口补齐后写入 ES 的 `qanchors.{lang}` 等字段;索引 transformer(`indexer/document_transformer.py`、`indexer/product_enrich.py`)内也可在构建 doc 时调用内容理解逻辑,写入 `qanchors.{lang}`。
103 103 - **Suggest 侧**:`suggestion/builder.py` 从 ES 商品索引读取 `_source: ["id", "spu_id", "title", "qanchors"]`,对 `qanchors.{lang}` 用 `_split_qanchors` 拆成词条,以 `source="qanchor"` 加入候选,排序时 `qanchor` 权重大于纯 title(`add_product("qanchor", ...)`);suggest 配置中 `sources: ["query_log", "qanchor"]` 表示候选来源包含 qanchor。
104 104 - **实现与依赖**:内容理解内部使用大模型(需 `DASHSCOPE_API_KEY`),支持多语言与 Redis 缓存(如 `product_anchors`);逻辑与 `indexer/product_enrich` 一致。
... ... @@ -129,12 +129,12 @@ instruction: "Given a shopping query, rank product titles by relevance"
129 129 - 可选:embedding(text) **6005**、embedding-image **6008**、translator **6006**、reranker **6007**、tei **8080**、cnclip **51000**。
130 130 - 端口可由环境变量覆盖:`API_PORT`、`INDEXER_PORT`、`FRONTEND_PORT`、`EVAL_WEB_PORT`、`EMBEDDING_TEXT_PORT`、`EMBEDDING_IMAGE_PORT`、`TRANSLATION_PORT`、`RERANKER_PORT`、`TEI_PORT`、`CNCLIP_PORT`。
131 131 - **命令**:
132   - - `./scripts/service_ctl.sh start [service...]` 或 `up all` / `start all`(all 含 tei、cnclip、embedding、embedding-image、translator、reranker、reranker-fine、backend、indexer、frontend、eval-web,按依赖顺序);`stop`、`restart`、`down` 同参数;`status` 默认列出所有服务。
  132 + - `./scripts/service_ctl.sh start [service...]` 或 `up all` / `start all`(all 含 tei、cnclip、embedding、embedding-image、translator、reranker、backend、indexer、frontend、eval-web,按依赖顺序);`stop`、`restart`、`down` 同参数;`status` 默认列出所有服务。
133 133 - 启动时:backend/indexer/frontend/embedding/translator/reranker 会写 pid 到 `logs/<service>.pid`,并执行 `wait_for_health`(GET `http://127.0.0.1:<port>/health`);reranker 健康重试 90 次,其余 30 次;TEI 校验 Docker 容器存在且 `/health` 成功;cnclip 无 HTTP 健康则仅校验进程/端口。
134 134 - **监控常驻**:
135 135 - `./scripts/service_ctl.sh monitor-start <targets>` 启动后台监控进程,将 targets 写入 `logs/service-monitor.targets`,pid 写入 `logs/service-monitor.pid`,日志追加到 `logs/service-monitor.log`。
136   - - 轮询间隔 `MONITOR_INTERVAL_SEC` 默认 **10** 秒;连续 **3** 次(`MONITOR_FAIL_THRESHOLD`)健康失败则触发重启;重启冷却 `MONITOR_RESTART_COOLDOWN_SEC` 默认 **30** 秒;每小时最多重启 `MONITOR_MAX_RESTARTS_PER_HOUR` 默认 **6** 次;超限时调用 `scripts/wechat_alert.py` 告警(若存在)。
137   -- **日志**:各服务按日滚动到 `logs/<service>-<date>.log`,通过 `scripts/daily_log_router.sh` 与 `LOG_RETENTION_DAYS`(默认 30)控制保留。
  136 + - 轮询间隔 `MONITOR_INTERVAL_SEC` 默认 **10** 秒;连续 **3** 次(`MONITOR_FAIL_THRESHOLD`)健康失败则触发重启;重启冷却 `MONITOR_RESTART_COOLDOWN_SEC` 默认 **30** 秒;每小时最多重启 `MONITOR_MAX_RESTARTS_PER_HOUR` 默认 **6** 次;超限时调用 `scripts/ops/wechat_alert.py` 告警(若存在)。
  137 +- **日志**:各服务按日滚动到 `logs/<service>-<date>.log`,通过 `scripts/ops/daily_log_router.sh` 与 `LOG_RETENTION_DAYS`(默认 30)控制保留。
138 138  
139 139 详见:`scripts/service_ctl.sh` 内注释及 `docs/Usage-Guide.md`。
140 140  
... ... @@ -153,12 +153,12 @@ instruction: "Given a shopping query, rank product titles by relevance"
153 153  
154 154 ## 三、性能测试报告摘要
155 155  
156   -以下数据来自 `docs/性能测试报告.md`,测试时间 **2026-03-12**,环境:**8 vCPU**(Intel Xeon Platinum 8255C @ 2.50GHz)、**约 15Gi 可用内存**;租户 **162** 文档数约 **53**(search/search/suggestions/rerank 与文档规模相关)。压测工具:`scripts/perf_api_benchmark.py`,场景×并发矩阵,每档 **20s**。
  156 +以下数据来自 `docs/性能测试报告.md`,测试时间 **2026-03-12**,环境:**8 vCPU**(Intel Xeon Platinum 8255C @ 2.50GHz)、**约 15Gi 可用内存**;租户 **162** 文档数约 **53**(search/search/suggestions/rerank 与文档规模相关)。压测工具:`benchmarks/perf_api_benchmark.py`,场景×并发矩阵,每档 **20s**。
157 157  
158 158 **复现命令(四场景×四并发)**:
159 159 ```bash
160 160 cd /data/saas-search
161   -.venv/bin/python scripts/perf_api_benchmark.py \
  161 +.venv/bin/python benchmarks/perf_api_benchmark.py \
162 162 --scenario backend_search,backend_suggest,embed_text,rerank \
163 163 --concurrency-list 1,5,10,20 \
164 164 --duration 20 \
... ... @@ -188,7 +188,7 @@ cd /data/saas-search
188 188  
189 189 口径:query 固定 `wireless mouse`,每次请求 **386 docs**,句长 15–40 词随机(从 1000 词池采样);配置 `rerank_window=384`。复现命令:
190 190 ```bash
191   -.venv/bin/python scripts/perf_api_benchmark.py \
  191 +.venv/bin/python benchmarks/perf_api_benchmark.py \
192 192 --scenario rerank --duration 20 --concurrency-list 1,5,10,20 --timeout 60 \
193 193 --rerank-dynamic-docs --rerank-doc-count 386 --rerank-vocab-size 1000 \
194 194 --rerank-sentence-min-words 15 --rerank-sentence-max-words 40 \
... ... @@ -217,7 +217,7 @@ cd /data/saas-search
217 217 | 10 | 181 | 100% | 8.78 | 1129.23| 1295.88| 1330.96|
218 218 | 20 | 161 | 100% | 7.63 | 2594.00| 4706.44| 4783.05|
219 219  
220   -**结论**:吞吐约 **8 rps** 平台化,延迟随并发上升明显,符合“检索 + 向量 + 重排”重链路特征。多租户补测(文档数 500–10000,见报告 §12)表明:文档数越大,RPS 下降、延迟升高;tenant 0(10000 doc)在并发 20 出现部分 ReadTimeout(成功率 59.02%),需注意 timeout 与容量规划;补测命令示例:`for t in 0 1 2 3 4; do .venv/bin/python scripts/perf_api_benchmark.py --scenario backend_search --concurrency-list 1,5,10,20 --duration 20 --tenant-id $t --output perf_reports/2026-03-12/search_tenant_matrix/tenant_${t}.json; done`。
  220 +**结论**:吞吐约 **8 rps** 平台化,延迟随并发上升明显,符合“检索 + 向量 + 重排”重链路特征。多租户补测(文档数 500–10000,见报告 §12)表明:文档数越大,RPS 下降、延迟升高;tenant 0(10000 doc)在并发 20 出现部分 ReadTimeout(成功率 59.02%),需注意 timeout 与容量规划;补测命令示例:`for t in 0 1 2 3 4; do .venv/bin/python benchmarks/perf_api_benchmark.py --scenario backend_search --concurrency-list 1,5,10,20 --duration 20 --tenant-id $t --output perf_reports/2026-03-12/search_tenant_matrix/tenant_${t}.json; done`。
221 221  
222 222 ---
223 223  
... ... @@ -247,5 +247,5 @@ cd /data/saas-search
247 247  
248 248 **关键文件与复现**:
249 249 - 配置:`config/config.yaml`(services、rerank、query_config)、`.env`(端口与 API Key)。
250   -- 脚本:`scripts/service_ctl.sh`(启停与监控)、`scripts/perf_api_benchmark.py`(压测)、`scripts/build_suggestions.sh`(suggest 构建)。
  250 +- 脚本:`scripts/service_ctl.sh`(启停与监控)、`benchmarks/perf_api_benchmark.py`(压测)、`scripts/build_suggestions.sh`(suggest 构建)。
251 251 - 完整步骤与多租户/rerank 对比见:`docs/性能测试报告.md`。
... ...
docs/常用查询 - ES.md
1   -
2   -
3 1 ## Elasticsearch 排查流程
4 2  
  3 +使用前加载环境变量:
  4 +```bash
  5 +set -a; source .env; set +a
  6 +# 或直接 export
  7 +export ES_AUTH="saas:4hOaLaf41y2VuI8y"
  8 +export ES="http://127.0.0.1:9200"
  9 +```
  10 +
5 11 ### 1. 集群健康状态
6 12  
7 13 ```bash
8 14 # 集群整体健康(green / yellow / red)
9   -curl -s -u 'saas:4hOaLaf41y2VuI8y' 'http://127.0.0.1:9200/_cluster/health?pretty'
  15 +curl -s -u "$ES_AUTH" 'http://127.0.0.1:9200/_cluster/health?pretty'
10 16 ```
11 17  
12 18 ### 2. 索引概览
13 19  
14 20 ```bash
15 21 # 查看所有租户索引状态与体积
16   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/_cat/indices/search_products_tenant_*?v'
  22 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/_cat/indices/search_products_tenant_*?v'
17 23  
18 24 # 或查看全部索引
19   -curl -s -u 'saas:4hOaLaf41y2VuI8y' 'http://127.0.0.1:9200/_cat/indices?v'
  25 +curl -s -u "$ES_AUTH" 'http://127.0.0.1:9200/_cat/indices?v'
20 26 ```
21 27  
22 28 ### 3. 分片分布
23 29  
24 30 ```bash
25 31 # 查看分片在各节点的分布情况
26   -curl -s -u 'saas:4hOaLaf41y2VuI8y' 'http://127.0.0.1:9200/_cat/shards?v'
  32 +curl -s -u "$ES_AUTH" 'http://127.0.0.1:9200/_cat/shards?v'
27 33 ```
28 34  
29 35 ### 4. 分配诊断(如有异常)
30 36  
31 37 ```bash
32 38 # 当 health 非 green 或 shards 状态异常时,定位具体原因
33   -curl -s -u 'saas:4hOaLaf41y2VuI8y' -X POST 'http://127.0.0.1:9200/_cluster/allocation/explain?pretty' \
  39 +curl -s -u "$ES_AUTH" -X POST 'http://127.0.0.1:9200/_cluster/allocation/explain?pretty' \
34 40 -H 'Content-Type: application/json' \
35 41 -d '{"index":"search_products_tenant_163","shard":0,"primary":true}'
36 42 ```
... ... @@ -60,6 +66,54 @@ cat /etc/elasticsearch/elasticsearch.yml
60 66 journalctl -u elasticsearch -f
61 67 ```
62 68  
  69 +### 7. 修改索引级设置(如 BM25 `similarity.default`)
  70 +
  71 +`mappings/search_products.json` 里的 `settings.similarity` 只在**创建新索引**时生效;**已有索引**需先关闭索引,再 `PUT _settings`,最后重新打开。
  72 +
  73 +**适用场景**:调整默认 BM25 的 `b`、`k1`(例如与仓库映射对齐:`b: 0.1`、`k1: 0.3`)。
  74 +
  75 +```bash
  76 +# 按需替换:索引名、账号密码、ES 地址
  77 +INDEX="search_products_tenant_163"
  78 +AUTH="$ES_AUTH"
  79 +ES="http://localhost:9200"
  80 +
  81 +# 1) 关闭索引(写入类请求会失败,注意维护窗口)
  82 +curl -s -u "$AUTH" -X POST "$ES/${INDEX}/_close"
  83 +
  84 +# 2) 更新设置(仅示例:与 mappings 中 default 一致时可照抄)
  85 +curl -s -u "$AUTH" -X PUT "$ES/${INDEX}/_settings" \
  86 + -H 'Content-Type: application/json' \
  87 + -d '{
  88 + "index": {
  89 + "similarity": {
  90 + "default": {
  91 + "type": "BM25",
  92 + "b": 0.1,
  93 + "k1": 0.3
  94 + }
  95 + }
  96 + }
  97 +}'
  98 +
  99 +# 3) 重新打开索引
  100 +curl -s -u "$AUTH" -X POST "$ES/${INDEX}/_open"
  101 +```
  102 +
  103 +**检查是否生效**:
  104 +
  105 +```bash
  106 +curl -s -u "$AUTH" -X GET "$ES/${INDEX}/_settings?filter_path=**.similarity&pretty"
  107 +```
  108 +
  109 +期望在响应中看到 `similarity.default` 的 `type`、`b`、`k1`(API 可能将数值以字符串形式返回,属正常)。
  110 +
  111 +**多租户批量**:先列出索引,再对每个 `search_products_tenant_*` 重复上述 close → settings → open。
  112 +
  113 +```bash
  114 +curl -s -u "$AUTH" -X GET "$ES/_cat/indices/search_products_tenant_*?h=index&v"
  115 +```
  116 +
63 117 ---
64 118  
65 119 ### 快速排查路径
... ... @@ -93,7 +147,7 @@ systemctl / df / 日志 → 系统层验证
93 147  
94 148 #### 查询指定 spu_id 的商品(返回 title)
95 149 ```bash
96   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
  150 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
97 151 "size": 11,
98 152 "_source": ["title"],
99 153 "query": {
... ... @@ -108,7 +162,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
108 162  
109 163 #### 查询所有商品(返回 title)
110 164 ```bash
111   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
  165 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
112 166 "size": 100,
113 167 "_source": ["title"],
114 168 "query": {
... ... @@ -119,7 +173,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
119 173  
120 174 #### 查询指定 spu_id 的商品(返回 title、keywords、tags)
121 175 ```bash
122   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
  176 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
123 177 "size": 5,
124 178 "_source": ["title", "keywords", "tags"],
125 179 "query": {
... ... @@ -134,7 +188,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
134 188  
135 189 #### 组合查询:匹配标题 + 过滤标签
136 190 ```bash
137   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
  191 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
138 192 "size": 1,
139 193 "_source": ["title", "keywords", "tags"],
140 194 "query": {
... ... @@ -158,7 +212,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
158 212  
159 213 #### 组合查询:匹配标题 + 过滤租户(冗余示例)
160 214 ```bash
161   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
  215 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
162 216 "size": 1,
163 217 "_source": ["title"],
164 218 "query": {
... ... @@ -186,7 +240,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
186 240  
187 241 #### 测试 index_ik 分析器
188 242 ```bash
189   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_analyze' -H 'Content-Type: application/json' -d '{
  243 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_analyze' -H 'Content-Type: application/json' -d '{
190 244 "analyzer": "index_ik",
191 245 "text": "14寸第4代-眼珠实身冰雪公仔带手动大推车,搪胶雪宝宝"
192 246 }'
... ... @@ -194,7 +248,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
194 248  
195 249 #### 测试 query_ik 分析器
196 250 ```bash
197   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_analyze' -H 'Content-Type: application/json' -d '{
  251 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_analyze' -H 'Content-Type: application/json' -d '{
198 252 "analyzer": "query_ik",
199 253 "text": "14寸第4代-眼珠实身冰雪公仔带手动大推车,搪胶雪宝宝"
200 254 }'
... ... @@ -206,7 +260,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
206 260  
207 261 #### 多字段匹配 + 聚合(category1、color、size、material)
208 262 ```bash
209   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
  263 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{
210 264 "size": 1,
211 265 "from": 0,
212 266 "query": {
... ... @@ -316,7 +370,7 @@ GET /search_products_tenant_2/_search
316 370  
317 371 #### 按 spu_id 查询(通用索引)
318 372 ```bash
319   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products/_search?pretty' -H 'Content-Type: application/json' -d '{
  373 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products/_search?pretty' -H 'Content-Type: application/json' -d '{
320 374 "size": 5,
321 375 "query": {
322 376 "bool": {
... ... @@ -333,7 +387,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products/_s
333 387 ### 5. 统计租户总文档数
334 388  
335 389 ```bash
336   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_count?pretty' -H 'Content-Type: application/json' -d '{
  390 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_count?pretty' -H 'Content-Type: application/json' -d '{
337 391 "query": {
338 392 "match_all": {}
339 393 }
... ... @@ -348,7 +402,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
348 402  
349 403 #### 1.1 查询特定租户的商品,显示分面相关字段
350 404 ```bash
351   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  405 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
352 406 "query": {
353 407 "term": { "tenant_id": "162" }
354 408 },
... ... @@ -363,7 +417,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
363 417  
364 418 #### 1.2 验证 category1_name 字段是否有数据
365 419 ```bash
366   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  420 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
367 421 "query": {
368 422 "bool": {
369 423 "filter": [
... ... @@ -378,7 +432,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
378 432  
379 433 #### 1.3 验证 specifications 字段是否有数据
380 434 ```bash
381   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  435 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
382 436 "query": {
383 437 "bool": {
384 438 "filter": [
... ... @@ -397,7 +451,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
397 451  
398 452 #### 2.1 category1_name 分面聚合
399 453 ```bash
400   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  454 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
401 455 "query": { "match_all": {} },
402 456 "size": 0,
403 457 "aggs": {
... ... @@ -410,7 +464,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
410 464  
411 465 #### 2.2 specifications.color 分面聚合
412 466 ```bash
413   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  467 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
414 468 "query": { "match_all": {} },
415 469 "size": 0,
416 470 "aggs": {
... ... @@ -431,7 +485,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
431 485  
432 486 #### 2.3 specifications.size 分面聚合
433 487 ```bash
434   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  488 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
435 489 "query": { "match_all": {} },
436 490 "size": 0,
437 491 "aggs": {
... ... @@ -452,7 +506,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
452 506  
453 507 #### 2.4 specifications.material 分面聚合
454 508 ```bash
455   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  509 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
456 510 "query": { "match_all": {} },
457 511 "size": 0,
458 512 "aggs": {
... ... @@ -473,7 +527,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
473 527  
474 528 #### 2.5 综合分面聚合(category + color + size + material)
475 529 ```bash
476   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  530 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
477 531 "query": { "match_all": {} },
478 532 "size": 0,
479 533 "aggs": {
... ... @@ -515,7 +569,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
515 569  
516 570 #### 3.1 查看 specifications 的 name 字段有哪些值
517 571 ```bash
518   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products/_search?pretty' -H 'Content-Type: application/json' -d '{
  572 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products/_search?pretty' -H 'Content-Type: application/json' -d '{
519 573 "query": { "term": { "tenant_id": "162" } },
520 574 "size": 0,
521 575 "aggs": {
... ... @@ -531,7 +585,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products/_s
531 585  
532 586 #### 3.2 查看某个商品的完整 specifications 数据
533 587 ```bash
534   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products/_search?pretty' -H 'Content-Type: application/json' -d '{
  588 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products/_search?pretty' -H 'Content-Type: application/json' -d '{
535 589 "query": {
536 590 "bool": {
537 591 "filter": [
... ... @@ -552,7 +606,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products/_s
552 606 **keyword 精确匹配**(示例词:中文 `法式风格`,英文 `long skirt`)
553 607  
554 608 ```bash
555   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  609 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
556 610 "size": 1,
557 611 "_source": ["spu_id", "title", "enriched_attributes"],
558 612 "query": {
... ... @@ -575,7 +629,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
575 629 **text 全文匹配**(经 `index_ik` / `english` 分词;可与上例对照)
576 630  
577 631 ```bash
578   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  632 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
579 633 "size": 1,
580 634 "_source": ["spu_id", "title", "enriched_attributes"],
581 635 "query": {
... ... @@ -602,7 +656,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
602 656 **keyword 精确匹配**
603 657  
604 658 ```bash
605   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  659 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
606 660 "size": 1,
607 661 "_source": ["spu_id", "title", "option1_values"],
608 662 "query": {
... ... @@ -620,7 +674,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
620 674 **text 全文匹配**
621 675  
622 676 ```bash
623   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  677 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
624 678 "size": 1,
625 679 "_source": ["spu_id", "title", "option1_values"],
626 680 "query": {
... ... @@ -640,7 +694,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
640 694 **keyword 精确匹配**
641 695  
642 696 ```bash
643   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  697 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
644 698 "size": 1,
645 699 "_source": ["spu_id", "title", "enriched_tags"],
646 700 "query": {
... ... @@ -658,7 +712,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
658 712 **text 全文匹配**
659 713  
660 714 ```bash
661   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  715 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
662 716 "size": 1,
663 717 "_source": ["spu_id", "title", "enriched_tags"],
664 718 "query": {
... ... @@ -678,7 +732,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
678 732 > `specifications` 为 **nested**,`value_keyword` 为整词匹配;`value_text.*` 既可对子字段做 `term` 精确匹配,也可对主 text 字段做 `match` 全文匹配。
679 733  
680 734 ```bash
681   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  735 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
682 736 "size": 1,
683 737 "_source": ["spu_id", "title", "specifications"],
684 738 "query": {
... ... @@ -710,7 +764,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
710 764  
711 765 #### 4.1 统计有 category1_name 的文档数量
712 766 ```bash
713   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_count?pretty' -H 'Content-Type: application/json' -d '{
  767 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_count?pretty' -H 'Content-Type: application/json' -d '{
714 768 "query": {
715 769 "bool": {
716 770 "filter": [
... ... @@ -723,7 +777,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
723 777  
724 778 #### 4.2 统计有 specifications 的文档数量
725 779 ```bash
726   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_count?pretty' -H 'Content-Type: application/json' -d '{
  780 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_count?pretty' -H 'Content-Type: application/json' -d '{
727 781 "query": {
728 782 "bool": {
729 783 "filter": [
... ... @@ -740,7 +794,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
740 794  
741 795 #### 5.1 查找没有 category1_name 但有 category 的文档(MySQL 有数据但 ES 没有)
742 796 ```bash
743   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  797 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
744 798 "query": {
745 799 "bool": {
746 800 "filter": [
... ... @@ -758,7 +812,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te
758 812  
759 813 #### 5.2 查找有 option 但没有 specifications 的文档(数据转换问题)
760 814 ```bash
761   -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
  815 +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{
762 816 "query": {
763 817 "bool": {
764 818 "filter": [
... ... @@ -814,7 +868,7 @@ GET search_products_tenant_163/_mapping
814 868 GET search_products_tenant_163/_field_caps?fields=*
815 869  
816 870 ```bash
817   -curl -u 'saas:4hOaLaf41y2VuI8y' -X POST \
  871 +curl -u "$ES_AUTH" -X POST \
818 872 'http://localhost:9200/search_products_tenant_163/_count' \
819 873 -H 'Content-Type: application/json' \
820 874 -d '{
... ... @@ -827,7 +881,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X POST \
827 881 }
828 882 }'
829 883  
830   -curl -u 'saas:4hOaLaf41y2VuI8y' -X POST \
  884 +curl -u "$ES_AUTH" -X POST \
831 885 'http://localhost:9200/search_products_tenant_163/_count' \
832 886 -H 'Content-Type: application/json' \
833 887 -d '{
... ...
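上述 curl 示例统一改用 `$ES_AUTH` 环境变量(约定格式为 `user:password`,例如 `export ES_AUTH='saas:your-password'`)。下面是一个假设性的 Python 小示例,演示如何把该变量解析成 `requests` 可用的 auth 元组;函数名与默认行为均为示意,并非仓库中的真实实现:

```python
import os


def es_auth_from_env(env=None):
    """把 ES_AUTH(形如 "user:password")解析为 (user, password) 元组。

    仅作示意:密码中允许包含 ":",只在第一个冒号处切分;
    未设置时返回空凭据,请求会以 401 失败,便于及时发现配置遗漏。
    """
    env = os.environ if env is None else env
    raw = env.get("ES_AUTH", "")
    user, _, password = raw.partition(":")
    return (user, password)


# 用法示意:requests.get(url, auth=es_auth_from_env(), json={"query": {"match_all": {}}})
```

这样密码不再硬编码在文档与 shell 历史里,也与上面 diff 中把明文凭据替换为 `"$ES_AUTH"` 的改动保持一致。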
docs/性能测试报告.md
... ... @@ -18,13 +18,13 @@
18 18  
19 19 执行方式:
20 20 - 每组压测持续 `20s`
21   -- 使用统一脚本 `scripts/perf_api_benchmark.py`
  21 +- 使用统一脚本 `benchmarks/perf_api_benchmark.py`
22 22 - 通过 `--scenario` 多值 + `--concurrency-list` 一次性跑完 `场景 x 并发`
23 23  
24 24 ## 3. 压测工具优化说明(复用现有脚本)
25 25  
26 26 为了解决原脚本“一次只能跑一个场景+一个并发”的可用性问题,本次直接扩展现有脚本:
27   -- `scripts/perf_api_benchmark.py`
  27 +- `benchmarks/perf_api_benchmark.py`
28 28  
29 29 能力:
30 30 - 一条命令执行 `场景列表 x 并发列表` 全矩阵
... ... @@ -33,7 +33,7 @@
33 33 示例:
34 34  
35 35 ```bash
36   -.venv/bin/python scripts/perf_api_benchmark.py \
  36 +.venv/bin/python benchmarks/perf_api_benchmark.py \
37 37 --scenario backend_search,backend_suggest,embed_text,rerank \
38 38 --concurrency-list 1,5,10,20 \
39 39 --duration 20 \
... ... @@ -106,7 +106,7 @@ curl -sS http://127.0.0.1:6007/health
106 106  
107 107 ```bash
108 108 cd /data/saas-search
109   -.venv/bin/python scripts/perf_api_benchmark.py \
  109 +.venv/bin/python benchmarks/perf_api_benchmark.py \
110 110 --scenario backend_search,backend_suggest,embed_text,rerank \
111 111 --concurrency-list 1,5,10,20 \
112 112 --duration 20 \
... ... @@ -164,7 +164,7 @@ cd /data/saas-search
164 164 复现命令:
165 165  
166 166 ```bash
167   -.venv/bin/python scripts/perf_api_benchmark.py \
  167 +.venv/bin/python benchmarks/perf_api_benchmark.py \
168 168 --scenario rerank \
169 169 --duration 20 \
170 170 --concurrency-list 1,5,10,20 \
... ... @@ -237,7 +237,7 @@ cd /data/saas-search
237 237 - 使用项目虚拟环境执行:
238 238  
239 239 ```bash
240   -.venv/bin/python scripts/perf_api_benchmark.py -h
  240 +.venv/bin/python benchmarks/perf_api_benchmark.py -h
241 241 ```
242 242  
243 243 ### 10.3 某场景成功率下降
... ... @@ -249,7 +249,7 @@ cd /data/saas-search
249 249  
250 250 ## 11. 关联文件
251 251  
252   -- 压测脚本:`scripts/perf_api_benchmark.py`
  252 +- 压测脚本:`benchmarks/perf_api_benchmark.py`
253 253 - 本次结果:`perf_reports/2026-03-12/perf_matrix_report.json`
254 254 - Search 多租户补测:`perf_reports/2026-03-12/search_tenant_matrix/`
255 255 - Reranker 386 docs 口径补测:`perf_reports/2026-03-12/rerank_realistic/rerank_386docs.json`
... ... @@ -280,7 +280,7 @@ cd /data/saas-search
280 280 cd /data/saas-search
281 281 mkdir -p perf_reports/2026-03-12/search_tenant_matrix
282 282 for t in 0 1 2 3 4; do
283   - .venv/bin/python scripts/perf_api_benchmark.py \
  283 + .venv/bin/python benchmarks/perf_api_benchmark.py \
284 284 --scenario backend_search \
285 285 --concurrency-list 1,5,10,20 \
286 286 --duration 20 \
... ...
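上文提到脚本通过 `--scenario` 多值 + `--concurrency-list` 一次性跑完「场景 x 并发」全矩阵,其组合逻辑大致等价于下面这个笛卡尔积草图(仅为示意,实际参数与行为以 `benchmarks/perf_api_benchmark.py -h` 的输出为准):

```python
import itertools


def build_matrix(scenarios, concurrency_list):
    """展开 场景 x 并发 的全矩阵,返回 (scenario, concurrency) 组合列表。"""
    return list(itertools.product(scenarios, concurrency_list))


# 4 个场景 x 4 档并发 = 16 组压测,每组持续 --duration 秒
matrix = build_matrix(
    ["backend_search", "backend_suggest", "embed_text", "rerank"],
    [1, 5, 10, 20],
)
```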
docs/搜索API对接指南-00-总览与快速开始.md
... ... @@ -90,7 +90,7 @@ curl -X POST "http://43.166.252.75:6002/search/" \
90 90 | 查询文档 | POST | `/indexer/documents` | 查询SPU文档数据(不写入ES) |
91 91 | 构建ES文档(正式对接) | POST | `/indexer/build-docs` | 基于上游提供的 MySQL 行数据构建 ES doc,不写入 ES,供 Java 等调用后自行写入 |
92 92 | 构建ES文档(测试用) | POST | `/indexer/build-docs-from-db` | 仅在测试/调试时使用,根据 `tenant_id + spu_ids` 内部查库并构建 ES doc |
93   -| 内容理解字段生成 | POST | `/indexer/enrich-content` | 根据商品标题批量生成 qanchors、enriched_attributes、tags,供微服务组合方式使用 |
  93 +| 内容理解字段生成 | POST | `/indexer/enrich-content` | 根据商品标题批量生成 qanchors、enriched_attributes、tags,供微服务组合方式使用(独立服务端口 6001) |
94 94 | 索引健康检查 | GET | `/indexer/health` | 检查索引服务状态 |
95 95 | 健康检查 | GET | `/admin/health` | 服务健康检查 |
96 96 | 获取配置 | GET | `/admin/config` | 获取租户配置 |
... ... @@ -104,7 +104,6 @@ curl -X POST "http://43.166.252.75:6002/search/" \
104 104 | 向量服务(图片) | 6008 | `POST /embed/image` | 图片向量化 |
105 105 | 翻译服务 | 6006 | `POST /translate` | 文本翻译(支持 qwen-mt / llm / deepl / 本地模型) |
106 106 | 重排服务 | 6007 | `POST /rerank` | 检索结果重排 |
107   -| 内容理解(Indexer 内) | 6004 | `POST /indexer/enrich-content` | 根据商品标题生成 qanchors、tags 等,供 indexer 微服务组合方式使用 |
  107 +| 内容理解(独立服务) | 6001 | `POST /indexer/enrich-content` | 根据商品标题生成 qanchors、tags 等,供 indexer 微服务组合方式使用 |
108 108  
109 109 ---
110   -
... ...
docs/搜索API对接指南-05-索引接口(Indexer).md
... ... @@ -13,7 +13,7 @@
13 13 | 查询文档 | POST | `/indexer/documents` | 按 SPU ID 列表查询 ES 文档,不写入 ES |
14 14 | 构建 ES 文档(正式) | POST | `/indexer/build-docs` | 由上游提供 MySQL 行数据,返回 ES-ready 文档,不写 ES |
15 15 | 构建 ES 文档(测试) | POST | `/indexer/build-docs-from-db` | 由本服务查库并构建文档,仅测试/调试用 |
16   -| 内容理解字段生成 | POST | `/indexer/enrich-content` | 根据商品标题批量生成 qanchors、enriched_attributes、tags(供微服务组合方式使用) |
  16 +| 内容理解字段生成 | POST | `/indexer/enrich-content` | 根据商品标题批量生成 qanchors、enriched_attributes、tags(供微服务组合方式使用;独立服务端口 6001) |
17 17 | 索引健康检查 | GET | `/indexer/health` | 检查索引服务与数据库连接状态 |
18 18  
19 19 #### 5.0 支撑外部 indexer 的三种方式
... ... @@ -23,7 +23,7 @@
23 23 | 方式 | 说明 | 适用场景 |
24 24 |------|------|----------|
25 25 | **1)doc 填充接口** | 调用 `POST /indexer/build-docs` 或 `POST /indexer/build-docs-from-db`,由本服务基于 MySQL 行数据构建完整 ES 文档(含多语言、向量、规格等),**不写入 ES**,由调用方自行写入。 | 希望一站式拿到 ES-ready doc,由己方控制写 ES 的时机与索引名。 |
26   -| **2)微服务组合** | 单独调用**翻译**、**向量化**、**内容理解字段生成**等接口,由 indexer 程序自己组装 doc 并写入 ES。翻译与向量化为独立微服务(见第 7 节);内容理解为 Indexer 服务内接口 `POST /indexer/enrich-content`。 | 需要灵活编排、或希望将 LLM/向量等耗时步骤与主链路解耦(如异步补齐 qanchors/tags)。 |
  26 +| **2)微服务组合** | 单独调用**翻译**、**向量化**、**内容理解字段生成**等接口,由 indexer 程序自己组装 doc 并写入 ES。翻译与向量化为独立微服务(见第 7 节);内容理解为 FacetAwareMatching 独立服务接口 `POST /indexer/enrich-content`(端口 6001)。 | 需要灵活编排、或希望将 LLM/向量等耗时步骤与主链路解耦(如异步补齐 qanchors/tags)。 |
27 27 | **3)本服务直接写 ES** | 调用全量索引 `POST /indexer/reindex`、增量索引 `POST /indexer/index`(指定 SPU ID 列表),由本服务从 MySQL 拉数并直接写入 ES。 | 自建运维、联调或不需要由 Java 写 ES 的场景。 |
28 28  
29 29 - **方式 1** 与 **方式 2** 下,ES 的写入方均为外部 indexer(或 Java),职责清晰。
... ... @@ -498,7 +498,7 @@ curl -X GET "http://localhost:6004/indexer/health"
498 498  
499 499 #### 请求示例(完整 curl)
500 500  
501   -> 完整请求体参考 `scripts/test_build_docs_api.py` 中的 `build_sample_request()`。
  501 +> 完整请求体参考 `tests/manual/test_build_docs_api.py` 中的 `build_sample_request()`。
502 502  
503 503 ```bash
504 504 # 单条 SPU 示例(含 spu、skus、options)
... ... @@ -648,13 +648,38 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \
648 648 ### 5.8 内容理解字段生成接口
649 649  
650 650 - **端点**: `POST /indexer/enrich-content`
651   -- **描述**: 根据商品内容信息批量生成 **qanchors**(锚文本)、**enriched_attributes**(语义属性)、**enriched_tags**(细分标签),供外部 indexer 在「微服务组合」方式下自行拼装 doc 时使用。请求以 `items[]` 传入商品内容字段(必填/可选见下表)。接口只暴露商品内容输入,语言选择、分析维度与最终字段结构统一由 `indexer.product_enrich` 内部决定;当前返回结果与 `search_products` mapping 保持一致。单次请求在线程池中执行,避免阻塞其他接口。
  651 +- **服务**: FacetAwareMatching 独立服务(默认端口 **6001**;由 `/data/FacetAwareMatching/scripts/service_ctl.sh` 管理)
  652 +- **描述**: 根据商品内容信息批量生成 **qanchors**(锚文本)、**enriched_attributes**(通用语义属性)、**enriched_tags**(细分标签)、**enriched_taxonomy_attributes**(taxonomy 结构化属性),供外部 indexer 在「微服务组合」方式下自行拼装 doc 时使用。请求以 `items[]` 传入商品内容字段(必填/可选见下表)。接口只暴露商品内容输入,语言选择、分析维度与最终字段结构统一由 FacetAwareMatching 的 `product_enrich` 内部决定;当前返回结果与 `search_products` mapping 保持一致。单次请求在线程池中执行,避免阻塞其他接口。
  653 +
  654 +当前支持的 `category_taxonomy_profile`:
  655 +- `apparel`
  656 +- `3c`
  657 +- `bags`
  658 +- `pet_supplies`
  659 +- `electronics`
  660 +- `outdoor`
  661 +- `home_appliances`
  662 +- `home_living`
  663 +- `wigs`
  664 +- `beauty`
  665 +- `accessories`
  666 +- `toys`
  667 +- `shoes`
  668 +- `sports`
  669 +- `others`
  670 +
  671 +说明:
  672 +- 所有 profile 的 `enriched_taxonomy_attributes.value` 都统一返回 `zh` + `en`。
  673 +- 外部调用 `/indexer/enrich-content` 时,以请求中的 `category_taxonomy_profile` 为准。
  674 +- 若 indexer 内部仍接入内容理解能力,taxonomy profile 请在调用侧显式传入(建议仍以租户行业配置为准)。
652 675  
653 676 #### 请求参数
654 677  
655 678 ```json
656 679 {
657 680 "tenant_id": "170",
  681 + "enrichment_scopes": ["generic", "category_taxonomy"],
  682 + "category_taxonomy_profile": "apparel",
658 683 "items": [
659 684 {
660 685 "spu_id": "223167",
... ... @@ -675,6 +700,8 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \
675 700 | 参数 | 类型 | 必填 | 默认值 | 说明 |
676 701 |------|------|------|--------|------|
677 702 | `tenant_id` | string | Y | - | 租户 ID。目前仅用于记录日志,不产生实际作用|
  703 +| `enrichment_scopes` | array[string] | N | `["generic", "category_taxonomy"]` | 选择要执行的增强范围。`generic` 生成 `qanchors`/`enriched_tags`/`enriched_attributes`,`category_taxonomy` 生成 `enriched_taxonomy_attributes` |
  704 +| `category_taxonomy_profile` | string | N | `apparel` | 品类 taxonomy profile。支持:`apparel`、`3c`、`bags`、`pet_supplies`、`electronics`、`outdoor`、`home_appliances`、`home_living`、`wigs`、`beauty`、`accessories`、`toys`、`shoes`、`sports`、`others` |
678 705 | `items` | array | Y | - | 待分析列表;**单次最多 50 条** |
679 706  
680 707 `items[]` 字段说明:
... ... @@ -683,21 +710,24 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \
683 710 |------|------|------|------|
684 711 | `spu_id` | string | Y | SPU ID,用于回填结果;目前仅用于记录日志,不产生实际作用|
685 712 | `title` | string | Y | 商品标题 |
686   -| `image_url` | string | N | 商品主图 URL;当前会参与内容缓存键,后续可用于图像/多模态内容理解 |
687   -| `brief` | string | N | 商品简介/短描述;当前会参与内容缓存键 |
688   -| `description` | string | N | 商品详情/长描述;当前会参与内容缓存键 |
  713 +| `image_url` | string | N | 商品主图 URL;当前仅透传,暂未参与 prompt 与缓存键,后续可用于图像/多模态内容理解 |
  714 +| `brief` | string | N | 商品简介/短描述;当前会参与 prompt 与缓存键 |
  715 +| `description` | string | N | 商品详情/长描述;当前会参与 prompt 与缓存键 |
689 716  
690 717 缓存说明:
691 718  
692   -- 内容缓存键仅由 `target_lang + items[]` 中会影响内容理解结果的输入文本构成,目前包括:`title`、`brief`、`description`、`image_url` 的规范化内容 hash。
  719 +- 内容缓存按 **增强范围 + taxonomy profile** 拆分;`generic` 与 `category_taxonomy:apparel` 等使用不同缓存命名空间,互不污染、可独立演进。
  720 +- 缓存键由 `analysis_kind + target_lang + prompt/schema 版本指纹 + prompt 输入文本 hash` 构成;对 category taxonomy 来说,profile 会进入 schema 标识与版本指纹。
  721 +- 当前真正参与 prompt 输入的字段是:`title`、`brief`、`description`;这些字段任一变化,都会落到新的缓存 key。
  722 +- `prompt/schema 版本指纹` 会综合 system prompt、shared instruction、localized table headers、result fields、user instruction template 等信息生成;因此只要提示词或输出契约变化,旧缓存会自然失效。
693 723 - `tenant_id`、`spu_id` 只用于请求归属与结果回填,不参与缓存键。
694   -- 因此,输入内容不变时可跨请求直接命中缓存;任一输入字段变化时,会自然落到新的缓存 key。
  724 +- 因此,输入内容与 prompt 契约都不变时可跨请求直接命中缓存;任一侧变化,都会自然落到新的缓存 key。
695 725  
696 726 语言说明:
697 727  
698 728 - 接口不接受语言控制参数。
699 729 - 返回哪些语言、返回哪些语义维度,统一由 `indexer.product_enrich` 内部逻辑决定。
700   -- 当前为了与 `search_products` mapping 对齐,返回结果只包含核心索引语言 `zh`、`en`。
  730 +- 当前为了与 `search_products` mapping 对齐,通用增强字段与 taxonomy 字段都统一只返回核心索引语言 `zh`、`en`。
701 731  
702 732 批量请求建议:
703 733 - **全量**:强烈建议尽可能将 **20 个 SPU/doc** 攒成一个批次后再请求一次。
... ... @@ -709,6 +739,8 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \
709 739 ```json
710 740 {
711 741 "tenant_id": "170",
  742 + "enrichment_scopes": ["generic", "category_taxonomy"],
  743 + "category_taxonomy_profile": "apparel",
712 744 "total": 2,
713 745 "results": [
714 746 {
... ... @@ -725,6 +757,11 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \
725 757 { "name": "enriched_tags", "value": { "zh": "纯棉" } },
726 758 { "name": "usage_scene", "value": { "zh": "日常" } },
727 759 { "name": "enriched_tags", "value": { "en": "cotton" } }
  760 + ],
  761 + "enriched_taxonomy_attributes": [
  762 + { "name": "Product Type", "value": { "zh": ["T恤"], "en": ["t-shirt"] } },
  763 + { "name": "Target Gender", "value": { "zh": ["男"], "en": ["men"] } },
  764 + { "name": "Season", "value": { "zh": ["夏季"], "en": ["summer"] } }
728 765 ]
729 766 },
730 767 {
... ... @@ -735,7 +772,8 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \
735 772 "enriched_tags": {
736 773 "en": ["dolls", "toys"]
737 774 },
738   - "enriched_attributes": []
  775 + "enriched_attributes": [],
  776 + "enriched_taxonomy_attributes": []
739 777 }
740 778 ]
741 779 }
... ... @@ -743,10 +781,13 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \
743 781  
744 782 | 字段 | 类型 | 说明 |
745 783 |------|------|------|
746   -| `results` | array | 与请求 `items` 一一对应,每项含 `spu_id`、`qanchors`、`enriched_attributes`、`enriched_tags` |
  784 +| `enrichment_scopes` | array | 实际执行的增强范围列表 |
  785 +| `category_taxonomy_profile` | string | 实际使用的品类 taxonomy profile |
  786 +| `results` | array | 与请求 `items` 一一对应,每项含 `spu_id`、`qanchors`、`enriched_attributes`、`enriched_tags`、`enriched_taxonomy_attributes` |
747 787 | `results[].qanchors` | object | 与 ES `qanchors` 字段同结构,按语言键返回短语数组 |
748 788 | `results[].enriched_tags` | object | 与 ES `enriched_tags` 字段同结构,按语言键返回标签数组 |
749 789 | `results[].enriched_attributes` | array | 与 ES `enriched_attributes` nested 字段同结构,每项为 `{ "name", "value": { "zh"?: "...", "en"?: "..." } }` |
  790 +| `results[].enriched_taxonomy_attributes` | array | 与 ES `enriched_taxonomy_attributes` nested 字段同结构。每项通常为 `{ "name", "value": { "zh"?: [...], "en"?: [...] } }` |
750 791 | `results[].error` | string | 若该条处理失败(如 LLM 异常),会在此字段返回错误信息 |
751 792  
752 793 **错误响应**:
... ... @@ -756,10 +797,12 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \
756 797 #### 请求示例
757 798  
758 799 ```bash
759   -curl -X POST "http://localhost:6004/indexer/enrich-content" \
  800 +curl -X POST "http://localhost:6001/indexer/enrich-content" \
760 801 -H "Content-Type: application/json" \
761 802 -d '{
762   - "tenant_id": "170",
  803 + "tenant_id": "163",
  804 + "enrichment_scopes": ["generic", "category_taxonomy"],
  805 + "category_taxonomy_profile": "apparel",
763 806 "items": [
764 807 {
765 808 "spu_id": "223167",
... ... @@ -773,4 +816,3 @@ curl -X POST "http://localhost:6004/indexer/enrich-content" \
773 816 ```
774 817  
775 818 ---
776   -
... ...
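按上文建议,`/indexer/enrich-content` 单次最多 50 条,全量时建议每 20 个 SPU/doc 攒一批再请求。下面是一个假设性的分批辅助函数草图(函数名与参数均为示意,仅用于说明批量请求的切分逻辑):

```python
def chunk_items(items, batch_size=20, hard_limit=50):
    """把待分析的 items 列表切成若干批。

    每批不超过 batch_size,且不超过接口硬上限 hard_limit(文档中为 50)。
    """
    size = min(batch_size, hard_limit)
    return [items[i:i + size] for i in range(0, len(items), size)]


# 示意:45 个 SPU 会被切成 20 + 20 + 5 三批,依次 POST 到 /indexer/enrich-content
```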
docs/搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md
... ... @@ -444,7 +444,7 @@ curl "http://localhost:6006/health"
444 444  
445 445 - **Base URL**: Indexer 服务地址,如 `http://localhost:6004`
446 446 - **路径**: `POST /indexer/enrich-content`
447   -- **说明**: 根据商品标题批量生成 `qanchors`、`enriched_attributes`、`tags`,用于拼装 ES 文档。内部使用大模型(需配置 `DASHSCOPE_API_KEY`),支持多语言与 Redis 缓存;单次最多 50 条,建议批量调用以提升效率。
  447 +- **说明**: 根据商品标题批量生成 `qanchors`、`enriched_attributes`、`enriched_tags`、`enriched_taxonomy_attributes`,用于拼装 ES 文档。支持通过 `enrichment_scopes` 选择执行 `generic` / `category_taxonomy`,并通过 `category_taxonomy_profile` 选择对应大类的 taxonomy prompt/profile;默认执行 `generic + category_taxonomy(apparel)`。当前支持的 taxonomy profile 包括 `apparel`、`3c`、`bags`、`pet_supplies`、`electronics`、`outdoor`、`home_appliances`、`home_living`、`wigs`、`beauty`、`accessories`、`toys`、`shoes`、`sports`、`others`。所有 profile 的 taxonomy 输出都统一返回 `zh` + `en`,`category_taxonomy_profile` 只决定字段集合。内部使用大模型(需配置 `DASHSCOPE_API_KEY`),支持多语言与 Redis 缓存;单次最多 50 条,建议批量调用以提升效率。
448 448  
449 449 请求/响应格式、示例及错误码见 [-05-索引接口(Indexer)](./搜索API对接指南-05-索引接口(Indexer).md#58-内容理解字段生成接口)。
450 450  
... ...
docs/搜索API对接指南-10-接口级压测脚本.md
... ... @@ -4,7 +4,7 @@
4 4  
5 5 ## 10. 接口级压测脚本
6 6  
7   -仓库提供统一压测脚本:`scripts/perf_api_benchmark.py`,用于对以下接口做并发压测:
  7 +仓库提供统一压测脚本:`benchmarks/perf_api_benchmark.py`,用于对以下接口做并发压测:
8 8  
9 9 - 后端搜索:`POST /search/`
10 10 - 搜索建议:`GET /search/suggestions`
... ... @@ -18,21 +18,21 @@
18 18  
19 19 ```bash
20 20 # suggest 压测(tenant 162)
21   -python scripts/perf_api_benchmark.py \
  21 +python benchmarks/perf_api_benchmark.py \
22 22 --scenario backend_suggest \
23 23 --tenant-id 162 \
24 24 --duration 30 \
25 25 --concurrency 50
26 26  
27 27 # search 压测
28   -python scripts/perf_api_benchmark.py \
  28 +python benchmarks/perf_api_benchmark.py \
29 29 --scenario backend_search \
30 30 --tenant-id 162 \
31 31 --duration 30 \
32 32 --concurrency 20
33 33  
34 34 # 全链路压测(search + suggest + embedding + translate + rerank)
35   -python scripts/perf_api_benchmark.py \
  35 +python benchmarks/perf_api_benchmark.py \
36 36 --scenario all \
37 37 --tenant-id 162 \
38 38 --duration 60 \
... ... @@ -45,17 +45,16 @@ python scripts/perf_api_benchmark.py \
45 45 可通过 `--cases-file` 覆盖默认请求模板。示例文件:
46 46  
47 47 ```bash
48   -scripts/perf_cases.json.example
  48 +benchmarks/perf_cases.json.example
49 49 ```
50 50  
51 51 执行示例:
52 52  
53 53 ```bash
54   -python scripts/perf_api_benchmark.py \
  54 +python benchmarks/perf_api_benchmark.py \
55 55 --scenario all \
56 56 --tenant-id 162 \
57   - --cases-file scripts/perf_cases.json.example \
  57 + --cases-file benchmarks/perf_cases.json.example \
58 58 --duration 60 \
59 59 --concurrency 40
60 60 ```
61   -
... ...
docs/相关性检索优化说明.md
... ... @@ -330,7 +330,7 @@ python -m pytest -q tests/test_rerank_client.py tests/test_es_query_builder.py t
330 330 ./scripts/service_ctl.sh restart backend
331 331 sleep 3
332 332 ./scripts/service_ctl.sh status backend
333   -./scripts/evaluation/start_eval.sh.sh batch
  333 +./scripts/evaluation/start_eval.sh batch
334 334 ```
335 335  
336 336 评估产物在 `artifacts/search_evaluation/`(如 `search_eval.sqlite3`、`batch_reports/` 下的 JSON/Markdown)。流程与参数说明见 [scripts/evaluation/README.md](../scripts/evaluation/README.md)。
... ... @@ -895,4 +895,3 @@ rerank_score:0.4784
895 895 rerank_score:0.5849
896 896 "zh": "新款女士修身仿旧牛仔短裤 – 休闲性感磨边水洗牛仔短裤,时尚舒",
897 897 "en": "New Women's Slim-fit Vintage Washed Denim Shorts – Casual Sexy Frayed Hem, Fashionable & Comfortable"
898   -
... ...
docs/缓存与Redis使用说明.md
... ... @@ -196,18 +196,25 @@ services:
196 196 - 配置项:
197 197 - `ANCHOR_CACHE_PREFIX = REDIS_CONFIG.get("anchor_cache_prefix", "product_anchors")`
198 198 - `ANCHOR_CACHE_EXPIRE_DAYS = int(REDIS_CONFIG.get("anchor_cache_expire_days", 30))`
199   -- Key 构造函数:`_make_anchor_cache_key(title, target_lang, tenant_id)`
  199 +- Key 构造函数:`_make_analysis_cache_key(product, target_lang, analysis_kind)`
200 200 - 模板:
201 201  
202 202 ```text
203   -{ANCHOR_CACHE_PREFIX}:{tenant_or_global}:{target_lang}:{md5(title)}
  203 +{ANCHOR_CACHE_PREFIX}:{analysis_kind}:{prompt_contract_hash}:{target_lang}:{prompt_input_prefix}{md5(prompt_input)}
204 204 ```
205 205  
206 206 - 字段说明:
207 207 - `ANCHOR_CACHE_PREFIX`:默认 `"product_anchors"`,可通过 `.env` 中的 `REDIS_ANCHOR_CACHE_PREFIX`(若存在)间接配置到 `REDIS_CONFIG`;
208   - - `tenant_or_global`:`tenant_id` 去空白后的字符串,若为空则使用 `"global"`;
  208 + - `analysis_kind`:分析族,目前至少包括 `content` 与 `taxonomy`,两者缓存隔离;
  209 + - `prompt_contract_hash`:基于 system prompt、shared instruction、localized headers、result fields、user instruction template、schema cache version 等生成的短 hash;只要提示词或输出契约变化,缓存会自动失效;
209 210 - `target_lang`:内容理解输出语言,例如 `zh`;
210   - - `md5(title)`:对原始商品标题(UTF-8)做 MD5。
  211 + - `prompt_input_prefix + md5(prompt_input)`:对真正送入 prompt 的商品文本做前缀 + MD5;当前 prompt 输入来自 `title`、`brief`、`description` 的规范化拼接结果。
  212 +
  213 +设计原则:
  214 +
  215 +- 只让**实际影响 LLM 输出**的输入参与 key;
  216 +- 不让 `tenant_id`、`spu_id` 这类“结果归属信息”污染缓存;
  217 +- prompt 或 schema 变更时,不依赖人工清理 Redis,也能自然切换到新 key。
211 218  
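按上面的 Key 模板,可以用下面这个假设性的 Python 草图理解各组成部分如何拼出最终 key;`prompt` 输入的规范化与拼接规则、参数默认值均为示意,实际以 `_make_analysis_cache_key` 的实现为准:

```python
import hashlib


def make_cache_key(title, brief="", description="", *,
                   analysis_kind="content", contract_hash="abc123",
                   target_lang="zh", prefix="product_anchors"):
    """示意:按「前缀:分析族:契约指纹:语言:输入hash」拼缓存 key。

    只有真正进入 prompt 的文本(title/brief/description)参与 hash;
    tenant_id、spu_id 等归属信息不参与,避免污染缓存。
    """
    prompt_input = "\n".join(s.strip() for s in (title, brief, description))
    digest = hashlib.md5(prompt_input.encode("utf-8")).hexdigest()
    return f"{prefix}:{analysis_kind}:{contract_hash}:{target_lang}:{digest}"
```

这样 `title`/`brief`/`description` 任一变化都会落到新 key,而 prompt/schema 变更通过 `contract_hash` 自然使旧缓存失效,与上面的设计原则一致。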
212 219 ### 4.2 Value 与类型
213 220  
... ... @@ -229,6 +236,7 @@ services:
229 236 ```
230 237  
231 238 - 读取时通过 `json.loads(raw)` 还原为 `Dict[str, Any]`。
  239 +- `content` 与 `taxonomy` 的 value 结构随各自 schema 而不同,但都会先通过统一的 normalize 逻辑再写缓存。
232 240  
233 241 ### 4.3 过期策略
234 242  
... ...
embeddings/README.md
... ... @@ -98,10 +98,10 @@
98 98  
99 99 ### 性能与压测(沿用仓库脚本)
100 100  
101   -- 接口级压测(与 `perf_reports/2026-03-12/matrix_report/` 等方法一致):`scripts/perf_api_benchmark.py`
102   - - 示例:`python scripts/perf_api_benchmark.py --scenario embed_text --duration 30 --concurrency 20`
  101 +- 接口级压测(与 `perf_reports/2026-03-12/matrix_report/` 等方法一致):`benchmarks/perf_api_benchmark.py`
  102 + - 示例:`python benchmarks/perf_api_benchmark.py --scenario embed_text --duration 30 --concurrency 20`
103 103 - 文本/图片向量可带 `priority`(与线上 admission 语义一致):`--embed-text-priority 1`、`--embed-image-priority 1`
104   - - 自定义请求模板:`--cases-file scripts/perf_cases.json.example`
  104 + - 自定义请求模板:`--cases-file benchmarks/perf_cases.json.example`
105 105 - 历史矩阵结果与说明见 `perf_reports/2026-03-12/matrix_report/summary.md`。
106 106  
107 107 ### 启动服务
... ...
frontend/static/js/app.js
... ... @@ -316,7 +316,10 @@ async function performSearch(page = 1) {
316 316 document.getElementById('productGrid').innerHTML = '';
317 317  
318 318 try {
319   - const response = await fetch(`${API_BASE_URL}/search/`, {
  319 + const searchUrl = new URL(`${API_BASE_URL}/search/`, window.location.origin);
  320 + searchUrl.searchParams.set('tenant_id', tenantId);
  321 +
  322 + const response = await fetch(searchUrl.toString(), {
320 323 method: 'POST',
321 324 headers: {
322 325 'Content-Type': 'application/json',
... ...
indexer/README.md
... ... @@ -8,7 +8,7 @@
8 8  
9 9 ### 1.1 系统角色划分
10 10  
11   -- **Java 索引程序(/home/tw/saas-server)**
  11 +- **Java 索引程序**
12 12 - 负责“**什么时候、对哪些 SPU 做索引**”(调度 & 触发)。
13 13 - 负责**商品/店铺/类目等基础数据同步**(写 MySQL)。
14 14 - 负责**多租户环境下的全量/增量索引调度**,但不再关心具体 doc 字段细节。
... ...
indexer/Untitled 0 → 100644
... ... @@ -0,0 +1 @@
  1 +taxonomy
0 2 \ No newline at end of file
... ...
indexer/document_transformer.py
... ... @@ -242,6 +242,7 @@ class SPUDocumentTransformer:
242 242 - qanchors.{lang}
243 243 - enriched_tags.{lang}
244 244 - enriched_attributes[].value.{lang}
  245 + - enriched_taxonomy_attributes[].value.{lang}
245 246  
246 247 设计目标:
247 248 - 尽可能攒批调用 LLM;
... ... @@ -273,7 +274,12 @@ class SPUDocumentTransformer:
273 274  
274 275 tenant_id = str(docs[0].get("tenant_id") or "").strip() or None
275 276 try:
276   - results = build_index_content_fields(items=items, tenant_id=tenant_id)
  277 + # TODO: 从数据库读取该 tenant 的真实行业,并据此替换当前默认的 apparel profile。
  278 + results = build_index_content_fields(
  279 + items=items,
  280 + tenant_id=tenant_id,
  281 + category_taxonomy_profile="apparel",
  282 + )
277 283 except Exception as e:
278 284 logger.warning("LLM batch attribute fill failed: %s", e)
279 285 return
... ... @@ -296,6 +302,8 @@ class SPUDocumentTransformer:
296 302 doc["enriched_tags"] = enrichment["enriched_tags"]
297 303 if enrichment.get("enriched_attributes"):
298 304 doc["enriched_attributes"] = enrichment["enriched_attributes"]
  305 + if enrichment.get("enriched_taxonomy_attributes"):
  306 + doc["enriched_taxonomy_attributes"] = enrichment["enriched_taxonomy_attributes"]
299 307 except Exception as e:
300 308 logger.warning("Failed to apply enrichment to doc (spu_id=%s): %s", doc.get("spu_id"), e)
301 309  
... ... @@ -666,6 +674,7 @@ class SPUDocumentTransformer:
666 674  
667 675 tenant_id = doc.get("tenant_id")
668 676 try:
  677 + # TODO: 从数据库读取该 tenant 的真实行业,并据此替换当前默认的 apparel profile。
669 678 results = build_index_content_fields(
670 679 items=[
671 680 {
... ... @@ -677,6 +686,7 @@ class SPUDocumentTransformer:
677 686 }
678 687 ],
679 688 tenant_id=str(tenant_id),
  689 + category_taxonomy_profile="apparel",
680 690 )
681 691 except Exception as e:
682 692 logger.warning("LLM attribute fill failed for SPU %s: %s", spu_id, e)
... ...
indexer/product_enrich.py
... ... @@ -14,10 +14,11 @@ import time
14 14 import hashlib
15 15 import uuid
16 16 import threading
  17 +from dataclasses import dataclass, field
17 18 from collections import OrderedDict
18 19 from datetime import datetime
19 20 from concurrent.futures import ThreadPoolExecutor
20   -from typing import List, Dict, Tuple, Any, Optional
  21 +from typing import List, Dict, Tuple, Any, Optional, FrozenSet
21 22  
22 23 import redis
23 24 import requests
... ... @@ -30,6 +31,7 @@ from indexer.product_enrich_prompts import (
30 31 USER_INSTRUCTION_TEMPLATE,
31 32 LANGUAGE_MARKDOWN_TABLE_HEADERS,
32 33 SHARED_ANALYSIS_INSTRUCTION,
  34 + CATEGORY_TAXONOMY_PROFILES,
33 35 )
34 36  
35 37 # 配置
... ... @@ -144,10 +146,26 @@ if _missing_prompt_langs:
144 146 )
145 147  
146 148  
147   -# 多值字段分隔:英文逗号、中文逗号、顿号,及历史约定的 ; | / 与空白
  149 +# 多值字段分隔
148 150 _MULTI_VALUE_FIELD_SPLIT_RE = re.compile(r"[,、,;|/\n\t]+")
  151 +# 表格单元格中视为「无内容」的占位
  152 +_MARKDOWN_EMPTY_CELL_LITERALS: Tuple[str, ...] = ("-", "–", "—", "none", "null", "n/a", "无")
  153 +_MARKDOWN_EMPTY_CELL_TOKENS_CF: FrozenSet[str] = frozenset(
  154 + lit.casefold() for lit in _MARKDOWN_EMPTY_CELL_LITERALS
  155 +)
  156 +
  157 +def _normalize_markdown_table_cell(raw: Optional[str]) -> str:
  158 + """去除首尾空白;将占位符统一视为空字符串。"""
  159 + s = str(raw or "").strip()
  160 + if not s:
  161 + return ""
  162 + if s.casefold() in _MARKDOWN_EMPTY_CELL_TOKENS_CF:
  163 + return ""
  164 + return s
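
占位符归一化的预期行为可以用一个独立复述版说明(仅示意,与上方实现逻辑一致,但不是仓库内代码):

```python
# 独立复述版:空串与常见占位符(-、–、—、none、null、n/a、无)统一归一为空字符串。
EMPTY_CELL_LITERALS = ("-", "–", "—", "none", "null", "n/a", "无")
EMPTY_CELL_TOKENS_CF = frozenset(lit.casefold() for lit in EMPTY_CELL_LITERALS)

def normalize_cell(raw):
    s = str(raw or "").strip()
    # casefold 保证 "N/A" / "None" 等大小写变体也命中占位符集合
    if not s or s.casefold() in EMPTY_CELL_TOKENS_CF:
        return ""
    return s
```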
149 165 _CORE_INDEX_LANGUAGES = ("zh", "en")
150   -_ANALYSIS_ATTRIBUTE_FIELD_MAP = (
  166 +_DEFAULT_ENRICHMENT_SCOPES = ("generic", "category_taxonomy")
  167 +_DEFAULT_CATEGORY_TAXONOMY_PROFILE = "apparel"
  168 +_CONTENT_ANALYSIS_ATTRIBUTE_FIELD_MAP = (
151 169 ("tags", "enriched_tags"),
152 170 ("target_audience", "target_audience"),
153 171 ("usage_scene", "usage_scene"),
... ... @@ -156,7 +174,7 @@ _ANALYSIS_ATTRIBUTE_FIELD_MAP = (
156 174 ("material", "material"),
157 175 ("features", "features"),
158 176 )
159   -_ANALYSIS_RESULT_FIELDS = (
  177 +_CONTENT_ANALYSIS_RESULT_FIELDS = (
160 178 "title",
161 179 "category_path",
162 180 "tags",
... ... @@ -168,7 +186,7 @@ _ANALYSIS_RESULT_FIELDS = (
168 186 "features",
169 187 "anchor_text",
170 188 )
171   -_ANALYSIS_MEANINGFUL_FIELDS = (
  189 +_CONTENT_ANALYSIS_MEANINGFUL_FIELDS = (
172 190 "tags",
173 191 "target_audience",
174 192 "usage_scene",
... ... @@ -178,9 +196,111 @@ _ANALYSIS_MEANINGFUL_FIELDS = (
178 196 "features",
179 197 "anchor_text",
180 198 )
181   -_ANALYSIS_FIELD_ALIASES = {
  199 +_CONTENT_ANALYSIS_FIELD_ALIASES = {
182 200 "tags": ("tags", "enriched_tags"),
183 201 }
  202 +_CONTENT_ANALYSIS_QUALITY_FIELDS = ("title", "category_path", "anchor_text")
  203 +
  204 +
  205 +@dataclass(frozen=True)
  206 +class AnalysisSchema:
  207 + name: str
  208 + shared_instruction: str
  209 + markdown_table_headers: Dict[str, List[str]]
  210 + result_fields: Tuple[str, ...]
  211 + meaningful_fields: Tuple[str, ...]
  212 + cache_version: str = "v1"
  213 + field_aliases: Dict[str, Tuple[str, ...]] = field(default_factory=dict)
  214 + quality_fields: Tuple[str, ...] = ()
  215 +
  216 + def get_headers(self, target_lang: str) -> Optional[List[str]]:
  217 + return self.markdown_table_headers.get(target_lang)
  218 +
  219 +
  220 +_ANALYSIS_SCHEMAS: Dict[str, AnalysisSchema] = {
  221 + "content": AnalysisSchema(
  222 + name="content",
  223 + shared_instruction=SHARED_ANALYSIS_INSTRUCTION,
  224 + markdown_table_headers=LANGUAGE_MARKDOWN_TABLE_HEADERS,
  225 + result_fields=_CONTENT_ANALYSIS_RESULT_FIELDS,
  226 + meaningful_fields=_CONTENT_ANALYSIS_MEANINGFUL_FIELDS,
  227 + cache_version="v2",
  228 + field_aliases=_CONTENT_ANALYSIS_FIELD_ALIASES,
  229 + quality_fields=_CONTENT_ANALYSIS_QUALITY_FIELDS,
  230 + ),
  231 +}
  232 +
  233 +def _build_taxonomy_profile_schema(profile: str, config: Dict[str, Any]) -> AnalysisSchema:
  234 + return AnalysisSchema(
  235 + name=f"taxonomy:{profile}",
  236 + shared_instruction=config["shared_instruction"],
  237 + markdown_table_headers=config["markdown_table_headers"],
  238 + result_fields=tuple(spec["key"] for spec in config["fields"]),
  239 + meaningful_fields=tuple(spec["key"] for spec in config["fields"]),
  240 + cache_version="v1",
  241 + )
  242 +
  243 +
  244 +_CATEGORY_TAXONOMY_PROFILE_SCHEMAS: Dict[str, AnalysisSchema] = {
  245 + profile: _build_taxonomy_profile_schema(profile, config)
  246 + for profile, config in CATEGORY_TAXONOMY_PROFILES.items()
  247 +}
  248 +
  249 +_CATEGORY_TAXONOMY_PROFILE_ATTRIBUTE_FIELD_MAPS: Dict[str, Tuple[Tuple[str, str], ...]] = {
  250 + profile: tuple((spec["key"], spec["label"]) for spec in config["fields"])
  251 + for profile, config in CATEGORY_TAXONOMY_PROFILES.items()
  252 +}
  253 +
  254 +
  255 +def get_supported_category_taxonomy_profiles() -> Tuple[str, ...]:
  256 + return tuple(_CATEGORY_TAXONOMY_PROFILE_SCHEMAS.keys())
  257 +
  258 +
  259 +def _normalize_category_taxonomy_profile(category_taxonomy_profile: Optional[str] = None) -> str:
  260 + profile = str(category_taxonomy_profile or _DEFAULT_CATEGORY_TAXONOMY_PROFILE).strip()
  261 + if profile not in _CATEGORY_TAXONOMY_PROFILE_SCHEMAS:
  262 + supported = ", ".join(get_supported_category_taxonomy_profiles())
  263 + raise ValueError(
  264 + f"Unsupported category_taxonomy_profile: {profile}. Supported profiles: {supported}"
  265 + )
  266 + return profile
  267 +
  268 +
  269 +def _get_analysis_schema(
  270 + analysis_kind: str,
  271 + *,
  272 + category_taxonomy_profile: Optional[str] = None,
  273 +) -> AnalysisSchema:
  274 + if analysis_kind == "content":
  275 + return _ANALYSIS_SCHEMAS["content"]
  276 + if analysis_kind == "taxonomy":
  277 + profile = _normalize_category_taxonomy_profile(category_taxonomy_profile)
  278 + return _CATEGORY_TAXONOMY_PROFILE_SCHEMAS[profile]
  279 + raise ValueError(f"Unsupported analysis_kind: {analysis_kind}")
  280 +
  281 +
  282 +def _get_taxonomy_attribute_field_map(
  283 + category_taxonomy_profile: Optional[str] = None,
  284 +) -> Tuple[Tuple[str, str], ...]:
  285 + profile = _normalize_category_taxonomy_profile(category_taxonomy_profile)
  286 + return _CATEGORY_TAXONOMY_PROFILE_ATTRIBUTE_FIELD_MAPS[profile]
  287 +
  288 +
  289 +def _normalize_enrichment_scopes(
  290 + enrichment_scopes: Optional[List[str]] = None,
  291 +) -> Tuple[str, ...]:
  292 + requested = _DEFAULT_ENRICHMENT_SCOPES if not enrichment_scopes else tuple(enrichment_scopes)
  293 + normalized: List[str] = []
  294 + seen = set()
  295 + for enrichment_scope in requested:
  296 + scope = str(enrichment_scope).strip()
  297 + if scope not in {"generic", "category_taxonomy"}:
  298 + raise ValueError(f"Unsupported enrichment_scope: {scope}")
  299 + if scope in seen:
  300 + continue
  301 + seen.add(scope)
  302 + normalized.append(scope)
  303 + return tuple(normalized)
184 304  
185 305  
186 306 def split_multi_value_field(text: Optional[str]) -> List[str]:
... ... @@ -235,12 +355,12 @@ def _get_product_id(product: Dict[str, Any]) -> str:
235 355 return str(product.get("id") or product.get("spu_id") or "").strip()
236 356  
237 357  
238   -def _get_analysis_field_aliases(field_name: str) -> Tuple[str, ...]:
239   - return _ANALYSIS_FIELD_ALIASES.get(field_name, (field_name,))
  358 +def _get_analysis_field_aliases(field_name: str, schema: AnalysisSchema) -> Tuple[str, ...]:
  359 + return schema.field_aliases.get(field_name, (field_name,))
240 360  
241 361  
242   -def _get_analysis_field_value(row: Dict[str, Any], field_name: str) -> Any:
243   - for alias in _get_analysis_field_aliases(field_name):
  362 +def _get_analysis_field_value(row: Dict[str, Any], field_name: str, schema: AnalysisSchema) -> Any:
  363 + for alias in _get_analysis_field_aliases(field_name, schema):
244 364 if alias in row:
245 365 return row.get(alias)
246 366 return None
... ... @@ -261,6 +381,7 @@ def _has_meaningful_value(value: Any) -> bool:
261 381 def _make_empty_analysis_result(
262 382 product: Dict[str, Any],
263 383 target_lang: str,
  384 + schema: AnalysisSchema,
264 385 error: Optional[str] = None,
265 386 ) -> Dict[str, Any]:
266 387 result = {
... ... @@ -268,7 +389,7 @@ def _make_empty_analysis_result(
268 389 "lang": target_lang,
269 390 "title_input": str(product.get("title") or "").strip(),
270 391 }
271   - for field in _ANALYSIS_RESULT_FIELDS:
  392 + for field in schema.result_fields:
272 393 result[field] = ""
273 394 if error:
274 395 result["error"] = error
... ... @@ -279,42 +400,59 @@ def _normalize_analysis_result(
279 400 result: Dict[str, Any],
280 401 product: Dict[str, Any],
281 402 target_lang: str,
  403 + schema: AnalysisSchema,
282 404 ) -> Dict[str, Any]:
283   - normalized = _make_empty_analysis_result(product, target_lang)
  405 + normalized = _make_empty_analysis_result(product, target_lang, schema)
284 406 if not isinstance(result, dict):
285 407 return normalized
286 408  
287 409 normalized["lang"] = str(result.get("lang") or target_lang).strip() or target_lang
288   - normalized["title"] = str(result.get("title") or "").strip()
289   - normalized["category_path"] = str(result.get("category_path") or "").strip()
290 410 normalized["title_input"] = str(
291 411 product.get("title") or result.get("title_input") or ""
292 412 ).strip()
293 413  
294   - for field in _ANALYSIS_RESULT_FIELDS:
295   - if field in {"title", "category_path"}:
296   - continue
297   - normalized[field] = str(_get_analysis_field_value(result, field) or "").strip()
  414 + for field in schema.result_fields:
  415 + normalized[field] = str(_get_analysis_field_value(result, field, schema) or "").strip()
298 416  
299 417 if result.get("error"):
300 418 normalized["error"] = str(result.get("error"))
301 419 return normalized
302 420  
303 421  
304   -def _has_meaningful_analysis_content(result: Dict[str, Any]) -> bool:
305   - return any(_has_meaningful_value(result.get(field)) for field in _ANALYSIS_MEANINGFUL_FIELDS)
  422 +def _has_meaningful_analysis_content(result: Dict[str, Any], schema: AnalysisSchema) -> bool:
  423 + return any(_has_meaningful_value(result.get(field)) for field in schema.meaningful_fields)
  424 +
  425 +
  426 +def _append_analysis_attributes(
  427 + target: List[Dict[str, Any]],
  428 + row: Dict[str, Any],
  429 + lang: str,
  430 + schema: AnalysisSchema,
  431 + field_map: Tuple[Tuple[str, str], ...],
  432 +) -> None:
  433 + for source_name, output_name in field_map:
  434 + raw = _get_analysis_field_value(row, source_name, schema)
  435 + if not raw:
  436 + continue
  437 + _append_named_lang_phrase_map(
  438 + target,
  439 + name=output_name,
  440 + lang=lang,
  441 + raw_value=raw,
  442 + )
306 443  
307 444  
308 445 def _apply_index_content_row(result: Dict[str, Any], row: Dict[str, Any], lang: str) -> None:
309 446 if not row or row.get("error"):
310 447 return
311 448  
312   - anchor_text = str(_get_analysis_field_value(row, "anchor_text") or "").strip()
  449 + content_schema = _get_analysis_schema("content")
  450 + anchor_text = str(_get_analysis_field_value(row, "anchor_text", content_schema) or "").strip()
313 451 if anchor_text:
314 452 _append_lang_phrase_map(result["qanchors"], lang=lang, raw_value=anchor_text)
315 453  
316   - for source_name, output_name in _ANALYSIS_ATTRIBUTE_FIELD_MAP:
317   - raw = _get_analysis_field_value(row, source_name)
  454 + for source_name, output_name in _CONTENT_ANALYSIS_ATTRIBUTE_FIELD_MAP:
  455 + raw = _get_analysis_field_value(row, source_name, content_schema)
318 456 if not raw:
319 457 continue
320 458 _append_named_lang_phrase_map(
... ... @@ -327,6 +465,28 @@ def _apply_index_content_row(result: Dict[str, Any], row: Dict[str, Any], lang:
327 465 _append_lang_phrase_map(result["enriched_tags"], lang=lang, raw_value=raw)
328 466  
329 467  
  468 +def _apply_index_taxonomy_row(
  469 + result: Dict[str, Any],
  470 + row: Dict[str, Any],
  471 + lang: str,
  472 + *,
  473 + category_taxonomy_profile: Optional[str] = None,
  474 +) -> None:
  475 + if not row or row.get("error"):
  476 + return
  477 +
  478 + _append_analysis_attributes(
  479 + result["enriched_taxonomy_attributes"],
  480 + row=row,
  481 + lang=lang,
  482 + schema=_get_analysis_schema(
  483 + "taxonomy",
  484 + category_taxonomy_profile=category_taxonomy_profile,
  485 + ),
  486 + field_map=_get_taxonomy_attribute_field_map(category_taxonomy_profile),
  487 + )
  488 +
  489 +
330 490 def _normalize_index_content_item(item: Dict[str, Any]) -> Dict[str, str]:
331 491 item_id = _get_product_id(item)
332 492 return {
... ... @@ -341,6 +501,8 @@ def _normalize_index_content_item(item: Dict[str, Any]) -> Dict[str, str]:
341 501 def build_index_content_fields(
342 502 items: List[Dict[str, Any]],
343 503 tenant_id: Optional[str] = None,
  504 + enrichment_scopes: Optional[List[str]] = None,
  505 + category_taxonomy_profile: Optional[str] = None,
344 506 ) -> List[Dict[str, Any]]:
345 507 """
346 508 高层入口:生成与 ES mapping 对齐的内容理解字段。
... ... @@ -349,18 +511,23 @@ def build_index_content_fields(
349 511 - `id` 或 `spu_id`
350 512 - `title`
351 513 - 可选 `brief` / `description` / `image_url`
  514 + - 可选 `enrichment_scopes`,默认同时执行 `generic` 与 `category_taxonomy`
  515 + - 可选 `category_taxonomy_profile`,默认 `apparel`
352 516  
353 517 返回项结构:
354 518 - `id`
355 519 - `qanchors`
356 520 - `enriched_tags`
357 521 - `enriched_attributes`
  522 + - `enriched_taxonomy_attributes`
358 523 - 可选 `error`
359 524  
360 525 其中:
361 526 - `qanchors.{lang}` 为短语数组
362 527 - `enriched_tags.{lang}` 为标签数组
363 528 """
  529 + requested_enrichment_scopes = _normalize_enrichment_scopes(enrichment_scopes)
  530 + normalized_taxonomy_profile = _normalize_category_taxonomy_profile(category_taxonomy_profile)
364 531 normalized_items = [_normalize_index_content_item(item) for item in items]
365 532 if not normalized_items:
366 533 return []
... ... @@ -371,32 +538,72 @@ def build_index_content_fields(
371 538 "qanchors": {},
372 539 "enriched_tags": {},
373 540 "enriched_attributes": [],
  541 + "enriched_taxonomy_attributes": [],
374 542 }
375 543 for item in normalized_items
376 544 }
377 545  
378 546 for lang in _CORE_INDEX_LANGUAGES:
379   - try:
380   - rows = analyze_products(
381   - products=normalized_items,
382   - target_lang=lang,
383   - batch_size=BATCH_SIZE,
384   - tenant_id=tenant_id,
385   - )
386   - except Exception as e:
387   - logger.warning("build_index_content_fields failed for lang=%s: %s", lang, e)
388   - for item in normalized_items:
389   - results_by_id[item["id"]].setdefault("error", str(e))
390   - continue
391   -
392   - for row in rows or []:
393   - item_id = str(row.get("id") or "").strip()
394   - if not item_id or item_id not in results_by_id:
  547 + if "generic" in requested_enrichment_scopes:
  548 + try:
  549 + rows = analyze_products(
  550 + products=normalized_items,
  551 + target_lang=lang,
  552 + batch_size=BATCH_SIZE,
  553 + tenant_id=tenant_id,
  554 + analysis_kind="content",
  555 + category_taxonomy_profile=normalized_taxonomy_profile,
  556 + )
  557 + except Exception as e:
  558 + logger.warning("build_index_content_fields content enrichment failed for lang=%s: %s", lang, e)
  559 + for item in normalized_items:
  560 + results_by_id[item["id"]].setdefault("error", str(e))
395 561 continue
396   - if row.get("error"):
397   - results_by_id[item_id].setdefault("error", row["error"])
  562 +
  563 + for row in rows or []:
  564 + item_id = str(row.get("id") or "").strip()
  565 + if not item_id or item_id not in results_by_id:
  566 + continue
  567 + if row.get("error"):
  568 + results_by_id[item_id].setdefault("error", row["error"])
  569 + continue
  570 + _apply_index_content_row(results_by_id[item_id], row=row, lang=lang)
  571 +
  572 + if "category_taxonomy" in requested_enrichment_scopes:
  573 + for lang in _CORE_INDEX_LANGUAGES:
  574 + try:
  575 + taxonomy_rows = analyze_products(
  576 + products=normalized_items,
  577 + target_lang=lang,
  578 + batch_size=BATCH_SIZE,
  579 + tenant_id=tenant_id,
  580 + analysis_kind="taxonomy",
  581 + category_taxonomy_profile=normalized_taxonomy_profile,
  582 + )
  583 + except Exception as e:
  584 + logger.warning(
  585 + "build_index_content_fields taxonomy enrichment failed for profile=%s lang=%s: %s",
  586 + normalized_taxonomy_profile,
  587 + lang,
  588 + e,
  589 + )
  590 + for item in normalized_items:
  591 + results_by_id[item["id"]].setdefault("error", str(e))
398 592 continue
399   - _apply_index_content_row(results_by_id[item_id], row=row, lang=lang)
  593 +
  594 + for row in taxonomy_rows or []:
  595 + item_id = str(row.get("id") or "").strip()
  596 + if not item_id or item_id not in results_by_id:
  597 + continue
  598 + if row.get("error"):
  599 + results_by_id[item_id].setdefault("error", row["error"])
  600 + continue
  601 + _apply_index_taxonomy_row(
  602 + results_by_id[item_id],
  603 + row=row,
  604 + lang=lang,
  605 + category_taxonomy_profile=normalized_taxonomy_profile,
  606 + )
400 607  
401 608 return [results_by_id[item["id"]] for item in normalized_items]
402 609  
... ... @@ -463,52 +670,129 @@ def _build_prompt_input_text(product: Dict[str, Any]) -> str:
463 670 return _truncate_by_words(candidate, PROMPT_INPUT_MAX_WORDS)
464 671  
465 672  
466   -def _make_anchor_cache_key(
  673 +def _make_analysis_cache_key(
467 674 product: Dict[str, Any],
468 675 target_lang: str,
  676 + analysis_kind: str,
  677 + category_taxonomy_profile: Optional[str] = None,
469 678 ) -> str:
470   - """构造缓存 key,仅由 prompt 实际输入文本内容 + 目标语言决定。"""
  679 + """构造缓存 key,由分析类型、prompt 契约(提示词与输出 schema)、实际输入文本与目标语言共同决定。"""
  680 + schema = _get_analysis_schema(
  681 + analysis_kind,
  682 + category_taxonomy_profile=category_taxonomy_profile,
  683 + )
471 684 prompt_input = _build_prompt_input_text(product)
472 685 h = hashlib.md5(prompt_input.encode("utf-8")).hexdigest()
473   - return f"{ANCHOR_CACHE_PREFIX}:{target_lang}:{prompt_input[:4]}{h}"
  686 + prompt_contract = {
  687 + "schema_name": schema.name,
  688 + "cache_version": schema.cache_version,
  689 + "system_message": SYSTEM_MESSAGE,
  690 + "user_instruction_template": USER_INSTRUCTION_TEMPLATE,
  691 + "shared_instruction": schema.shared_instruction,
  692 + "assistant_headers": schema.get_headers(target_lang),
  693 + "result_fields": schema.result_fields,
  694 + "meaningful_fields": schema.meaningful_fields,
  695 + "field_aliases": schema.field_aliases,
  696 + }
  697 + prompt_contract_hash = hashlib.md5(
  698 + json.dumps(prompt_contract, ensure_ascii=False, sort_keys=True).encode("utf-8")
  699 + ).hexdigest()[:12]
  700 + return (
  701 + f"{ANCHOR_CACHE_PREFIX}:{analysis_kind}:{prompt_contract_hash}:"
  702 + f"{target_lang}:{prompt_input[:4]}{h}"
  703 + )
474 704  
475 705  
476   -def _get_cached_anchor_result(
  706 +def _make_anchor_cache_key(
477 707 product: Dict[str, Any],
478 708 target_lang: str,
  709 +) -> str:
  710 + return _make_analysis_cache_key(product, target_lang, analysis_kind="content")
  711 +
  712 +
  713 +def _get_cached_analysis_result(
  714 + product: Dict[str, Any],
  715 + target_lang: str,
  716 + analysis_kind: str,
  717 + category_taxonomy_profile: Optional[str] = None,
479 718 ) -> Optional[Dict[str, Any]]:
480 719 if not _anchor_redis:
481 720 return None
  721 + schema = _get_analysis_schema(
  722 + analysis_kind,
  723 + category_taxonomy_profile=category_taxonomy_profile,
  724 + )
482 725 try:
483   - key = _make_anchor_cache_key(product, target_lang)
  726 + key = _make_analysis_cache_key(
  727 + product,
  728 + target_lang,
  729 + analysis_kind,
  730 + category_taxonomy_profile=category_taxonomy_profile,
  731 + )
484 732 raw = _anchor_redis.get(key)
485 733 if not raw:
486 734 return None
487   - result = _normalize_analysis_result(json.loads(raw), product=product, target_lang=target_lang)
488   - if not _has_meaningful_analysis_content(result):
  735 + result = _normalize_analysis_result(
  736 + json.loads(raw),
  737 + product=product,
  738 + target_lang=target_lang,
  739 + schema=schema,
  740 + )
  741 + if not _has_meaningful_analysis_content(result, schema):
489 742 return None
490 743 return result
491 744 except Exception as e:
492   - logger.warning(f"Failed to get anchor cache: {e}")
  745 + logger.warning("Failed to get %s analysis cache: %s", analysis_kind, e)
493 746 return None
494 747  
495 748  
496   -def _set_cached_anchor_result(
  749 +def _get_cached_anchor_result(
  750 + product: Dict[str, Any],
  751 + target_lang: str,
  752 +) -> Optional[Dict[str, Any]]:
  753 + return _get_cached_analysis_result(product, target_lang, analysis_kind="content")
  754 +
  755 +
  756 +def _set_cached_analysis_result(
497 757 product: Dict[str, Any],
498 758 target_lang: str,
499 759 result: Dict[str, Any],
  760 + analysis_kind: str,
  761 + category_taxonomy_profile: Optional[str] = None,
500 762 ) -> None:
501 763 if not _anchor_redis:
502 764 return
  765 + schema = _get_analysis_schema(
  766 + analysis_kind,
  767 + category_taxonomy_profile=category_taxonomy_profile,
  768 + )
503 769 try:
504   - normalized = _normalize_analysis_result(result, product=product, target_lang=target_lang)
505   - if not _has_meaningful_analysis_content(normalized):
  770 + normalized = _normalize_analysis_result(
  771 + result,
  772 + product=product,
  773 + target_lang=target_lang,
  774 + schema=schema,
  775 + )
  776 + if not _has_meaningful_analysis_content(normalized, schema):
506 777 return
507   - key = _make_anchor_cache_key(product, target_lang)
  778 + key = _make_analysis_cache_key(
  779 + product,
  780 + target_lang,
  781 + analysis_kind,
  782 + category_taxonomy_profile=category_taxonomy_profile,
  783 + )
508 784 ttl = ANCHOR_CACHE_EXPIRE_DAYS * 24 * 3600
509 785 _anchor_redis.setex(key, ttl, json.dumps(normalized, ensure_ascii=False))
510 786 except Exception as e:
511   - logger.warning(f"Failed to set anchor cache: {e}")
  787 + logger.warning("Failed to set %s analysis cache: %s", analysis_kind, e)
  788 +
  789 +
  790 +def _set_cached_anchor_result(
  791 + product: Dict[str, Any],
  792 + target_lang: str,
  793 + result: Dict[str, Any],
  794 +) -> None:
  795 + _set_cached_analysis_result(product, target_lang, result, analysis_kind="content")
512 796  
513 797  
514 798 def _build_assistant_prefix(headers: List[str]) -> str:
... ... @@ -517,8 +801,8 @@ def _build_assistant_prefix(headers: List[str]) -> str:
517 801 return f"{header_line}\n{separator_line}\n"
518 802  
519 803  
520   -def _build_shared_context(products: List[Dict[str, str]]) -> str:
521   - shared_context = SHARED_ANALYSIS_INSTRUCTION
  804 +def _build_shared_context(products: List[Dict[str, str]], schema: AnalysisSchema) -> str:
  805 + shared_context = schema.shared_instruction
522 806 for idx, product in enumerate(products, 1):
523 807 prompt_input = _build_prompt_input_text(product)
524 808 shared_context += f"{idx}. {prompt_input}\n"
... ... @@ -550,16 +834,23 @@ def reset_logged_shared_context_keys() -&gt; None:
550 834 def create_prompt(
551 835 products: List[Dict[str, str]],
552 836 target_lang: str = "zh",
553   -) -> Tuple[str, str, str]:
  837 + analysis_kind: str = "content",
  838 + category_taxonomy_profile: Optional[str] = None,
  839 +) -> Tuple[Optional[str], Optional[str], Optional[str]]:
554 840 """根据目标语言创建共享上下文、本地化输出要求和 Partial Mode assistant 前缀。"""
555   - markdown_table_headers = LANGUAGE_MARKDOWN_TABLE_HEADERS.get(target_lang)
  841 + schema = _get_analysis_schema(
  842 + analysis_kind,
  843 + category_taxonomy_profile=category_taxonomy_profile,
  844 + )
  845 + markdown_table_headers = schema.get_headers(target_lang)
556 846 if not markdown_table_headers:
557 847 logger.warning(
558   - "Unsupported target_lang for markdown table headers: %s",
  848 + "Unsupported target_lang for markdown table headers: kind=%s lang=%s",
  849 + analysis_kind,
559 850 target_lang,
560 851 )
561 852 return None, None, None
562   - shared_context = _build_shared_context(products)
  853 + shared_context = _build_shared_context(products, schema)
563 854 language_label = SOURCE_LANG_CODE_MAP.get(target_lang, target_lang)
564 855 user_prompt = USER_INSTRUCTION_TEMPLATE.format(language=language_label).strip()
565 856 assistant_prefix = _build_assistant_prefix(markdown_table_headers)
... ... @@ -592,6 +883,7 @@ def call_llm(
592 883 user_prompt: str,
593 884 assistant_prefix: str,
594 885 target_lang: str = "zh",
  886 + analysis_kind: str = "content",
595 887 ) -> Tuple[str, str]:
596 888 """调用大模型 API(带重试机制),使用 Partial Mode 强制 markdown 表格前缀。"""
597 889 headers = {
... ... @@ -631,8 +923,9 @@ def call_llm(
631 923 if _mark_shared_context_logged_once(shared_context_key):
632 924 logger.info(f"\n{'=' * 80}")
633 925 logger.info(
634   - "LLM Shared Context [model=%s, shared_key=%s, chars=%s] (logged once per process key)",
  926 + "LLM Shared Context [model=%s, kind=%s, shared_key=%s, chars=%s] (logged once per process key)",
635 927 MODEL_NAME,
  928 + analysis_kind,
636 929 shared_context_key,
637 930 len(shared_context),
638 931 )
... ... @@ -641,8 +934,9 @@ def call_llm(
641 934  
642 935 verbose_logger.info(f"\n{'=' * 80}")
643 936 verbose_logger.info(
644   - "LLM Request [model=%s, lang=%s, shared_key=%s, tail_key=%s]:",
  937 + "LLM Request [model=%s, kind=%s, lang=%s, shared_key=%s, tail_key=%s]:",
645 938 MODEL_NAME,
  939 + analysis_kind,
646 940 target_lang,
647 941 shared_context_key,
648 942 localized_tail_key,
... ... @@ -654,7 +948,8 @@ def call_llm(
654 948 verbose_logger.info(f"\nAssistant Prefix:\n{assistant_prefix}")
655 949  
656 950 logger.info(
657   - "\nLLM Request Variant [lang=%s, shared_key=%s, tail_key=%s, prompt_chars=%s, prefix_chars=%s]",
  951 + "\nLLM Request Variant [kind=%s, lang=%s, shared_key=%s, tail_key=%s, prompt_chars=%s, prefix_chars=%s]",
  952 + analysis_kind,
658 953 target_lang,
659 954 shared_context_key,
660 955 localized_tail_key,
... ... @@ -685,8 +980,9 @@ def call_llm(
685 980 usage = result.get("usage") or {}
686 981  
687 982 verbose_logger.info(
688   - "\nLLM Response [model=%s, lang=%s, shared_key=%s, tail_key=%s]:",
  983 + "\nLLM Response [model=%s, kind=%s, lang=%s, shared_key=%s, tail_key=%s]:",
689 984 MODEL_NAME,
  985 + analysis_kind,
690 986 target_lang,
691 987 shared_context_key,
692 988 localized_tail_key,
... ... @@ -697,7 +993,8 @@ def call_llm(
697 993 full_markdown = _merge_partial_response(assistant_prefix, generated_content)
698 994  
699 995 logger.info(
700   - "\nLLM Response Summary [lang=%s, shared_key=%s, tail_key=%s, generated_chars=%s, completion_tokens=%s, prompt_tokens=%s, total_tokens=%s]",
  996 + "\nLLM Response Summary [kind=%s, lang=%s, shared_key=%s, tail_key=%s, generated_chars=%s, completion_tokens=%s, prompt_tokens=%s, total_tokens=%s]",
  997 + analysis_kind,
701 998 target_lang,
702 999 shared_context_key,
703 1000 localized_tail_key,
... ... @@ -742,8 +1039,16 @@ def call_llm(
742 1039 session.close()
743 1040  
744 1041  
745   -def parse_markdown_table(markdown_content: str) -> List[Dict[str, str]]:
  1042 +def parse_markdown_table(
  1043 + markdown_content: str,
  1044 + analysis_kind: str = "content",
  1045 + category_taxonomy_profile: Optional[str] = None,
  1046 +) -> List[Dict[str, str]]:
746 1047 """解析markdown表格内容"""
  1048 + schema = _get_analysis_schema(
  1049 + analysis_kind,
  1050 + category_taxonomy_profile=category_taxonomy_profile,
  1051 + )
747 1052 lines = markdown_content.strip().split("\n")
748 1053 data = []
749 1054 data_started = False
... ... @@ -768,22 +1073,16 @@ def parse_markdown_table(markdown_content: str) -> List[Dict[str, str]]:
768 1073  
769 1074 # 解析数据行
770 1075 parts = [p.strip() for p in line.split("|")]
771   - parts = [p for p in parts if p] # 移除空字符串
  1076 + if parts and parts[0] == "":
  1077 + parts = parts[1:]
  1078 + if parts and parts[-1] == "":
  1079 + parts = parts[:-1]
772 1080  
773 1081 if len(parts) >= 2:
774   - row = {
775   - "seq_no": parts[0],
776   - "title": parts[1], # product title (in target language)
777   - "category_path": parts[2] if len(parts) > 2 else "", # category path
778   - "tags": parts[3] if len(parts) > 3 else "", # fine-grained tags
779   - "target_audience": parts[4] if len(parts) > 4 else "", # target audience
780   - "usage_scene": parts[5] if len(parts) > 5 else "", # usage scene
781   - "season": parts[6] if len(parts) > 6 else "", # applicable season
782   - "key_attributes": parts[7] if len(parts) > 7 else "", # key attributes
783   - "material": parts[8] if len(parts) > 8 else "", # material description
784   - "features": parts[9] if len(parts) > 9 else "", # feature highlights
785   - "anchor_text": parts[10] if len(parts) > 10 else "", # anchor text
786   - }
  1082 + row = {"seq_no": parts[0]}
  1083 + for field_index, field_name in enumerate(schema.result_fields, start=1):
  1084 + cell = parts[field_index] if len(parts) > field_index else ""
  1085 + row[field_name] = _normalize_markdown_table_cell(cell)
787 1086 data.append(row)
788 1087  
789 1088 return data
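Worth noting why the split logic changed: the old `parts = [p for p in parts if p]` dropped empty *inner* cells, so a blank column shifted every later field left. A minimal standalone sketch of the new boundary-only trimming (the helper name is illustrative, not from the module):

```python
def split_table_row(line: str) -> list[str]:
    # Split a markdown table row on "|", stripping whitespace per cell.
    parts = [p.strip() for p in line.split("|")]
    # Drop only the boundary empties produced by a leading/trailing "|",
    # so empty INNER cells keep their column positions.
    if parts and parts[0] == "":
        parts = parts[1:]
    if parts and parts[-1] == "":
        parts = parts[:-1]
    return parts
```

With the old filter, `"| 1 | 遥控车 |  | toy |"` would collapse to three cells and `toy` would land in the wrong column; the boundary-only trim keeps the empty cell in place.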
... ... @@ -794,31 +1093,49 @@ def _log_parsed_result_quality(
794 1093 parsed_results: List[Dict[str, str]],
795 1094 target_lang: str,
796 1095 batch_num: int,
  1096 + analysis_kind: str,
  1097 + category_taxonomy_profile: Optional[str] = None,
797 1098 ) -> None:
  1099 + schema = _get_analysis_schema(
  1100 + analysis_kind,
  1101 + category_taxonomy_profile=category_taxonomy_profile,
  1102 + )
798 1103 expected = len(batch_data)
799 1104 actual = len(parsed_results)
800 1105 if actual != expected:
801 1106 logger.warning(
802   - "Parsed row count mismatch for batch=%s lang=%s: expected=%s actual=%s",
  1107 + "Parsed row count mismatch for kind=%s batch=%s lang=%s: expected=%s actual=%s",
  1108 + analysis_kind,
803 1109 batch_num,
804 1110 target_lang,
805 1111 expected,
806 1112 actual,
807 1113 )
808 1114  
809   - missing_anchor = sum(1 for item in parsed_results if not str(item.get("anchor_text") or "").strip())
810   - missing_category = sum(1 for item in parsed_results if not str(item.get("category_path") or "").strip())
811   - missing_title = sum(1 for item in parsed_results if not str(item.get("title") or "").strip())
  1115 + if not schema.quality_fields:
  1116 + logger.info(
  1117 + "Parsed Quality Summary [kind=%s, batch=%s, lang=%s]: rows=%s/%s",
  1118 + analysis_kind,
  1119 + batch_num,
  1120 + target_lang,
  1121 + actual,
  1122 + expected,
  1123 + )
  1124 + return
812 1125  
  1126 + missing_summary = ", ".join(
  1127 + f"missing_{field}="
  1128 + f"{sum(1 for item in parsed_results if not str(item.get(field) or '').strip())}"
  1129 + for field in schema.quality_fields
  1130 + )
813 1131 logger.info(
814   - "Parsed Quality Summary [batch=%s, lang=%s]: rows=%s/%s, missing_title=%s, missing_category=%s, missing_anchor=%s",
  1132 + "Parsed Quality Summary [kind=%s, batch=%s, lang=%s]: rows=%s/%s, %s",
  1133 + analysis_kind,
815 1134 batch_num,
816 1135 target_lang,
817 1136 actual,
818 1137 expected,
819   - missing_title,
820   - missing_category,
821   - missing_anchor,
  1138 + missing_summary,
822 1139 )
823 1140  
824 1141  
... ... @@ -826,29 +1143,44 @@ def process_batch(
826 1143 batch_data: List[Dict[str, str]],
827 1144 batch_num: int,
828 1145 target_lang: str = "zh",
  1146 + analysis_kind: str = "content",
  1147 + category_taxonomy_profile: Optional[str] = None,
829 1148 ) -> List[Dict[str, Any]]:
830 1149 """处理一个批次的数据"""
  1150 + schema = _get_analysis_schema(
  1151 + analysis_kind,
  1152 + category_taxonomy_profile=category_taxonomy_profile,
  1153 + )
831 1154 logger.info(f"\n{'#' * 80}")
832   - logger.info(f"Processing Batch {batch_num} ({len(batch_data)} items)")
  1155 + logger.info(
  1156 + "Processing Batch %s (%s items, kind=%s)",
  1157 + batch_num,
  1158 + len(batch_data),
  1159 + analysis_kind,
  1160 + )
833 1161  
834 1162 # Create the prompt
835 1163 shared_context, user_prompt, assistant_prefix = create_prompt(
836 1164 batch_data,
837 1165 target_lang=target_lang,
  1166 + analysis_kind=analysis_kind,
  1167 + category_taxonomy_profile=category_taxonomy_profile,
838 1168 )
839 1169  
840 1170 # If prompt creation fails (e.g. unsupported target_lang), fail the whole batch without calling the LLM
841 1171 if shared_context is None or user_prompt is None or assistant_prefix is None:
842 1172 logger.error(
843   - "Failed to create prompt for batch %s, target_lang=%s; "
  1173 + "Failed to create prompt for batch %s, kind=%s, target_lang=%s; "
844 1174 "marking entire batch as failed without calling LLM",
845 1175 batch_num,
  1176 + analysis_kind,
846 1177 target_lang,
847 1178 )
848 1179 return [
849 1180 _make_empty_analysis_result(
850 1181 item,
851 1182 target_lang,
  1183 + schema,
852 1184 error=f"prompt_creation_failed: unsupported target_lang={target_lang}",
853 1185 )
854 1186 for item in batch_data
... ... @@ -861,11 +1193,23 @@ def process_batch(
861 1193 user_prompt,
862 1194 assistant_prefix,
863 1195 target_lang=target_lang,
  1196 + analysis_kind=analysis_kind,
864 1197 )
865 1198  
866 1199 # Parse the results
867   - parsed_results = parse_markdown_table(raw_response)
868   - _log_parsed_result_quality(batch_data, parsed_results, target_lang, batch_num)
  1200 + parsed_results = parse_markdown_table(
  1201 + raw_response,
  1202 + analysis_kind=analysis_kind,
  1203 + category_taxonomy_profile=category_taxonomy_profile,
  1204 + )
  1205 + _log_parsed_result_quality(
  1206 + batch_data,
  1207 + parsed_results,
  1208 + target_lang,
  1209 + batch_num,
  1210 + analysis_kind,
  1211 + category_taxonomy_profile,
  1212 + )
869 1213  
870 1214 logger.info(f"\nParsed Results ({len(parsed_results)} items):")
871 1215 logger.info(json.dumps(parsed_results, ensure_ascii=False, indent=2))
... ... @@ -879,10 +1223,12 @@ def process_batch(
879 1223 parsed_item,
880 1224 product=source_product,
881 1225 target_lang=target_lang,
  1226 + schema=schema,
882 1227 )
883 1228 results_with_ids.append(result)
884 1229 logger.info(
885   - "Mapped: seq=%s -> original_id=%s",
  1230 + "Mapped: kind=%s seq=%s -> original_id=%s",
  1231 + analysis_kind,
886 1232 parsed_item.get("seq_no"),
887 1233 source_product.get("id"),
888 1234 )
... ... @@ -890,6 +1236,7 @@ def process_batch(
890 1236 # Save the batch JSON log to a separate file
891 1237 batch_log = {
892 1238 "batch_num": batch_num,
  1239 + "analysis_kind": analysis_kind,
893 1240 "timestamp": datetime.now().isoformat(),
894 1241 "input_products": batch_data,
895 1242 "raw_response": raw_response,
... ... @@ -900,7 +1247,10 @@ def process_batch(
900 1247  
901 1248 # When writing batch json logs concurrently, keep file names unique to avoid overwrites
902 1249 batch_call_id = uuid.uuid4().hex[:12]
903   - batch_log_file = LOG_DIR / f"batch_{batch_num:04d}_{timestamp}_{batch_call_id}.json"
  1250 + batch_log_file = (
  1251 + LOG_DIR
  1252 + / f"batch_{analysis_kind}_{batch_num:04d}_{timestamp}_{batch_call_id}.json"
  1253 + )
904 1254 with open(batch_log_file, "w", encoding="utf-8") as f:
905 1255 json.dump(batch_log, f, ensure_ascii=False, indent=2)
906 1256  
... ... @@ -912,7 +1262,7 @@ def process_batch(
912 1262 logger.error(f"Error processing batch {batch_num}: {str(e)}", exc_info=True)
913 1263 # Return empty results, preserving the ID mapping
914 1264 return [
915   - _make_empty_analysis_result(item, target_lang, error=str(e))
  1265 + _make_empty_analysis_result(item, target_lang, schema, error=str(e))
916 1266 for item in batch_data
917 1267 ]
918 1268  
... ... @@ -922,6 +1272,8 @@ def analyze_products(
922 1272 target_lang: str = "zh",
923 1273 batch_size: Optional[int] = None,
924 1274 tenant_id: Optional[str] = None,
  1275 + analysis_kind: str = "content",
  1276 + category_taxonomy_profile: Optional[str] = None,
925 1277 ) -> List[Dict[str, Any]]:
926 1278 """
927 1279 库调用入口:根据输入+语言,返回锚文本及各维度信息。
... ... @@ -937,6 +1289,10 @@ def analyze_products(
937 1289 if not products:
938 1290 return []
939 1291  
  1292 + _get_analysis_schema(
  1293 + analysis_kind,
  1294 + category_taxonomy_profile=category_taxonomy_profile,
  1295 + )
940 1296 results_by_index: List[Optional[Dict[str, Any]]] = [None] * len(products)
941 1297 uncached_items: List[Tuple[int, Dict[str, str]]] = []
942 1298  
... ... @@ -946,11 +1302,16 @@ def analyze_products(
946 1302 uncached_items.append((idx, product))
947 1303 continue
948 1304  
949   - cached = _get_cached_anchor_result(product, target_lang)
  1305 + cached = _get_cached_analysis_result(
  1306 + product,
  1307 + target_lang,
  1308 + analysis_kind,
  1309 + category_taxonomy_profile=category_taxonomy_profile,
  1310 + )
950 1311 if cached:
951 1312 logger.info(
952 1313 f"[analyze_products] Cache hit for title='{title[:50]}...', "
953   - f"lang={target_lang}"
  1314 + f"kind={analysis_kind}, lang={target_lang}"
954 1315 )
955 1316 results_by_index[idx] = cached
956 1317 continue
... ... @@ -979,9 +1340,15 @@ def analyze_products(
979 1340 for batch_num, batch_slice, batch in batch_jobs:
980 1341 logger.info(
981 1342 f"[analyze_products] Processing batch {batch_num}/{total_batches}, "
982   - f"size={len(batch)}, target_lang={target_lang}"
  1343 + f"size={len(batch)}, kind={analysis_kind}, target_lang={target_lang}"
  1344 + )
  1345 + batch_results = process_batch(
  1346 + batch,
  1347 + batch_num=batch_num,
  1348 + target_lang=target_lang,
  1349 + analysis_kind=analysis_kind,
  1350 + category_taxonomy_profile=category_taxonomy_profile,
983 1351 )
984   - batch_results = process_batch(batch, batch_num=batch_num, target_lang=target_lang)
985 1352  
986 1353 for (original_idx, product), item in zip(batch_slice, batch_results):
987 1354 results_by_index[original_idx] = item
... ... @@ -992,7 +1359,13 @@ def analyze_products(
992 1359 # Do not cache error results, to avoid amplifying transient failures
993 1360 continue
994 1361 try:
995   - _set_cached_anchor_result(product, target_lang, item)
  1362 + _set_cached_analysis_result(
  1363 + product,
  1364 + target_lang,
  1365 + item,
  1366 + analysis_kind,
  1367 + category_taxonomy_profile=category_taxonomy_profile,
  1368 + )
996 1369 except Exception:
997 1370 # warning already logged internally
998 1371 pass
... ... @@ -1000,10 +1373,11 @@ def analyze_products(
1000 1373 max_workers = min(CONTENT_UNDERSTANDING_MAX_WORKERS, len(batch_jobs))
1001 1374 logger.info(
1002 1375 "[analyze_products] Using ThreadPoolExecutor for uncached batches: "
1003   - "max_workers=%s, total_batches=%s, bs=%s, target_lang=%s",
  1376 + "max_workers=%s, total_batches=%s, bs=%s, kind=%s, target_lang=%s",
1004 1377 max_workers,
1005 1378 total_batches,
1006 1379 bs,
  1380 + analysis_kind,
1007 1381 target_lang,
1008 1382 )
1009 1383  
... ... @@ -1013,7 +1387,12 @@ def analyze_products(
1013 1387 future_by_batch_num: Dict[int, Any] = {}
1014 1388 for batch_num, _batch_slice, batch in batch_jobs:
1015 1389 future_by_batch_num[batch_num] = executor.submit(
1016   - process_batch, batch, batch_num=batch_num, target_lang=target_lang
  1390 + process_batch,
  1391 + batch,
  1392 + batch_num=batch_num,
  1393 + target_lang=target_lang,
  1394 + analysis_kind=analysis_kind,
  1395 + category_taxonomy_profile=category_taxonomy_profile,
1017 1396 )
1018 1397  
1019 1398 # Backfill by batch_num for stable output (results_by_index maps by original input index)
... ... @@ -1028,7 +1407,13 @@ def analyze_products(
1028 1407 # Do not cache error results, to avoid amplifying transient failures
1029 1408 continue
1030 1409 try:
1031   - _set_cached_anchor_result(product, target_lang, item)
  1410 + _set_cached_analysis_result(
  1411 + product,
  1412 + target_lang,
  1413 + item,
  1414 + analysis_kind,
  1415 + category_taxonomy_profile=category_taxonomy_profile,
  1416 + )
1032 1417 except Exception:
1033 1418 # warning already logged internally
1034 1419 pass
... ...
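The executor path in `analyze_products` submits every uncached batch and then backfills results by `batch_num`, so output order is independent of completion order. A self-contained sketch of that pattern, with a trivial doubling worker standing in for `process_batch`:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def run_batches_in_order(batches: List[List[int]], max_workers: int = 4) -> List[List[int]]:
    # Stand-in for process_batch: any per-batch computation works here.
    def process(batch: List[int]) -> List[int]:
        return [x * 2 for x in batch]

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Key each future by its batch number at submission time...
        future_by_batch_num = {
            num: executor.submit(process, batch)
            for num, batch in enumerate(batches, start=1)
        }
        # ...then collect in batch-number order, not completion order.
        return [
            future_by_batch_num[num].result()
            for num in sorted(future_by_batch_num)
        ]
```

Collecting by key rather than with `as_completed` is what keeps `results_by_index` deterministic across runs.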
indexer/product_enrich_prompts.py
1 1 #!/usr/bin/env python3
2 2  
3   -from typing import Any, Dict
  3 +from typing import Any, Dict, Tuple
4 4  
5 5 SYSTEM_MESSAGE = (
6 6 "You are an e-commerce product annotator. "
... ... @@ -33,6 +33,337 @@ Input product list:
33 33 USER_INSTRUCTION_TEMPLATE = """Please strictly return a Markdown table following the given columns in the specified language. For any column containing multiple values, separate them with commas. Do not add any other explanation.
34 34 Language: {language}"""
35 35  
  36 +def _taxonomy_field(
  37 + key: str,
  38 + label: str,
  39 + description: str,
  40 + zh_label: str | None = None,
  41 +) -> Dict[str, str]:
  42 + return {
  43 + "key": key,
  44 + "label": label,
  45 + "description": description,
  46 + "zh_label": zh_label or label,
  47 + }
  48 +
  49 +
  50 +def _build_taxonomy_shared_instruction(profile_label: str, fields: Tuple[Dict[str, str], ...]) -> str:
  51 + lines = [
  52 + f"Analyze each input product text and fill the columns below using a {profile_label} attribute taxonomy.",
  53 + "",
  54 + "Output columns:",
  55 + ]
  56 + for idx, field in enumerate(fields, start=1):
  57 + lines.append(f"{idx}. {field['label']}: {field['description']}")
  58 + lines.extend(
  59 + [
  60 + "",
  61 + "Rules:",
  62 + "- Keep the same row order and row count as input.",
  63 + "- Leave blank if not applicable, unmentioned, or unsupported.",
  64 + "- Use concise, standardized ecommerce wording.",
  65 + "- If multiple values, separate with commas.",
  66 + "",
  67 + "Input product list:",
  68 + ]
  69 + )
  70 + return "\n".join(lines)
  71 +
  72 +
  73 +def _make_taxonomy_profile(
  74 + profile_label: str,
  75 + fields: Tuple[Dict[str, str], ...],
  76 +) -> Dict[str, Any]:
  77 + headers = {
  78 + "en": ["No.", *[field["label"] for field in fields]],
  79 + "zh": ["序号", *[field["zh_label"] for field in fields]],
  80 + }
  81 + return {
  82 + "profile_label": profile_label,
  83 + "fields": fields,
  84 + "shared_instruction": _build_taxonomy_shared_instruction(profile_label, fields),
  85 + "markdown_table_headers": headers,
  86 + }
  87 +
  88 +
  89 +APPAREL_TAXONOMY_FIELDS = (
  90 + _taxonomy_field("product_type", "Product Type", "concise ecommerce apparel category label, not a full marketing title", "品类"),
  91 + _taxonomy_field("target_gender", "Target Gender", "intended gender only if clearly implied", "目标性别"),
  92 + _taxonomy_field("age_group", "Age Group", "only if clearly implied, e.g. adults, kids, teens, toddlers, babies", "年龄段"),
  93 + _taxonomy_field("season", "Season", "season(s) or all-season suitability only if supported", "适用季节"),
  94 + _taxonomy_field("fit", "Fit", "body closeness, e.g. slim, regular, relaxed, oversized, fitted", "版型"),
  95 + _taxonomy_field("silhouette", "Silhouette", "overall garment shape, e.g. straight, A-line, boxy, tapered, bodycon, wide-leg", "廓形"),
  96 + _taxonomy_field("neckline", "Neckline", "neckline type when applicable, e.g. crew neck, V-neck, hooded, collared, square neck", "领型"),
  97 + _taxonomy_field("sleeve_length_type", "Sleeve Length Type", "sleeve length only, e.g. sleeveless, short sleeve, long sleeve, three-quarter sleeve", "袖长类型"),
  98 + _taxonomy_field("sleeve_style", "Sleeve Style", "sleeve design only, e.g. puff sleeve, raglan sleeve, batwing sleeve, bell sleeve", "袖型"),
  99 + _taxonomy_field("strap_type", "Strap Type", "strap design when applicable, e.g. spaghetti strap, wide strap, halter strap, adjustable strap", "肩带设计"),
  100 + _taxonomy_field("rise_waistline", "Rise / Waistline", "waist placement when applicable, e.g. high rise, mid rise, low rise, empire waist", "腰型"),
  101 + _taxonomy_field("leg_shape", "Leg Shape", "for bottoms only, e.g. straight leg, wide leg, flare leg, tapered leg, skinny leg", "裤型"),
  102 + _taxonomy_field("skirt_shape", "Skirt Shape", "for skirts only, e.g. A-line, pleated, pencil, mermaid", "裙型"),
  103 + _taxonomy_field("length_type", "Length Type", "design length only, not size, e.g. cropped, regular, longline, mini, midi, maxi, ankle length, full length", "长度类型"),
  104 + _taxonomy_field("closure_type", "Closure Type", "fastening method when applicable, e.g. zipper, button, drawstring, elastic waist, hook-and-loop", "闭合方式"),
  105 + _taxonomy_field("design_details", "Design Details", "construction or visual details, e.g. ruched, ruffled, pleated, cut-out, layered, distressed, split hem", "设计细节"),
  106 + _taxonomy_field("fabric", "Fabric", "fabric type only, e.g. denim, knit, chiffon, jersey, fleece, cotton twill", "面料"),
  107 + _taxonomy_field("material_composition", "Material Composition", "fiber content or blend only if stated, e.g. cotton, polyester, spandex, linen blend, 95% cotton 5% elastane", "成分"),
  108 + _taxonomy_field("fabric_properties", "Fabric Properties", "inherent fabric traits, e.g. stretch, breathable, lightweight, soft-touch, water-resistant", "面料特性"),
  109 + _taxonomy_field("clothing_features", "Clothing Features", "product features, e.g. lined, reversible, hooded, packable, padded, pocketed", "服装特征"),
  110 + _taxonomy_field("functional_benefits", "Functional Benefits", "wearer benefits, e.g. moisture-wicking, thermal insulation, UV protection, easy care, supportive compression", "功能"),
  111 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  112 + _taxonomy_field("color_family", "Color Family", "normalized broad retail color group, e.g. black, white, blue, green, red, pink, beige, brown, gray", "色系"),
  113 + _taxonomy_field("print_pattern", "Print / Pattern", "surface pattern when applicable, e.g. solid, striped, plaid, floral, graphic, animal print", "印花 / 图案"),
  114 + _taxonomy_field("occasion_end_use", "Occasion / End Use", "likely use occasion only if supported, e.g. office, casual wear, streetwear, lounge, workout, outdoor", "适用场景"),
  115 + _taxonomy_field("style_aesthetic", "Style Aesthetic", "overall style only if supported, e.g. minimalist, streetwear, athleisure, smart casual, romantic, playful", "风格"),
  116 +)
  117 +
  118 +THREE_C_TAXONOMY_FIELDS = (
  119 + _taxonomy_field("product_type", "Product Type", "concise 3C accessory or peripheral category label", "品类"),
  120 + _taxonomy_field("compatible_device", "Compatible Device / Model", "supported device family, series, model, or form factor when clearly stated", "适配设备 / 型号"),
  121 + _taxonomy_field("connectivity", "Connectivity", "connection method such as wired, wireless, Bluetooth, Wi-Fi, NFC, or 2.4G", "连接方式"),
  122 + _taxonomy_field("interface_port_type", "Interface / Port Type", "relevant connector or port, e.g. USB-C, Lightning, HDMI, AUX, RJ45", "接口 / 端口类型"),
  123 + _taxonomy_field("power_charging", "Power Source / Charging", "charging or power mode, e.g. battery powered, fast charging, rechargeable, plug-in", "供电 / 充电方式"),
  124 + _taxonomy_field("key_features", "Key Features", "primary hardware features such as noise cancelling, foldable, magnetic, backlit, waterproof", "关键特征"),
  125 + _taxonomy_field("material_finish", "Material / Finish", "main material or exterior finish when supported", "材质 / 表面处理"),
  126 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  127 + _taxonomy_field("pack_size", "Pack Size", "unit count or bundle size when stated", "包装规格"),
  128 + _taxonomy_field("use_case", "Use Case", "intended usage such as travel, office, gaming, car, charging, streaming", "使用场景"),
  129 +)
  130 +
  131 +BAGS_TAXONOMY_FIELDS = (
  132 + _taxonomy_field("product_type", "Product Type", "concise bag category such as backpack, tote bag, crossbody bag, luggage, or wallet", "品类"),
  133 + _taxonomy_field("target_gender", "Target Gender", "intended gender only if clearly implied", "目标性别"),
  134 + _taxonomy_field("carry_style", "Carry Style", "how the bag is worn or carried, e.g. handheld, shoulder, crossbody, backpack", "携带方式"),
  135 + _taxonomy_field("size_capacity", "Size / Capacity", "size tier or capacity when supported, e.g. mini, large capacity, 20L", "尺寸 / 容量"),
  136 + _taxonomy_field("material", "Material", "main bag material such as leather, nylon, canvas, PU, straw", "材质"),
  137 + _taxonomy_field("closure_type", "Closure Type", "bag closure such as zipper, flap, buckle, drawstring, magnetic snap", "闭合方式"),
  138 + _taxonomy_field("structure_compartments", "Structure / Compartments", "organizational structure such as multi-pocket, laptop sleeve, card slots, expandable", "结构 / 分层"),
  139 + _taxonomy_field("strap_handle_type", "Strap / Handle Type", "strap or handle design such as chain strap, top handle, adjustable strap", "肩带 / 提手类型"),
  140 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  141 + _taxonomy_field("occasion_end_use", "Occasion / End Use", "likely use such as commute, travel, evening, school, casual", "适用场景"),
  142 +)
  143 +
  144 +PET_SUPPLIES_TAXONOMY_FIELDS = (
  145 + _taxonomy_field("product_type", "Product Type", "concise pet supplies category label", "品类"),
  146 + _taxonomy_field("pet_type", "Pet Type", "target pet such as dog, cat, bird, fish, hamster", "宠物类型"),
  147 + _taxonomy_field("breed_size", "Breed Size", "pet size or breed size when stated, e.g. small breed, large dogs", "体型 / 品种大小"),
  148 + _taxonomy_field("life_stage", "Life Stage", "pet age stage when supported, e.g. puppy, kitten, adult, senior", "成长阶段"),
  149 + _taxonomy_field("material_ingredients", "Material / Ingredients", "main material or ingredient composition when supported", "材质 / 成分"),
  150 + _taxonomy_field("flavor_scent", "Flavor / Scent", "flavor or scent when applicable", "口味 / 气味"),
  151 + _taxonomy_field("key_features", "Key Features", "primary attributes such as interactive, leak-proof, orthopedic, washable, elevated", "关键特征"),
  152 + _taxonomy_field("functional_benefits", "Functional Benefits", "benefits such as dental care, calming, digestion support, joint support", "功能"),
  153 + _taxonomy_field("size_capacity", "Size / Capacity", "size, count, or net content when stated", "尺寸 / 容量"),
  154 + _taxonomy_field("use_scenario", "Use Scenario", "usage such as feeding, training, grooming, travel, indoor play", "使用场景"),
  155 +)
  156 +
  157 +ELECTRONICS_TAXONOMY_FIELDS = (
  158 + _taxonomy_field("product_type", "Product Type", "concise electronics device or component category label", "品类"),
  159 + _taxonomy_field("device_category", "Device Category / Compatibility", "supported platform, component class, or compatible device family when stated", "设备类别 / 兼容性"),
  160 + _taxonomy_field("power_voltage", "Power / Voltage", "power, voltage, wattage, or battery spec when supported", "功率 / 电压"),
  161 + _taxonomy_field("connectivity", "Connectivity", "connection method such as wired, Bluetooth, Wi-Fi, RF, or smart app control", "连接方式"),
  162 + _taxonomy_field("interface_port_type", "Interface / Port Type", "relevant port or interface such as USB-C, AC plug type, HDMI, SATA", "接口 / 端口类型"),
  163 + _taxonomy_field("capacity_storage", "Capacity / Storage", "capacity or storage spec such as 256GB, 2TB, 5000mAh", "容量 / 存储"),
  164 + _taxonomy_field("key_features", "Key Features", "main product features such as touch control, HD display, noise reduction, smart control", "关键特征"),
  165 + _taxonomy_field("material_finish", "Material / Finish", "main housing material or finish when supported", "材质 / 表面处理"),
  166 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  167 + _taxonomy_field("use_case", "Use Case", "intended use such as home entertainment, office, charging, security, repair", "使用场景"),
  168 +)
  169 +
  170 +OUTDOOR_TAXONOMY_FIELDS = (
  171 + _taxonomy_field("product_type", "Product Type", "concise outdoor gear category label", "品类"),
  172 + _taxonomy_field("activity_type", "Activity Type", "primary outdoor activity such as camping, hiking, fishing, climbing, travel", "活动类型"),
  173 + _taxonomy_field("season_weather", "Season / Weather", "season or weather suitability when supported", "适用季节 / 天气"),
  174 + _taxonomy_field("material", "Material", "main material such as aluminum, ripstop nylon, stainless steel, EVA", "材质"),
  175 + _taxonomy_field("capacity_size", "Capacity / Size", "size, length, or capacity when stated", "容量 / 尺寸"),
  176 + _taxonomy_field("protection_resistance", "Protection / Resistance", "resistance or protection such as waterproof, UV resistant, windproof", "防护 / 耐受性"),
  177 + _taxonomy_field("key_features", "Key Features", "primary gear attributes such as foldable, lightweight, insulated, non-slip", "关键特征"),
  178 + _taxonomy_field("portability_packability", "Portability / Packability", "carry or storage trait such as collapsible, compact, ultralight, packable", "便携 / 收纳性"),
  179 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  180 + _taxonomy_field("use_scenario", "Use Scenario", "likely use setting such as campsite, trail, survival kit, beach, picnic", "使用场景"),
  181 +)
  182 +
  183 +HOME_APPLIANCES_TAXONOMY_FIELDS = (
  184 + _taxonomy_field("product_type", "Product Type", "concise home appliance category label", "品类"),
  185 + _taxonomy_field("appliance_category", "Appliance Category", "functional class such as kitchen appliance, cleaning appliance, personal care appliance", "家电类别"),
  186 + _taxonomy_field("power_voltage", "Power / Voltage", "wattage, voltage, plug type, or power supply when supported", "功率 / 电压"),
  187 + _taxonomy_field("capacity_coverage", "Capacity / Coverage", "capacity or coverage metric such as 1.5L, 20L, 40sqm", "容量 / 覆盖范围"),
  188 + _taxonomy_field("control_method", "Control Method", "operation method such as touch, knob, remote, app control", "控制方式"),
  189 + _taxonomy_field("installation_type", "Installation Type", "setup style such as countertop, handheld, portable, wall-mounted, built-in", "安装方式"),
  190 + _taxonomy_field("key_features", "Key Features", "main product features such as timer, steam, HEPA filter, self-cleaning", "关键特征"),
  191 + _taxonomy_field("material_finish", "Material / Finish", "main material or exterior finish when supported", "材质 / 表面处理"),
  192 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  193 + _taxonomy_field("use_scenario", "Use Scenario", "intended use such as cooking, cleaning, grooming, cooling, air treatment", "使用场景"),
  194 +)
  195 +
  196 +HOME_LIVING_TAXONOMY_FIELDS = (
  197 + _taxonomy_field("product_type", "Product Type", "concise home and living category label", "品类"),
  198 + _taxonomy_field("room_placement", "Room / Placement", "intended room or placement such as bedroom, kitchen, bathroom, desktop", "适用空间 / 摆放位置"),
  199 + _taxonomy_field("material", "Material", "main material such as wood, ceramic, cotton, glass, metal", "材质"),
  200 + _taxonomy_field("style", "Style", "home style such as modern, farmhouse, minimalist, boho, Nordic", "风格"),
  201 + _taxonomy_field("size_dimensions", "Size / Dimensions", "size or dimensions when stated", "尺寸 / 规格"),
  202 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  203 + _taxonomy_field("pattern_finish", "Pattern / Finish", "surface pattern or finish such as solid, marble, matte, ribbed", "图案 / 表面处理"),
  204 + _taxonomy_field("key_features", "Key Features", "main product features such as stackable, washable, blackout, space-saving", "关键特征"),
  205 + _taxonomy_field("assembly_installation", "Assembly / Installation", "assembly or installation trait when supported", "组装 / 安装"),
  206 + _taxonomy_field("use_scenario", "Use Scenario", "intended use such as storage, dining, decor, sleep, organization", "使用场景"),
  207 +)
  208 +
  209 +WIGS_TAXONOMY_FIELDS = (
  210 + _taxonomy_field("product_type", "Product Type", "concise wig or hairpiece category label", "品类"),
  211 + _taxonomy_field("hair_material", "Hair Material", "hair material such as human hair, synthetic fiber, heat-resistant fiber", "发丝材质"),
  212 + _taxonomy_field("hair_texture", "Hair Texture", "texture or curl pattern such as straight, body wave, curly, kinky", "发质纹理"),
  213 + _taxonomy_field("hair_length", "Hair Length", "hair length when stated", "发长"),
  214 + _taxonomy_field("hair_color", "Hair Color", "specific hair color or blend when available", "发色"),
  215 + _taxonomy_field("cap_construction", "Cap Construction", "cap type such as full lace, lace front, glueless, U part", "帽网结构"),
  216 + _taxonomy_field("lace_area_part_type", "Lace Area / Part Type", "lace size or part style such as 13x4 lace, middle part, T part", "蕾丝面积 / 分缝类型"),
  217 + _taxonomy_field("density_volume", "Density / Volume", "hair density or fullness when supported", "密度 / 发量"),
  218 + _taxonomy_field("style_bang_type", "Style / Bang Type", "style cue such as bob, pixie, layered, with bangs", "款式 / 刘海类型"),
  219 + _taxonomy_field("occasion_end_use", "Occasion / End Use", "intended use such as daily wear, cosplay, protective style, party", "适用场景"),
  220 +)
  221 +
  222 +BEAUTY_TAXONOMY_FIELDS = (
  223 + _taxonomy_field("product_type", "Product Type", "concise beauty or cosmetics category label", "品类"),
  224 + _taxonomy_field("target_area", "Target Area", "target area such as face, lips, eyes, nails, hair, body", "适用部位"),
  225 + _taxonomy_field("skin_hair_type", "Skin Type / Hair Type", "suitable skin or hair type when supported", "肤质 / 发质"),
  226 + _taxonomy_field("finish_effect", "Finish / Effect", "cosmetic finish or effect such as matte, dewy, volumizing, brightening", "妆效 / 效果"),
  227 + _taxonomy_field("key_ingredients", "Key Ingredients", "notable ingredients when stated", "关键成分"),
  228 + _taxonomy_field("shade_color", "Shade / Color", "specific shade or color when available", "色号 / 颜色"),
  229 + _taxonomy_field("scent", "Scent", "fragrance or scent only when supported", "香味"),
  230 + _taxonomy_field("formulation", "Formulation", "product form such as cream, serum, powder, gel, stick", "剂型 / 形态"),
  231 + _taxonomy_field("functional_benefits", "Functional Benefits", "benefits such as hydration, anti-aging, long-wear, repair, sun protection", "功能"),
  232 + _taxonomy_field("use_scenario", "Use Scenario", "intended use such as daily routine, salon, travel, evening makeup", "使用场景"),
  233 +)
  234 +
  235 +ACCESSORIES_TAXONOMY_FIELDS = (
  236 + _taxonomy_field("product_type", "Product Type", "concise accessory category label such as necklace, watch, belt, hat, or sunglasses", "品类"),
  237 + _taxonomy_field("target_gender", "Target Gender", "intended gender only if clearly implied", "目标性别"),
  238 + _taxonomy_field("material", "Material", "main material such as alloy, leather, stainless steel, acetate, fabric", "材质"),
  239 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  240 + _taxonomy_field("pattern_finish", "Pattern / Finish", "surface treatment or style finish such as polished, textured, braided, rhinestone", "图案 / 表面处理"),
  241 + _taxonomy_field("closure_fastening", "Closure / Fastening", "fastening method when applicable", "闭合 / 固定方式"),
  242 + _taxonomy_field("size_fit", "Size / Fit", "size or fit information such as adjustable, one size, 42mm", "尺寸 / 适配"),
  243 + _taxonomy_field("style", "Style", "style cue such as minimalist, vintage, statement, sporty", "风格"),
  244 + _taxonomy_field("occasion_end_use", "Occasion / End Use", "likely use such as daily wear, formal, party, travel, sun protection", "适用场景"),
  245 + _taxonomy_field("set_pack_size", "Set / Pack Size", "set count or pack size when stated", "套装 / 规格"),
  246 +)
  247 +
  248 +TOYS_TAXONOMY_FIELDS = (
  249 + _taxonomy_field("product_type", "Product Type", "concise toy category label", "品类"),
  250 + _taxonomy_field("age_group", "Age Group", "intended age group when clearly implied", "年龄段"),
  251 + _taxonomy_field("character_theme", "Character / Theme", "licensed character, theme, or play theme when supported", "角色 / 主题"),
  252 + _taxonomy_field("material", "Material", "main toy material such as plush, plastic, wood, silicone", "材质"),
  253 + _taxonomy_field("power_source", "Power Source", "battery, rechargeable, wind-up, or non-powered when supported", "供电方式"),
  254 + _taxonomy_field("interactive_features", "Interactive Features", "interactive functions such as sound, lights, remote control, motion", "互动功能"),
  255 + _taxonomy_field("educational_play_value", "Educational / Play Value", "play value such as STEM, pretend play, sensory, puzzle solving", "教育 / 可玩性"),
  256 + _taxonomy_field("piece_count_size", "Piece Count / Size", "piece count or size when stated", "件数 / 尺寸"),
  257 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  258 + _taxonomy_field("use_scenario", "Use Scenario", "intended use such as indoor play, bath time, party favor, outdoor play", "使用场景"),
  259 +)
  260 +
  261 +SHOES_TAXONOMY_FIELDS = (
  262 + _taxonomy_field("product_type", "Product Type", "concise footwear category label", "品类"),
  263 + _taxonomy_field("target_gender", "Target Gender", "intended gender only if clearly implied", "目标性别"),
  264 + _taxonomy_field("age_group", "Age Group", "only if clearly implied", "年龄段"),
  265 + _taxonomy_field("closure_type", "Closure Type", "fastening method such as lace-up, slip-on, buckle, hook-and-loop", "闭合方式"),
  266 + _taxonomy_field("toe_shape", "Toe Shape", "toe shape when applicable, e.g. round toe, pointed toe, open toe", "鞋头形状"),
  267 + _taxonomy_field("heel_sole_type", "Heel Height / Sole Type", "heel or sole profile such as flat, block heel, wedge, platform, thick sole", "跟高 / 鞋底类型"),
  268 + _taxonomy_field("upper_material", "Upper Material", "main upper material such as leather, knit, canvas, mesh", "鞋面材质"),
  269 + _taxonomy_field("lining_insole_material", "Lining / Insole Material", "lining or insole material when supported", "里料 / 鞋垫材质"),
  270 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  271 + _taxonomy_field("occasion_end_use", "Occasion / End Use", "likely use such as running, casual, office, hiking, formal", "适用场景"),
  272 +)
  273 +
  274 +SPORTS_TAXONOMY_FIELDS = (
  275 + _taxonomy_field("product_type", "Product Type", "concise sports product category label", "品类"),
  276 + _taxonomy_field("sport_activity", "Sport / Activity", "primary sport or activity such as fitness, yoga, basketball, cycling, swimming", "运动 / 活动"),
  277 + _taxonomy_field("skill_level", "Skill Level", "target user level when supported, e.g. beginner, training, professional", "适用水平"),
  278 + _taxonomy_field("material", "Material", "main material such as EVA, carbon fiber, neoprene, latex", "材质"),
  279 + _taxonomy_field("size_capacity", "Size / Capacity", "size, weight, resistance level, or capacity when stated", "尺寸 / 容量"),
  280 + _taxonomy_field("protection_support", "Protection / Support", "support or protection function such as ankle support, shock absorption, impact protection", "防护 / 支撑"),
  281 + _taxonomy_field("key_features", "Key Features", "main features such as anti-slip, adjustable, foldable, quick-dry", "关键特征"),
  282 + _taxonomy_field("power_source", "Power Source", "battery, electric, or non-powered when applicable", "供电方式"),
  283 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  284 + _taxonomy_field("use_scenario", "Use Scenario", "intended use such as gym, home workout, field training, competition", "使用场景"),
  285 +)
  286 +
  287 +OTHERS_TAXONOMY_FIELDS = (
  288 + _taxonomy_field("product_type", "Product Type", "concise product category label, not a full marketing title", "品类"),
  289 + _taxonomy_field("product_category", "Product Category", "broader retail grouping when the specific product type is narrow", "商品类别"),
  290 + _taxonomy_field("target_user", "Target User", "intended user, audience, or recipient when clearly implied", "适用人群"),
  291 + _taxonomy_field("material_ingredients", "Material / Ingredients", "main material or ingredients when supported", "材质 / 成分"),
  292 + _taxonomy_field("key_features", "Key Features", "primary product attributes or standout features", "关键特征"),
  293 + _taxonomy_field("functional_benefits", "Functional Benefits", "practical benefits or performance advantages when supported", "功能"),
  294 + _taxonomy_field("size_capacity", "Size / Capacity", "size, count, weight, or capacity when stated", "尺寸 / 容量"),
  295 + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"),
  296 + _taxonomy_field("style_theme", "Style / Theme", "overall style, design theme, or visual direction when supported", "风格 / 主题"),
  297 + _taxonomy_field("use_scenario", "Use Scenario", "likely use occasion or application setting when supported", "使用场景"),
  298 +)
  299 +
  300 +CATEGORY_TAXONOMY_PROFILES: Dict[str, Dict[str, Any]] = {
  301 + "apparel": _make_taxonomy_profile(
  302 + "apparel",
  303 + APPAREL_TAXONOMY_FIELDS,
  304 + ),
  305 + "3c": _make_taxonomy_profile(
  306 + "3C",
  307 + THREE_C_TAXONOMY_FIELDS,
  308 + ),
  309 + "bags": _make_taxonomy_profile(
  310 + "bags",
  311 + BAGS_TAXONOMY_FIELDS,
  312 + ),
  313 + "pet_supplies": _make_taxonomy_profile(
  314 + "pet supplies",
  315 + PET_SUPPLIES_TAXONOMY_FIELDS,
  316 + ),
  317 + "electronics": _make_taxonomy_profile(
  318 + "electronics",
  319 + ELECTRONICS_TAXONOMY_FIELDS,
  320 + ),
  321 + "outdoor": _make_taxonomy_profile(
  322 + "outdoor products",
  323 + OUTDOOR_TAXONOMY_FIELDS,
  324 + ),
  325 + "home_appliances": _make_taxonomy_profile(
  326 + "home appliances",
  327 + HOME_APPLIANCES_TAXONOMY_FIELDS,
  328 + ),
  329 + "home_living": _make_taxonomy_profile(
  330 + "home and living",
  331 + HOME_LIVING_TAXONOMY_FIELDS,
  332 + ),
  333 + "wigs": _make_taxonomy_profile(
  334 + "wigs",
  335 + WIGS_TAXONOMY_FIELDS,
  336 + ),
  337 + "beauty": _make_taxonomy_profile(
  338 + "beauty and cosmetics",
  339 + BEAUTY_TAXONOMY_FIELDS,
  340 + ),
  341 + "accessories": _make_taxonomy_profile(
  342 + "accessories",
  343 + ACCESSORIES_TAXONOMY_FIELDS,
  344 + ),
  345 + "toys": _make_taxonomy_profile(
  346 + "toys",
  347 + TOYS_TAXONOMY_FIELDS,
  348 + ),
  349 + "shoes": _make_taxonomy_profile(
  350 + "shoes",
  351 + SHOES_TAXONOMY_FIELDS,
  352 + ),
  353 + "sports": _make_taxonomy_profile(
  354 + "sports products",
  355 + SPORTS_TAXONOMY_FIELDS,
  356 + ),
  357 + "others": _make_taxonomy_profile(
  358 + "general merchandise",
  359 + OTHERS_TAXONOMY_FIELDS,
  360 + ),
  361 +}
  362 +
  363 +TAXONOMY_SHARED_ANALYSIS_INSTRUCTION = CATEGORY_TAXONOMY_PROFILES["apparel"]["shared_instruction"]
  364 +TAXONOMY_MARKDOWN_TABLE_HEADERS_EN = CATEGORY_TAXONOMY_PROFILES["apparel"]["markdown_table_headers"]["en"]
  365 +TAXONOMY_LANGUAGE_MARKDOWN_TABLE_HEADERS: Dict[str, Dict[str, Any]] = CATEGORY_TAXONOMY_PROFILES["apparel"]["markdown_table_headers"]
  366 +
36 367 LANGUAGE_MARKDOWN_TABLE_HEADERS: Dict[str, Dict[str, Any]] = {
37 368 "en": [
38 369 "No.",
... ...
indexer/product_enrich模块说明.md 0 → 100644
... ... @@ -0,0 +1,173 @@
  1 +# Content Enrichment Module Overview
  2 +
  3 +This document describes the responsibilities, entry points, and output structure of the product content enrichment module, along with the design constraints of the current taxonomy profiles.
  4 +
  5 +## 1. Module Goals
  6 +
  7 +The content enrichment module calls an LLM on product text to generate the following index fields:
  8 +
  9 +- `qanchors`
  10 +- `enriched_tags`
  11 +- `enriched_attributes`
  12 +- `enriched_taxonomy_attributes`
  13 +
  14 +Design principles the module follows:
  15 +
  16 +- Single responsibility: handles content understanding and structured output only; no CSV I/O
  17 +- Output aligned with the ES mapping: results can be written directly into `search_products`
  18 +- Config-driven extension: taxonomy profiles are extended through data configuration rather than scattered conditional branches
  19 +- Lean code: targets normal usage only; avoids piling up patch logic for unreasonable call patterns
  20 +
  21 +## 2. Key Files
  22 +
  23 +- [product_enrich.py](/data/saas-search/indexer/product_enrich.py)
  24 + Runtime core logic: batching, caching, prompt assembly, LLM calls, markdown parsing, and output shaping
  25 +- [product_enrich_prompts.py](/data/saas-search/indexer/product_enrich_prompts.py)
  26 + Prompt templates and taxonomy profile configuration
  27 +- [document_transformer.py](/data/saas-search/indexer/document_transformer.py)
  28 + Invokes the enrichment module in the internal index-build pipeline and writes results back into the ES doc
  29 +- [taxonomy.md](/data/saas-search/indexer/taxonomy.md)
  30 + Taxonomy design notes and field inventory
  31 +
  32 +## 3. Entry Points
  33 +
  34 +### 3.1 Python entry point
  35 +
  36 +Core entry point:
  37 +
  38 +```python
  39 +build_index_content_fields(
  40 + items,
  41 + tenant_id=None,
  42 + enrichment_scopes=None,
  43 + category_taxonomy_profile=None,
  44 +)
  45 +```
  46 +
  47 +Minimum required input:
  48 +
  49 +- `id` or `spu_id`
  50 +- `title`
  51 +
  52 +Optional input:
  53 +
  54 +- `brief`
  55 +- `description`
  56 +- `image_url`
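The input contract above can be illustrated with a small guard. Note that `validate_enrich_item` is a hypothetical helper sketched here for clarity; it is not part of `product_enrich.py`:

```python
def validate_enrich_item(item: dict) -> None:
    # Hypothetical guard mirroring the documented minimum input contract:
    # at least one of id/spu_id, plus a non-empty title.
    if not (item.get("id") or item.get("spu_id")):
        raise ValueError("item requires 'id' or 'spu_id'")
    if not item.get("title"):
        raise ValueError("item requires 'title'")

# A minimal valid item; brief/description/image_url may be omitted.
validate_enrich_item({"spu_id": "223167", "title": "纯棉短袖T恤"})
```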
  57 +
  58 +Key parameters:
  59 +
  60 +- `enrichment_scopes`
  61 + Optional values: `generic`, `category_taxonomy`
  62 +- `category_taxonomy_profile`
  63 + Taxonomy profile; defaults to `apparel`
  64 +
  65 +### 3.2 HTTP entry point
  66 +
  67 +API route:
  68 +
  69 +- `POST /indexer/enrich-content`
  70 +
  71 +Related docs:
  72 +
  73 +- [搜索API对接指南-05-索引接口(Indexer)](/data/saas-search/docs/搜索API对接指南-05-索引接口(Indexer).md)
  74 +- [搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation)](/data/saas-search/docs/搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md)
  75 +
  76 +## 4. Output Structure
  77 +
  78 +The returned structure is aligned with the ES mapping:
  79 +
  80 +```json
  81 +{
  82 + "id": "223167",
  83 + "qanchors": {
  84 + "zh": ["短袖T恤", "纯棉"],
  85 + "en": ["t-shirt", "cotton"]
  86 + },
  87 + "enriched_tags": {
  88 + "zh": ["短袖", "纯棉"],
  89 + "en": ["short sleeve", "cotton"]
  90 + },
  91 + "enriched_attributes": [
  92 + {
  93 + "name": "enriched_tags",
  94 + "value": {
  95 + "zh": ["短袖", "纯棉"],
  96 + "en": ["short sleeve", "cotton"]
  97 + }
  98 + }
  99 + ],
  100 + "enriched_taxonomy_attributes": [
  101 + {
  102 + "name": "Product Type",
  103 + "value": {
  104 + "zh": ["T恤"],
  105 + "en": ["t-shirt"]
  106 + }
  107 + }
  108 + ]
  109 +}
  110 +```
  111 +
  112 +Notes:
  113 +
  114 +- The `generic` part always outputs the core index languages `zh` and `en`
  115 +- The `taxonomy` part likewise always outputs `zh` and `en`
  116 +
  117 +## 5. Taxonomy profile
  118 +
  119 +Currently supported profiles:
  120 +
  121 +- `apparel`
  122 +- `3c`
  123 +- `bags`
  124 +- `pet_supplies`
  125 +- `electronics`
  126 +- `outdoor`
  127 +- `home_appliances`
  128 +- `home_living`
  129 +- `wigs`
  130 +- `beauty`
  131 +- `accessories`
  132 +- `toys`
  133 +- `shoes`
  134 +- `sports`
  135 +- `others`
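As a sketch of how a caller might normalize a requested profile slug against this list: the real module simply defaults to `apparel` when no profile is given, and the fallback to `others` for unknown slugs below is an assumption of this sketch, not documented behavior.

```python
SUPPORTED_PROFILES = {
    "apparel", "3c", "bags", "pet_supplies", "electronics", "outdoor",
    "home_appliances", "home_living", "wigs", "beauty", "accessories",
    "toys", "shoes", "sports", "others",
}

def resolve_profile(slug):
    # Hypothetical resolver: default to "apparel" when unset (matches the
    # documented default); map unknown slugs to "others" (an assumption).
    if not slug:
        return "apparel"
    normalized = slug.lower()
    return normalized if normalized in SUPPORTED_PROFILES else "others"
```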
  136 +
  137 +Shared constraints:
  138 +
  139 +- Every profile returns `zh` + `en`
  140 +- A profile only determines the taxonomy field set; it no longer determines output languages
  141 +- Every profile configures both Chinese and English field names, keeping the prompt/header structure consistent
  142 +
  143 +## 6. Current Constraint in the Internal Indexing Pipeline
  144 +
  145 +In the internal ES document build pipeline, `document_transformer` currently passes a fixed taxonomy profile when calling the enrichment module:
  146 +
  147 +```python
  148 +category_taxonomy_profile="apparel"
  149 +```
  150 +
  151 +This is an explicit, controlled interim strategy that keeps the code clean.
  152 +
  153 +A TODO remains in the code:
  154 +
  155 +- Later, read the tenant's actual industry from the database
  156 +- Then replace the fixed `apparel` with that industry
  157 +
  158 +We deliberately avoid implicit logic that guesses the profile from product category text; it would add redundant code and unnecessary uncertainty.
  159 +
  160 +## 7. Caching and Batching
  161 +
  162 +The cache key is derived from all of the following:
  163 +
  164 +- `analysis_kind`
  165 +- `target_lang`
  166 +- prompt/schema version fingerprint
  167 +- the actual prompt input text
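A minimal sketch of deriving such a composite key (the actual key format in `product_enrich.py` may differ; `make_cache_key` and the separator choice are illustrative):

```python
import hashlib

def make_cache_key(analysis_kind: str, target_lang: str,
                   prompt_fingerprint: str, prompt_text: str) -> str:
    # Join components with an unlikely separator, then hash so the key
    # stays fixed-length regardless of prompt size.
    payload = "\x1f".join((analysis_kind, target_lang,
                           prompt_fingerprint, prompt_text))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the prompt/schema fingerprint participates in the key, bumping a prompt version naturally invalidates stale cache entries.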
  168 +
  169 +Batching rules:
  170 +
  171 +- At most 20 items per LLM call
  172 +- Callers may pass larger batches; the module splits them internally
  173 +- Uncached batches may execute concurrently
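The splitting rule above amounts to simple chunking; a sketch (the real module's internals may differ):

```python
MAX_ITEMS_PER_LLM_CALL = 20  # per the batching rule above

def split_batches(items: list, batch_size: int = MAX_ITEMS_PER_LLM_CALL) -> list:
    # Split an arbitrarily large caller batch into chunks of at most
    # batch_size items, preserving order.
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
```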
... ...
indexer/taxonomy.md 0 → 100644
... ... @@ -0,0 +1,196 @@
  1 +
  2 +# Cross-Border E-commerce Core Categories
  3 +
  4 +## 1. 3C
  5 +Phone accessories, computer peripherals, smart wearables, audio & video, smart home, gaming gear. 手机配件、电脑周边、智能穿戴、影音娱乐、智能家居、游戏设备。
  6 +
  7 +## 2. Bags 包
  8 +Handbags, backpacks, wallets, luggage, crossbody bags, tote bags. 手提包、双肩包、钱包、行李箱、斜挎包、托特包。
  9 +
  10 +## 3. Pet Supplies 宠物用品
  11 +Pet food, pet toys, pet care products, pet grooming, pet clothing, smart pet devices. 宠物食品、宠物玩具、宠物护理用品、宠物美容、宠物服装、智能宠物设备。
  12 +
  13 +## 4. Electronics 电子产品
  14 +Consumer electronics, home appliances, digital devices, cables & chargers, batteries, electronic components. 消费电子产品、家用电器、数码设备、线材充电器、电池、电子元器件。
  15 +
  16 +## 5. Clothing 服装
  17 +Women's wear, men's wear, kid's wear, underwear, outerwear, activewear. 女装、男装、童装、内衣、外套、运动服装。
  18 +
  19 +## 6. Outdoor 户外用品
  20 +Camping gear, hiking equipment, fishing supplies, outdoor clothing, travel accessories, survival tools. 露营装备、徒步用品、渔具、户外服装、旅行配件、求生工具。
  21 +
  22 +## 7. Home Appliances 家电/电器
  23 +Kitchen appliances, cleaning appliances, personal care appliances, heating & cooling, smart home devices. 厨房电器、清洁电器、个护电器、冷暖设备、智能家居设备。
  24 +
  25 +## 8. Home & Living 家居
  26 +Furniture, home textiles, lighting, kitchenware, storage, home decor. 家具、家纺、灯具、厨具、收纳、家居装饰。
  27 +
  28 +## 9. Wigs 假发
  29 +
  30 +## 10. Beauty & Cosmetics 美容美妆
  31 +Skincare, makeup, nail care, beauty tools, hair care, fragrances. 护肤品、彩妆、美甲、美容工具、护发、香水。
  32 +
  33 +## 11. Accessories 配饰
  34 +Jewelry, watches, belts, scarves, hats, sunglasses, hair accessories. 珠宝、手表、腰带、围巾、帽子、太阳镜、发饰。
  35 +
  36 +## 12. Toys 玩具
  37 +Educational toys, plush toys, action figures, puzzles, outdoor toys, DIY toys. 益智玩具、毛绒玩具、可动人偶、拼图、户外玩具、DIY玩具。
  38 +
  39 +## 13. Shoes 鞋子
  40 +Sneakers, boots, sandals, heels, flats, sports shoes. 运动鞋、靴子、凉鞋、高跟鞋、平底鞋、球鞋。
  41 +
  42 +## 14. Sports 运动产品
  43 +Fitness equipment, sports gear, team sports, racquet sports, water sports, cycling. 健身器材、运动装备、团队运动、球拍运动、水上运动、骑行。
  44 +
  45 +## 15. Others 其他
  46 +
  47 +# Taxonomy per Top-Level Category
  48 +## 1. Clothing & Apparel 服装
  49 +
  50 +### A. Product Classification
  51 +
  52 +| Section | Chinese Column Name | English Column Name |
  53 +| --- | --- | --- |
  54 +| A. Product Classification | 品类 | Product Type |
  55 +| A. Product Classification | 目标性别 | Target Gender |
  56 +| A. Product Classification | 年龄段 | Age Group |
  57 +| A. Product Classification | 适用季节 | Season |
  58 +
  59 +### B. Garment Design
  60 +
  61 +| Section | Chinese Column Name | English Column Name |
  62 +| --- | --- | --- |
  63 +| B. Garment Design | 版型 | Fit |
  64 +| B. Garment Design | 廓形 | Silhouette |
  65 +| B. Garment Design | 领型 | Neckline |
   +| B. Garment Design | 袖长 | Sleeve Length Type |
  66 +| B. Garment Design | 袖型 | Sleeve Style |
  67 +| B. Garment Design | 肩带设计 | Strap Type |
  68 +| B. Garment Design | 腰型 | Rise / Waistline |
  69 +| B. Garment Design | 裤型 | Leg Shape |
  70 +| B. Garment Design | 裙型 | Skirt Shape |
  71 +| B. Garment Design | 长度 | Length Type |
  72 +| B. Garment Design | 闭合方式 | Closure Type |
  73 +| B. Garment Design | 设计细节 | Design Details |
  74 +
  75 +### C. Material & Performance
  76 +
  77 +| Section | Chinese Column Name | English Column Name |
  78 +| --- | --- | --- |
  79 +| C. Material & Performance | 面料 | Fabric |
  80 +| C. Material & Performance | 成分 | Material Composition |
  81 +| C. Material & Performance | 面料特性 | Fabric Properties |
  82 +| C. Material & Performance | 服装特征 / 功能细节 | Clothing Features |
  83 +| C. Material & Performance | 功能 | Functional Benefits |
  84 +
  85 +### D. Merchandising Attributes
  86 +
  87 +| Section | Chinese Column Name | English Column Name |
  88 +| --- | --- | --- |
  89 +| D. Merchandising Attributes | 主颜色 | Color |
  90 +| D. Merchandising Attributes | 色系 | Color Family |
  91 +| D. Merchandising Attributes | 印花 / 图案 | Print / Pattern |
  92 +| D. Merchandising Attributes | 适用场景 | Occasion / End Use |
  93 +| D. Merchandising Attributes | 风格 | Style Aesthetic |
  94 +
  95 +
  96 +
  97 +Based on these tables, the following field list is used to produce
  98 +`enriched_taxonomy_attributes`:
  99 +
  100 +```text
  101 +Product Type
  102 +Target Gender
  103 +Age Group
  104 +Season
  105 +Fit
  106 +Silhouette
  107 +Neckline
  108 +Sleeve Length Type
  109 +Sleeve Style
  110 +Strap Type
  111 +Rise / Waistline
  112 +Leg Shape
  113 +Skirt Shape
  114 +Length Type
  115 +Closure Type
  116 +Design Details
  117 +Fabric
  118 +Material Composition
  119 +Fabric Properties
  120 +Clothing Features
  121 +Functional Benefits
  122 +Color
  123 +Color Family
  124 +Print / Pattern
  125 +Occasion / End Use
  126 +Style Aesthetic
  127 +```
  128 +
  129 +Prompt:
  130 +
  131 +```python
  132 +SHARED_ANALYSIS_INSTRUCTION = """
  133 +Analyze each input product text and fill the columns below using an apparel attribute taxonomy.
  134 +
  135 +Output columns:
  136 +1. Product Type: concise ecommerce apparel category label, not a full marketing title
  137 +2. Target Gender: intended gender only if clearly implied
  138 +3. Age Group: only if clearly implied, e.g. adults, kids, teens, toddlers, babies
  139 +4. Season: season(s) or all-season suitability only if supported
  140 +5. Fit: body closeness, e.g. slim, regular, relaxed, oversized, fitted
  141 +6. Silhouette: overall garment shape, e.g. straight, A-line, boxy, tapered, bodycon, wide-leg
  142 +7. Neckline: neckline type when applicable, e.g. crew neck, V-neck, hooded, collared, square neck
  143 +8. Sleeve Length Type: sleeve length only, e.g. sleeveless, short sleeve, long sleeve, three-quarter sleeve
  144 +9. Sleeve Style: sleeve design only, e.g. puff sleeve, raglan sleeve, batwing sleeve, bell sleeve
  145 +10. Strap Type: strap design when applicable, e.g. spaghetti strap, wide strap, halter strap, adjustable strap
  146 +11. Rise / Waistline: waist placement when applicable, e.g. high rise, mid rise, low rise, empire waist
  147 +12. Leg Shape: for bottoms only, e.g. straight leg, wide leg, flare leg, tapered leg, skinny leg
  148 +13. Skirt Shape: for skirts only, e.g. A-line, pleated, pencil, mermaid
  149 +14. Length Type: design length only, not size, e.g. cropped, regular, longline, mini, midi, maxi, ankle length, full length
  150 +15. Closure Type: fastening method when applicable, e.g. zipper, button, drawstring, elastic waist, hook-and-loop
  151 +16. Design Details: construction or visual details, e.g. ruched, ruffled, pleated, cut-out, layered, distressed, split hem
  152 +17. Fabric: fabric type only, e.g. denim, knit, chiffon, jersey, fleece, cotton twill
  153 +18. Material Composition: fiber content or blend only if stated, e.g. cotton, polyester, spandex, linen blend, 95% cotton 5% elastane
  154 +19. Fabric Properties: inherent fabric traits, e.g. stretch, breathable, lightweight, soft-touch, water-resistant
  155 +20. Clothing Features: product features, e.g. lined, reversible, hooded, packable, padded, pocketed
  156 +21. Functional Benefits: wearer benefits, e.g. moisture-wicking, thermal insulation, UV protection, easy care, supportive compression
  157 +22. Color: specific color name when available
  158 +23. Color Family: normalized broad retail color group, e.g. black, white, blue, green, red, pink, beige, brown, gray
  159 +24. Print / Pattern: surface pattern when applicable, e.g. solid, striped, plaid, floral, graphic, animal print
  160 +25. Occasion / End Use: likely use occasion only if supported, e.g. office, casual wear, streetwear, lounge, workout, outdoor
  161 +26. Style Aesthetic: overall style only if supported, e.g. minimalist, streetwear, athleisure, smart casual, romantic, playful
  162 +
  163 +Rules:
  164 +- Keep the same row order and row count as input.
  165 +- Infer only from the provided product text.
  166 +- Leave blank if not applicable or not reasonably supported.
  167 +- Use concise, standardized English ecommerce wording.
  168 +- Do not combine different attribute dimensions in one field.
  169 +- If multiple values are needed, use the delimiter required by the localization setting.
  170 +
  171 +Input product list:
  172 +"""
  173 +```
  174 +
  175 +## 2. Other taxonomy profiles
  176 +
  177 +Notes:
  178 +- All profiles uniformly return `zh` + `en`.
  179 +- Profile slugs in the code match the table below.
  180 +
  181 +| Profile | Core columns (`en`) |
  182 +| --- | --- |
  183 +| `3c` | Product Type, Compatible Device / Model, Connectivity, Interface / Port Type, Power Source / Charging, Key Features, Material / Finish, Color, Pack Size, Use Case |
  184 +| `bags` | Product Type, Target Gender, Carry Style, Size / Capacity, Material, Closure Type, Structure / Compartments, Strap / Handle Type, Color, Occasion / End Use |
  185 +| `pet_supplies` | Product Type, Pet Type, Breed Size, Life Stage, Material / Ingredients, Flavor / Scent, Key Features, Functional Benefits, Size / Capacity, Use Scenario |
  186 +| `electronics` | Product Type, Device Category / Compatibility, Power / Voltage, Connectivity, Interface / Port Type, Capacity / Storage, Key Features, Material / Finish, Color, Use Case |
  187 +| `outdoor` | Product Type, Activity Type, Season / Weather, Material, Capacity / Size, Protection / Resistance, Key Features, Portability / Packability, Color, Use Scenario |
  188 +| `home_appliances` | Product Type, Appliance Category, Power / Voltage, Capacity / Coverage, Control Method, Installation Type, Key Features, Material / Finish, Color, Use Scenario |
  189 +| `home_living` | Product Type, Room / Placement, Material, Style, Size / Dimensions, Color, Pattern / Finish, Key Features, Assembly / Installation, Use Scenario |
  190 +| `wigs` | Product Type, Hair Material, Hair Texture, Hair Length, Hair Color, Cap Construction, Lace Area / Part Type, Density / Volume, Style / Bang Type, Occasion / End Use |
  191 +| `beauty` | Product Type, Target Area, Skin Type / Hair Type, Finish / Effect, Key Ingredients, Shade / Color, Scent, Formulation, Functional Benefits, Use Scenario |
  192 +| `accessories` | Product Type, Target Gender, Material, Color, Pattern / Finish, Closure / Fastening, Size / Fit, Style, Occasion / End Use, Set / Pack Size |
  193 +| `toys` | Product Type, Age Group, Character / Theme, Material, Power Source, Interactive Features, Educational / Play Value, Piece Count / Size, Color, Use Scenario |
  194 +| `shoes` | Product Type, Target Gender, Age Group, Closure Type, Toe Shape, Heel Height / Sole Type, Upper Material, Lining / Insole Material, Color, Occasion / End Use |
  195 +| `sports` | Product Type, Sport / Activity, Skill Level, Material, Size / Capacity, Protection / Support, Key Features, Power Source, Color, Use Scenario |
  196 +| `others` | Product Type, Product Category, Target User, Material / Ingredients, Key Features, Functional Benefits, Size / Capacity, Color, Style / Theme, Use Scenario |
... ...
mappings/README.md
... ... @@ -68,6 +68,7 @@
68 68 - `option2_values`
69 69 - `option3_values`
70 70 - `enriched_attributes.value`
  71 +- `enriched_taxonomy_attributes.value`
71 72 - `specifications.value_text`
72 73  
73 74 以 `category_path` 和 `option*_values` 为例,核心语言灌入结果应至少包含:
... ...
mappings/generate_search_products_mapping.py
... ... @@ -214,6 +214,11 @@ FIELD_SPECS = [
214 214 scalar_field("name", "keyword"),
215 215 text_field("value", "core_language_text_with_keyword"),
216 216 ),
  217 + nested_field(
  218 + "enriched_taxonomy_attributes",
  219 + scalar_field("name", "keyword"),
  220 + text_field("value", "core_language_text_with_keyword"),
  221 + ),
217 222 scalar_field("option1_name", "keyword"),
218 223 scalar_field("option2_name", "keyword"),
219 224 scalar_field("option3_name", "keyword"),
... ...
mappings/search_products.json
... ... @@ -2116,6 +2116,40 @@
2116 2116 }
2117 2117 }
2118 2118 },
  2119 + "enriched_taxonomy_attributes": {
  2120 + "type": "nested",
  2121 + "properties": {
  2122 + "name": {
  2123 + "type": "keyword"
  2124 + },
  2125 + "value": {
  2126 + "type": "object",
  2127 + "properties": {
  2128 + "zh": {
  2129 + "type": "text",
  2130 + "analyzer": "index_ik",
  2131 + "search_analyzer": "query_ik",
  2132 + "fields": {
  2133 + "keyword": {
  2134 + "type": "keyword",
  2135 + "normalizer": "lowercase"
  2136 + }
  2137 + }
  2138 + },
  2139 + "en": {
  2140 + "type": "text",
  2141 + "analyzer": "english",
  2142 + "fields": {
  2143 + "keyword": {
  2144 + "type": "keyword",
  2145 + "normalizer": "lowercase"
  2146 + }
  2147 + }
  2148 + }
  2149 + }
  2150 + }
  2151 + }
  2152 + },
2119 2153 "option1_name": {
2120 2154 "type": "keyword"
2121 2155 },
... ...
perf_reports/20260311/reranker_1000docs/report.md
... ... @@ -34,5 +34,5 @@ Workload profile:
34 34 ## Reproduce
35 35  
36 36 ```bash
37   -./scripts/benchmark_reranker_1000docs.sh
  37 +./benchmarks/reranker/benchmark_reranker_1000docs.sh
38 38 ```
... ...
perf_reports/20260317/translation_local_models/README.md
1 1 # Local Translation Model Benchmark Report
2 2  
3   -Test script: [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py)
  3 +Test script: [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py)
4 4  
5 5 Test time: `2026-03-17`
6 6  
... ... @@ -67,7 +67,7 @@ To model online search query translation, we reran NLLB with `batch_size=1`. In
67 67 Command used:
68 68  
69 69 ```bash
70   -./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
  70 +./.venv-translator/bin/python benchmarks/translation/benchmark_translation_local_models.py \
71 71 --single \
72 72 --model nllb-200-distilled-600m \
73 73 --source-lang zh \
... ...
perf_reports/20260318/nllb_t4_product_names_ct2/README.md
1 1 # NLLB T4 Product-Name Tuning Summary
2 2  
3 3 测试脚本:
4   -- [`scripts/benchmark_nllb_t4_tuning.py`](/data/saas-search/scripts/benchmark_nllb_t4_tuning.py)
  4 +- [`benchmarks/translation/benchmark_nllb_t4_tuning.py`](/data/saas-search/benchmarks/translation/benchmark_nllb_t4_tuning.py)
5 5  
6 6 本轮报告:
7 7 - Markdown:[`nllb_t4_tuning_003608.md`](/data/saas-search/perf_reports/20260318/nllb_t4_product_names_ct2/nllb_t4_tuning_003608.md)
... ...
perf_reports/20260318/translation_local_models/README.md
1 1 # Local Translation Model Benchmark Report
2 2  
3 3 测试脚本:
4   -- [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py)
  4 +- [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py)
5 5  
6 6 完整结果:
7 7 - Markdown:[`translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md)
... ... @@ -39,7 +39,7 @@
39 39  
40 40 ```bash
41 41 cd /data/saas-search
42   -./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
  42 +./.venv-translator/bin/python benchmarks/translation/benchmark_translation_local_models.py \
43 43 --suite extended \
44 44 --disable-cache \
45 45 --serial-items-per-case 256 \
... ...
perf_reports/20260318/translation_local_models_ct2/README.md
1 1 # Local Translation Model Benchmark Report (CTranslate2)
2 2  
3 3 测试脚本:
4   -- [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py)
  4 +- [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py)
5 5  
6 6 本轮 CT2 结果:
7 7 - Markdown:[`translation_local_models_ct2_extended_233253.md`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2/translation_local_models_ct2_extended_233253.md)
... ... @@ -46,7 +46,7 @@ from datetime import datetime
46 46 from pathlib import Path
47 47 from types import SimpleNamespace
48 48  
49   -from scripts.benchmark_translation_local_models import (
  49 +from benchmarks.translation.benchmark_translation_local_models import (
50 50 SCENARIOS,
51 51 benchmark_extended_scenario,
52 52 build_environment_info,
... ...
perf_reports/20260318/translation_local_models_ct2_focus/README.md
1 1 # Local Translation Model Focused T4 Tuning
2 2  
3 3 测试脚本:
4   -- [`scripts/benchmark_translation_local_models_focus.py`](/data/saas-search/scripts/benchmark_translation_local_models_focus.py)
  4 +- [`benchmarks/translation/benchmark_translation_local_models_focus.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models_focus.py)
5 5  
6 6 本轮聚焦结果:
7 7 - Markdown:[`translation_local_models_focus_235018.md`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2_focus/translation_local_models_focus_235018.md)
... ...
perf_reports/README.md
... ... @@ -4,7 +4,7 @@
4 4  
5 5 | 脚本 | 用途 |
6 6 |------|------|
7   -| `scripts/perf_api_benchmark.py` | 搜索后端、向量、翻译、重排等 HTTP 接口压测;支持 `--embed-text-priority` / `--embed-image-priority` 与 `scripts/perf_cases.json.example` |
  7 +| `benchmarks/perf_api_benchmark.py` | 搜索后端、向量、翻译、重排等 HTTP 接口压测;支持 `--embed-text-priority` / `--embed-image-priority` 与 `benchmarks/perf_cases.json.example` |
8 8  
9 9 历史矩阵示例(并发扫描):
10 10  
... ... @@ -25,10 +25,10 @@
25 25  
26 26 ```bash
27 27 source activate.sh
28   -python scripts/perf_api_benchmark.py --scenario embed_text --duration 8 --concurrency 10 --timeout 30 --output perf_reports/2026-03-20_embed_text_p0.json
29   -python scripts/perf_api_benchmark.py --scenario embed_text --duration 8 --concurrency 10 --embed-text-priority 1 --output perf_reports/2026-03-20_embed_text_p1.json
30   -python scripts/perf_api_benchmark.py --scenario embed_image --duration 8 --concurrency 5 --timeout 60 --output perf_reports/2026-03-20_embed_image_p0.json
31   -python scripts/perf_api_benchmark.py --scenario embed_image --duration 8 --concurrency 5 --embed-image-priority 1 --output perf_reports/2026-03-20_embed_image_p1.json
  28 +python benchmarks/perf_api_benchmark.py --scenario embed_text --duration 8 --concurrency 10 --timeout 30 --output perf_reports/2026-03-20_embed_text_p0.json
  29 +python benchmarks/perf_api_benchmark.py --scenario embed_text --duration 8 --concurrency 10 --embed-text-priority 1 --output perf_reports/2026-03-20_embed_text_p1.json
  30 +python benchmarks/perf_api_benchmark.py --scenario embed_image --duration 8 --concurrency 5 --timeout 60 --output perf_reports/2026-03-20_embed_image_p0.json
  31 +python benchmarks/perf_api_benchmark.py --scenario embed_image --duration 8 --concurrency 5 --embed-image-priority 1 --output perf_reports/2026-03-20_embed_image_p1.json
32 32 ```
33 33  
34 34 说明:本次为 **8 秒 smoke**,与 `2026-03-12` 矩阵的时长/并发不可直接横向对比;仅验证 `priority` 参数下服务仍返回 200 且 payload 校验通过。
... ...
perf_reports/reranker_vllm_instruction/2026-03-25/RESULTS.md
... ... @@ -25,7 +25,7 @@ Shared across both backends for this run:
25 25  
26 26 ## Methodology
27 27  
28   -- Script: `python scripts/benchmark_reranker_random_titles.py 100,200,400,600,800,1000 --repeat 5` with **`--seed 99`** (see note below), **`--quiet-runs`**, **`--timeout 360`**.
  28 +- Script: `python benchmarks/reranker/benchmark_reranker_random_titles.py 100,200,400,600,800,1000 --repeat 5` with **`--seed 99`** (see note below), **`--quiet-runs`**, **`--timeout 360`**.
29 29 - Titles: default file `/home/ubuntu/rerank_test/titles.1.8w` (one title per line).
30 30 - Query: default `健身女生T恤短袖`.
31 31 - Each scenario: **3 warm-up** requests at `n=400` (not timed), then **5 timed** runs per `n`.
... ... @@ -56,9 +56,9 @@ JSON aggregates (means, stdev, raw `values_ms`): same directory, `qwen3_vllm_{co
56 56 ## Tooling added / changed
57 57  
58 58 - `reranker/server.py`: `/health` includes `instruction_format` when the active backend sets `_instruction_format`.
59   -- `scripts/benchmark_reranker_random_titles.py`: `--tag`, `--json-summary-out`, `--quiet-runs`.
60   -- `scripts/patch_rerank_vllm_benchmark_config.py`: surgical YAML patch (preserves newlines).
61   -- `scripts/run_reranker_vllm_instruction_benchmark.sh`: full matrix driver (continues if a benchmark exits non-zero; uses `--timeout 360`).
  59 +- `benchmarks/reranker/benchmark_reranker_random_titles.py`: `--tag`, `--json-summary-out`, `--quiet-runs`.
  60 +- `benchmarks/reranker/patch_rerank_vllm_benchmark_config.py`: surgical YAML patch (preserves newlines).
  61 +- `benchmarks/reranker/run_reranker_vllm_instruction_benchmark.sh`: full matrix driver (continues if a benchmark exits non-zero; uses `--timeout 360`).
62 62  
63 63 ---
64 64  
... ... @@ -73,7 +73,7 @@ JSON aggregates (means, stdev, raw `values_ms`): same directory, `qwen3_vllm_{co
73 73 | Attention | Backend forced / steered attention on T4 (e.g. `TRITON_ATTN` path) | **No** `attention_config` in `LLM(...)`; vLLM **auto** — on this T4 run, logs show **`FLASHINFER`** |
74 74 | Config surface | `vllm_attention_backend` / `RERANK_VLLM_ATTENTION_BACKEND` 等 | **Removed**(少 YAML/环境变量分支,逻辑收敛) |
75 75 | Code default `instruction_format` | `qwen3_vllm_score` 默认 `standard` | 与 `qwen3_vllm` 对齐为 **`compact`**(仍可在 YAML 写 `standard`) |
76   -| Smoke / 启动 | — | `scripts/smoke_qwen3_vllm_score_backend.py`;`scripts/start_reranker.sh` 将 **venv `bin` 置于 `PATH`**(FLASHINFER JIT 依赖 venv 内的 `ninja`) |
  76 +| Smoke / 启动 | — | `benchmarks/reranker/smoke_qwen3_vllm_score_backend.py`;`scripts/start_reranker.sh` 将 **venv `bin` 置于 `PATH`**(FLASHINFER JIT 依赖 venv 内的 `ninja`) |
77 77  
78 78 Micro-benchmark (same machine, isolated): **~927.5 ms → ~673.1 ms** at **n=400** docs on `LLM.score()` steady state (a ~**28%** reduction), after removing the forced attention path and letting vLLM pick **FLASHINFER**.
79 79  
... ...
requirements_translator_service.txt
... ... @@ -13,7 +13,8 @@ httpx>=0.24.0
13 13 tqdm>=4.65.0
14 14  
15 15 torch>=2.0.0
16   -transformers>=4.30.0
  16 +# Keep translator conversions on the last verified NLLB-compatible release line.
  17 +transformers>=4.51.0,<4.52.0
17 18 ctranslate2>=4.7.0
18 19 sentencepiece>=0.2.0
19 20 sacremoses>=0.1.1
... ...
reranker/DEPLOYMENT_AND_TUNING.md
... ... @@ -109,7 +109,7 @@ curl -sS http://127.0.0.1:6007/health
109 109 ### 5.1 Run the one-shot benchmark script
110 110  
111 111 ```bash
112   -./scripts/benchmark_reranker_1000docs.sh
  112 +./benchmarks/reranker/benchmark_reranker_1000docs.sh
113 113 ```
114 114  
115 115 Output directory:
... ...
reranker/GGUF_0_6B_INSTALL_AND_TUNING.md
... ... @@ -144,7 +144,7 @@ qwen3_gguf_06b:
144 144  
145 145 ```bash
146 146 PYTHONPATH=/data/saas-search ./.venv-reranker-gguf/bin/python \
147   - scripts/benchmark_reranker_gguf_local.py --backend-name qwen3_gguf_06b --docs 400
  147 + benchmarks/reranker/benchmark_reranker_gguf_local.py --backend-name qwen3_gguf_06b --docs 400
148 148 ```
149 149  
150 150 Start it as a service:
... ...
reranker/GGUF_INSTALL_AND_TUNING.md
... ... @@ -117,7 +117,7 @@ HF_HUB_DISABLE_XET=1
117 117  
118 118 ```bash
119 119 PYTHONPATH=/data/saas-search ./.venv-reranker-gguf/bin/python \
120   - scripts/benchmark_reranker_gguf_local.py --docs 64 --repeat 1
  120 + benchmarks/reranker/benchmark_reranker_gguf_local.py --docs 64 --repeat 1
121 121 ```
122 122  
123 123 It instantiates the GGUF backend directly and prints:
... ... @@ -134,7 +134,7 @@ PYTHONPATH=/data/saas-search ./.venv-reranker-gguf/bin/python \
134 134  
135 135 - Query: `白色oversized T-shirt`
136 136 - Docs: `64` product titles
137   -- Local script: `scripts/benchmark_reranker_gguf_local.py`
  137 +- Local script: `benchmarks/reranker/benchmark_reranker_gguf_local.py`
138 138 - One run per group; compare relative trends only
139 139  
140 140 Results:
... ... @@ -195,7 +195,7 @@ n_gpu_layers=999
195 195  
196 196 ```bash
197 197 RERANK_BASE=http://127.0.0.1:6007 \
198   - ./.venv/bin/python scripts/benchmark_reranker_random_titles.py 64 --repeat 1 --query '白色oversized T-shirt'
  198 + ./.venv/bin/python benchmarks/reranker/benchmark_reranker_random_titles.py 64 --repeat 1 --query '白色oversized T-shirt'
199 199 ```
200 200  
201 201 Output:
... ... @@ -206,7 +206,7 @@ RERANK_BASE=http://127.0.0.1:6007 \
206 206  
207 207 ```bash
208 208 RERANK_BASE=http://127.0.0.1:6007 \
209   - ./.venv/bin/python scripts/benchmark_reranker_random_titles.py 153 --repeat 1 --query '白色oversized T-shirt'
  209 + ./.venv/bin/python benchmarks/reranker/benchmark_reranker_random_titles.py 153 --repeat 1 --query '白色oversized T-shirt'
210 210 ```
211 211  
212 212 Output:
... ... @@ -276,5 +276,5 @@ offload_kqv: true
276 276 - `config/config.yaml`
277 277 - `scripts/setup_reranker_venv.sh`
278 278 - `scripts/start_reranker.sh`
279   -- `scripts/benchmark_reranker_gguf_local.py`
  279 +- `benchmarks/reranker/benchmark_reranker_gguf_local.py`
280 280 - `reranker/GGUF_INSTALL_AND_TUNING.md`
... ...
reranker/README.md
... ... @@ -46,9 +46,9 @@ Reranker 服务提供统一的 `/rerank` API,支持可插拔后端(BGE、Jin
46 46 - `backends/dashscope_rerank.py`: DashScope cloud rerank backend
47 47 - `scripts/setup_reranker_venv.sh`: creates an isolated venv per backend
48 48 - `scripts/start_reranker.sh`: starts the reranker service
49   -- `scripts/smoke_qwen3_vllm_score_backend.py`: local smoke test for `qwen3_vllm_score`
50   -- `scripts/benchmark_reranker_random_titles.py`: random-title benchmark script
51   -- `scripts/run_reranker_vllm_instruction_benchmark.sh`: legacy matrix driver
  49 +- `benchmarks/reranker/smoke_qwen3_vllm_score_backend.py`: local smoke test for `qwen3_vllm_score`
  50 +- `benchmarks/reranker/benchmark_reranker_random_titles.py`: random-title benchmark script
  51 +- `benchmarks/reranker/run_reranker_vllm_instruction_benchmark.sh`: legacy matrix driver
52 52  
53 53 ## Environment baseline
54 54  
... ... @@ -118,7 +118,7 @@ nvidia-smi
118 118 ### 4. Smoke
119 119  
120 120 ```bash
121   -PYTHONPATH=. ./.venv-reranker-score/bin/python scripts/smoke_qwen3_vllm_score_backend.py --gpu-memory-utilization 0.2
  121 +PYTHONPATH=. ./.venv-reranker-score/bin/python benchmarks/reranker/smoke_qwen3_vllm_score_backend.py --gpu-memory-utilization 0.2
122 122 ```
123 123  
124 124 ## `jina_reranker_v3`
... ...
scripts/README.md 0 → 100644
... ... @@ -0,0 +1,59 @@
  1 +# Scripts
  2 +
  3 +`scripts/` now keeps only the run, ops, environment, and data-processing scripts that are still valid under the current architecture, split into stable subdirectories by responsibility instead of piling everything flat in the root directory.
  4 +
  5 +## Current layout
  6 +
  7 +- Service orchestration
  8 + - `service_ctl.sh`
  9 + - `start_backend.sh`
  10 + - `start_indexer.sh`
  11 + - `start_frontend.sh`
  12 + - `start_eval_web.sh`
  13 + - `start_embedding_service.sh`
  14 + - `start_embedding_text_service.sh`
  15 + - `start_embedding_image_service.sh`
  16 + - `start_reranker.sh`
  17 + - `start_translator.sh`
  18 + - `start_tei_service.sh`
  19 + - `start_cnclip_service.sh`
  20 + - `stop.sh`
  21 + - `stop_tei_service.sh`
  22 + - `stop_cnclip_service.sh`
  23 + - `frontend/`
  24 + - `ops/`
  25 +
  26 +- Environment setup
  27 + - `create_venv.sh`
  28 + - `init_env.sh`
  29 + - `setup_embedding_venv.sh`
  30 + - `setup_reranker_venv.sh`
  31 + - `setup_translator_venv.sh`
  32 + - `setup_cnclip_venv.sh`
  33 +
  34 +- Data and indexing
  35 + - `create_tenant_index.sh`
  36 + - `build_suggestions.sh`
  37 + - `mock_data.sh`
  38 + - `data_import/`
  39 + - `inspect/`
  40 + - `maintenance/`
  41 +
  42 +- Evaluation and specialized tools
  43 + - `evaluation/`
  44 + - `redis/`
  45 + - `debug/`
  46 + - `translation/`
  47 +
  48 +## Moved
  49 +
  50 +- Benchmark and smoke scripts: moved to `benchmarks/`
  51 +- Manual API trial scripts: moved to `tests/manual/`
  52 +
  53 +## Removed
  54 +
  55 +- Legacy backup directory: `indexer__old_2025_11/`
  56 +- Obsolete wrapper script: `start.sh`
  57 +- Conda-era leftover: `install_server_deps.sh`
  58 +
  59 +When adding new scripts, put them in a clearly scoped subdirectory first; do not drop benchmark, manual, or legacy-backup scripts back into the root `scripts/`.
... ...
scripts/data_import/README.md 0 → 100644
... ... @@ -0,0 +1,13 @@
  1 +# Data Import Scripts
  2 +
  3 +These scripts convert external product data and CSV/XLSX samples into the Shoplazza import format.
  4 +
  5 +- `amazon_xlsx_to_shoplazza_xlsx.py`
  6 +- `competitor_xlsx_to_shoplazza_xlsx.py`
  7 +- `csv_to_excel.py`
  8 +- `csv_to_excel_multi_variant.py`
  9 +- `shoplazza_excel_template.py`
  10 +- `shoplazza_import_template.py`
  11 +- `tenant3_csv_to_shoplazza_xlsx.sh`
  12 +
  13 +These are offline data-conversion tools, not entry points for online service operations.
... ...
scripts/amazon_xlsx_to_shoplazza_xlsx.py renamed to scripts/data_import/amazon_xlsx_to_shoplazza_xlsx.py
... ... @@ -35,9 +35,10 @@ from pathlib import Path
35 35  
36 36 from openpyxl import load_workbook
37 37  
38   -# Allow running as `python scripts/xxx.py` without installing as a package
39   -sys.path.insert(0, str(Path(__file__).resolve().parent))
40   -from shoplazza_excel_template import create_excel_from_template_fast
  38 +REPO_ROOT = Path(__file__).resolve().parents[2]
  39 +sys.path.insert(0, str(REPO_ROOT))
  40 +
  41 +from scripts.data_import.shoplazza_excel_template import create_excel_from_template_fast
41 42  
42 43  
43 44 PREFERRED_OPTION_KEYS = [
... ... @@ -612,4 +613,3 @@ def main():
612 613 if __name__ == "__main__":
613 614 main()
614 615  
615   -
... ...
scripts/competitor_xlsx_to_shoplazza_xlsx.py renamed to scripts/data_import/competitor_xlsx_to_shoplazza_xlsx.py
... ... @@ -6,7 +6,7 @@ The input `data/mai_jia_jing_ling/products_data/*.xlsx` files are Amazon-format
6 6 (Parent/Child ASIN), not “competitor data”.
7 7  
8 8 Please use:
9   - - `scripts/amazon_xlsx_to_shoplazza_xlsx.py`
  9 + - `scripts/data_import/amazon_xlsx_to_shoplazza_xlsx.py`
10 10  
11 11 This wrapper simply forwards all CLI args to the correctly named script, so you
12 12 automatically get the latest performance improvements (fast read/write).
... ... @@ -15,13 +15,12 @@ automatically get the latest performance improvements (fast read/write).
15 15 import sys
16 16 from pathlib import Path
17 17  
18   -# Allow running as `python scripts/xxx.py` without installing as a package
19   -sys.path.insert(0, str(Path(__file__).resolve().parent))
  18 +REPO_ROOT = Path(__file__).resolve().parents[2]
  19 +sys.path.insert(0, str(REPO_ROOT))
20 20  
21   -from amazon_xlsx_to_shoplazza_xlsx import main as amazon_main
  21 +from scripts.data_import.amazon_xlsx_to_shoplazza_xlsx import main as amazon_main
22 22  
23 23  
24 24 if __name__ == "__main__":
25 25 amazon_main()
26 26  
27   -
... ...
scripts/csv_to_excel.py renamed to scripts/data_import/csv_to_excel.py
... ... @@ -22,12 +22,12 @@ from openpyxl import load_workbook
22 22 from openpyxl.styles import Font, Alignment
23 23 from openpyxl.utils import get_column_letter
24 24  
25   -# Shared helpers (keeps template writing consistent across scripts)
26   -from scripts.shoplazza_import_template import create_excel_from_template as _create_excel_from_template_shared
27   -from scripts.shoplazza_import_template import generate_handle as _generate_handle_shared
  25 +REPO_ROOT = Path(__file__).resolve().parents[2]
  26 +sys.path.insert(0, str(REPO_ROOT))
28 27  
29   -# Add parent directory to path
30   -sys.path.insert(0, str(Path(__file__).parent.parent))
  28 +# Shared helpers (keeps template writing consistent across scripts)
  29 +from scripts.data_import.shoplazza_import_template import create_excel_from_template as _create_excel_from_template_shared
  30 +from scripts.data_import.shoplazza_import_template import generate_handle as _generate_handle_shared
31 31  
32 32  
33 33 def clean_value(value):
... ... @@ -299,4 +299,3 @@ def main():
299 299  
300 300 if __name__ == '__main__':
301 301 main()
302   -
... ...
scripts/csv_to_excel_multi_variant.py renamed to scripts/data_import/csv_to_excel_multi_variant.py
... ... @@ -22,12 +22,12 @@ import itertools
22 22 from openpyxl import load_workbook
23 23 from openpyxl.styles import Alignment
24 24  
25   -# Shared helpers (keeps template writing consistent across scripts)
26   -from scripts.shoplazza_import_template import create_excel_from_template as _create_excel_from_template_shared
27   -from scripts.shoplazza_import_template import generate_handle as _generate_handle_shared
  25 +REPO_ROOT = Path(__file__).resolve().parents[2]
  26 +sys.path.insert(0, str(REPO_ROOT))
28 27  
29   -# Add parent directory to path
30   -sys.path.insert(0, str(Path(__file__).parent.parent))
  28 +# Shared helpers (keeps template writing consistent across scripts)
  29 +from scripts.data_import.shoplazza_import_template import create_excel_from_template as _create_excel_from_template_shared
  30 +from scripts.data_import.shoplazza_import_template import generate_handle as _generate_handle_shared
31 31  
32 32 # Color definitions
33 33 COLORS = [
... ... @@ -562,4 +562,3 @@ def main():
562 562  
563 563 if __name__ == '__main__':
564 564 main()
565   -
... ...
scripts/shoplazza_excel_template.py renamed to scripts/data_import/shoplazza_excel_template.py
scripts/shoplazza_import_template.py renamed to scripts/data_import/shoplazza_import_template.py
scripts/tenant3__csv_to_shoplazza_xlsx.sh renamed to scripts/data_import/tenant3_csv_to_shoplazza_xlsx.sh
... ... @@ -5,16 +5,16 @@ cd "$(dirname "$0")/.."
5 5 source ./activate.sh
6 6  
7 7 # # Basic usage (generate all data)
8   -# python scripts/csv_to_excel.py
  8 +# python scripts/data_import/csv_to_excel.py
9 9  
10 10 # # Specify the output file
11   -# python scripts/csv_to_excel.py --output tenant3_imports.xlsx
  11 +# python scripts/data_import/csv_to_excel.py --output tenant3_imports.xlsx
12 12  
13 13 # # Limit rows processed (for testing)
14   -# python scripts/csv_to_excel.py --limit 100
  14 +# python scripts/data_import/csv_to_excel.py --limit 100
15 15  
16 16 # Specify the CSV file and the template file
17   -python scripts/csv_to_excel.py \
  17 +python scripts/data_import/csv_to_excel.py \
18 18 --csv-file data/customer1/goods_with_pic.5years_congku.csv.shuf.1w \
19 19 --template docs/商品导入模板.xlsx \
20   - --output tenant3_imports.xlsx
21 20 \ No newline at end of file
  21 + --output tenant3_imports.xlsx
... ...
scripts/trace_indexer_calls.sh renamed to scripts/debug/trace_indexer_calls.sh
1 1 #!/bin/bash
2 2 #
3 3 # Script for tracing who is calling the indexer service
4   -# Usage: ./scripts/trace_indexer_calls.sh
  4 +# Usage: ./scripts/debug/trace_indexer_calls.sh
5 5 #
6 6  
7 7 set -euo pipefail
... ...
scripts/download_translation_models.py 100755 → 100644
1 1 #!/usr/bin/env python3
2   -"""Download local translation models declared in services.translation.capabilities."""
  2 +"""Backward-compatible entrypoint for translation model downloads."""
3 3  
4 4 from __future__ import annotations
5 5  
6   -import argparse
7   -import os
  6 +import runpy
8 7 from pathlib import Path
9   -import shutil
10   -import subprocess
11   -import sys
12   -from typing import Iterable
13   -
14   -from huggingface_hub import snapshot_download
15   -
16   -PROJECT_ROOT = Path(__file__).resolve().parent.parent
17   -if str(PROJECT_ROOT) not in sys.path:
18   - sys.path.insert(0, str(PROJECT_ROOT))
19   -os.environ.setdefault("HF_HUB_DISABLE_XET", "1")
20   -
21   -from config.services_config import get_translation_config
22   -
23   -
24   -LOCAL_BACKENDS = {"local_nllb", "local_marian"}
25   -
26   -
27   -def iter_local_capabilities(selected: set[str] | None = None) -> Iterable[tuple[str, dict]]:
28   - cfg = get_translation_config()
29   - capabilities = cfg.get("capabilities", {}) if isinstance(cfg, dict) else {}
30   - for name, capability in capabilities.items():
31   - backend = str(capability.get("backend") or "").strip().lower()
32   - if backend not in LOCAL_BACKENDS:
33   - continue
34   - if selected and name not in selected:
35   - continue
36   - yield name, capability
37   -
38   -
39   -def _compute_ct2_output_dir(capability: dict) -> Path:
40   - custom = str(capability.get("ct2_model_dir") or "").strip()
41   - if custom:
42   - return Path(custom).expanduser()
43   - model_dir = Path(str(capability.get("model_dir") or "")).expanduser()
44   - compute_type = str(capability.get("ct2_compute_type") or capability.get("torch_dtype") or "default").strip().lower()
45   - normalized = compute_type.replace("_", "-")
46   - return model_dir / f"ctranslate2-{normalized}"
47   -
48   -
49   -def _resolve_converter_binary() -> str:
50   - candidate = shutil.which("ct2-transformers-converter")
51   - if candidate:
52   - return candidate
53   - venv_candidate = Path(sys.executable).absolute().parent / "ct2-transformers-converter"
54   - if venv_candidate.exists():
55   - return str(venv_candidate)
56   - raise RuntimeError(
57   - "ct2-transformers-converter was not found. "
58   - "Install ctranslate2 in the active Python environment first."
59   - )
60   -
61   -
62   -def convert_to_ctranslate2(name: str, capability: dict) -> None:
63   - model_id = str(capability.get("model_id") or "").strip()
64   - model_dir = Path(str(capability.get("model_dir") or "")).expanduser()
65   - model_source = str(model_dir if model_dir.exists() else model_id)
66   - output_dir = _compute_ct2_output_dir(capability)
67   - if (output_dir / "model.bin").exists():
68   - print(f"[skip-convert] {name} -> {output_dir}")
69   - return
70   - quantization = str(
71   - capability.get("ct2_conversion_quantization")
72   - or capability.get("ct2_compute_type")
73   - or capability.get("torch_dtype")
74   - or "default"
75   - ).strip()
76   - output_dir.parent.mkdir(parents=True, exist_ok=True)
77   - print(f"[convert] {name} -> {output_dir} ({quantization})")
78   - subprocess.run(
79   - [
80   - _resolve_converter_binary(),
81   - "--model",
82   - model_source,
83   - "--output_dir",
84   - str(output_dir),
85   - "--quantization",
86   - quantization,
87   - ],
88   - check=True,
89   - )
90   - print(f"[converted] {name}")
91   -
92   -
93   -def main() -> None:
94   - parser = argparse.ArgumentParser(description="Download local translation models")
95   - parser.add_argument("--all-local", action="store_true", help="Download all configured local translation models")
96   - parser.add_argument("--models", nargs="*", default=[], help="Specific capability names to download")
97   - parser.add_argument(
98   - "--convert-ctranslate2",
99   - action="store_true",
100   - help="Also convert the downloaded Hugging Face models into CTranslate2 format",
101   - )
102   - args = parser.parse_args()
103   -
104   - selected = {item.strip().lower() for item in args.models if item.strip()} or None
105   - if not args.all_local and not selected:
106   - parser.error("pass --all-local or --models <name> ...")
107   -
108   - for name, capability in iter_local_capabilities(selected):
109   - model_id = str(capability.get("model_id") or "").strip()
110   - model_dir = Path(str(capability.get("model_dir") or "")).expanduser()
111   - if not model_id or not model_dir:
112   - raise ValueError(f"Capability '{name}' must define model_id and model_dir")
113   - model_dir.parent.mkdir(parents=True, exist_ok=True)
114   - print(f"[download] {name} -> {model_dir} ({model_id})")
115   - snapshot_download(
116   - repo_id=model_id,
117   - local_dir=str(model_dir),
118   - )
119   - print(f"[done] {name}")
120   - if args.convert_ctranslate2:
121   - convert_to_ctranslate2(name, capability)
122 8  
123 9  
124 10 if __name__ == "__main__":
125   - main()
  11 + target = Path(__file__).resolve().parent / "translation" / "download_translation_models.py"
  12 + runpy.run_path(str(target), run_name="__main__")
... ...
scripts/evaluation/README.md
... ... @@ -127,8 +127,8 @@ This framework now follows graded ranking evaluation closer to e-commerce best p
127 127 - **Composite tuning score: `Primary_Metric_Score`**
128 128 For experiment ranking we compute the mean of the primary scorecard after normalizing `Avg_Grade@10` by the max grade (`3`).
129 129 - **Gain scheme**
130   - `Fully Relevant=7`, `Mostly Relevant=3`, `Weakly Relevant=1`, `Irrelevant=0`
131   - The gains come from rel grades `3/2/1/0` with `gain = 2^rel - 1`, a standard `NDCG` setup.
  130 + `Fully Relevant=3`, `Mostly Relevant=2`, `Weakly Relevant=1`, `Irrelevant=0`
  131 +We keep the rel grades `3/2/1/0`, but the current implementation uses the grade values directly as gains, so the exact/high gap is less aggressive.
132 132 - **Why this is better**
133 133 `NDCG` differentiates “exact”, “strong substitute”, and “weak substitute”, so swapping a `Fully Relevant` with a `Weakly Relevant` item is penalized more than swapping a `Mostly Relevant` with a `Weakly Relevant`.
134 134  
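For reference, the direct-grade gain scheme described above can be sketched as follows. This is a minimal standalone sketch; the helper names are hypothetical and the project's actual metric implementation is not shown in this diff.

```python
import math

# Rel grades 3/2/1/0 used directly as gains (instead of the steeper 2**rel - 1).
GRADE_GAIN = {
    "Fully Relevant": 3,
    "Mostly Relevant": 2,
    "Weakly Relevant": 1,
    "Irrelevant": 0,
}

def dcg(labels):
    # Standard log2 position discount: gain / log2(rank + 1), rank starting at 1.
    return sum(GRADE_GAIN[l] / math.log2(i + 2) for i, l in enumerate(labels))

def ndcg(labels):
    # Normalize by the DCG of the same labels in ideal (descending-gain) order.
    ideal = dcg(sorted(labels, key=GRADE_GAIN.get, reverse=True))
    return dcg(labels) / ideal if ideal else 0.0
```

With direct grades, swapping a `Fully Relevant` for a `Weakly Relevant` item at a given rank costs 2 gain points; under `2^rel - 1` the same swap would cost 6, which is why the exact/high gap is described as less aggressive.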
... ... @@ -174,6 +174,22 @@ Features: query list from `queries.txt`, single-query and batch evaluation, batc
174 174  
175 175 Each run stores aggregate and per-query metrics, label distribution, timestamp, metric context (including gain scheme and primary metric), and an `/admin/config` snapshot, as Markdown and JSON under `batch_reports/`.
176 176  
  177 +To make later case analysis reproducible without digging through backend logs, each per-query record in the batch JSON now also includes:
  178 +
  179 +- `request_id` — the exact `X-Request-ID` sent by the evaluator for that live search call
  180 +- `top_label_sequence_top10` / `top_label_sequence_top20` — compact label sequence strings such as `1:L3 | 2:L1 | 3:L2`
  181 +- `top_results` — a lightweight top-20 snapshot with `rank`, `spu_id`, `label`, title fields, and `relevance_score`
  182 +
  183 +The Markdown report now surfaces the same case context in a lighter human-readable form:
  184 +
  185 +- request id
  186 +- top-10 / top-20 label sequence
  187 +- top 5 result snapshot for quick scanning
  188 +
  189 +This means a bad case can usually be reconstructed directly from the batch artifact itself, without replaying logs or joining SQLite tables by hand.
  190 +
  191 +The web history endpoint intentionally returns a compact summary only (aggregate metrics plus query count), so adding richer per-query snapshots to the batch payload does not bloat the history list UI.
  192 +
177 193 ## Ranking debug and LTR prep
178 194  
179 195 `debug_info` now exposes two extra layers that are useful for tuning and future learning-to-rank work:
... ...
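The compact label-sequence strings described above (`1:L3 | 2:L1 | 3:L2`) can be illustrated with a standalone sketch, simplified from the `_encode_label_sequence` helper added in `framework.py`; the `GRADES` mapping here mirrors the grade values, and any label outside the known set renders as `?`.

```python
# Standalone sketch of the compact label-sequence format, e.g. "1:L3 | 2:L1".
GRADES = {
    "Fully Relevant": 3,
    "Mostly Relevant": 2,
    "Weakly Relevant": 1,
    "Irrelevant": 0,
}

def encode_label_sequence(items, limit):
    parts = []
    for item in items[:limit]:
        grade = GRADES.get(str(item.get("label") or ""))
        rank = int(item.get("rank") or 0)
        # Unknown labels render as "?" instead of raising.
        parts.append(f"{rank}:L{grade}" if grade is not None else f"{rank}:?")
    return " | ".join(parts)
```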
scripts/evaluation/eval_framework/__init__.py
... ... @@ -14,10 +14,10 @@ from .constants import ( # noqa: E402
14 14 DEFAULT_ARTIFACT_ROOT,
15 15 DEFAULT_QUERY_FILE,
16 16 PROJECT_ROOT,
17   - RELEVANCE_EXACT,
18   - RELEVANCE_HIGH,
19   - RELEVANCE_IRRELEVANT,
20   - RELEVANCE_LOW,
  17 + RELEVANCE_LV0,
  18 + RELEVANCE_LV1,
  19 + RELEVANCE_LV2,
  20 + RELEVANCE_LV3,
21 21 RELEVANCE_NON_IRRELEVANT,
22 22 VALID_LABELS,
23 23 )
... ... @@ -39,10 +39,10 @@ __all__ = [
39 39 "EvalStore",
40 40 "PROJECT_ROOT",
41 41 "QueryBuildResult",
42   - "RELEVANCE_EXACT",
43   - "RELEVANCE_HIGH",
44   - "RELEVANCE_IRRELEVANT",
45   - "RELEVANCE_LOW",
  42 + "RELEVANCE_LV0",
  43 + "RELEVANCE_LV1",
  44 + "RELEVANCE_LV2",
  45 + "RELEVANCE_LV3",
46 46 "RELEVANCE_NON_IRRELEVANT",
47 47 "SearchEvaluationFramework",
48 48 "VALID_LABELS",
... ...
scripts/evaluation/eval_framework/clients.py
... ... @@ -157,6 +157,7 @@ class SearchServiceClient:
157 157 return self._request_json("GET", path, timeout=timeout)
158 158  
159 159 def search(self, query: str, size: int, from_: int = 0, language: str = "en", *, debug: bool = False) -> Dict[str, Any]:
  160 + request_id = uuid.uuid4().hex[:8]
160 161 payload: Dict[str, Any] = {
161 162 "query": query,
162 163 "size": size,
... ... @@ -165,13 +166,19 @@ class SearchServiceClient:
165 166 }
166 167 if debug:
167 168 payload["debug"] = True
168   - return self._request_json(
  169 + response = self._request_json(
169 170 "POST",
170 171 "/search/",
171 172 timeout=120,
172   - headers={"Content-Type": "application/json", "X-Tenant-ID": self.tenant_id},
  173 + headers={
  174 + "Content-Type": "application/json",
  175 + "X-Tenant-ID": self.tenant_id,
  176 + "X-Request-ID": request_id,
  177 + },
173 178 json_payload=payload,
174 179 )
  180 + response["_eval_request_id"] = request_id
  181 + return response
175 182  
176 183  
177 184 class RerankServiceClient:
... ...
scripts/evaluation/eval_framework/constants.py
... ... @@ -7,24 +7,24 @@ _SCRIPTS_EVAL_DIR = _PKG_DIR.parent
7 7 PROJECT_ROOT = _SCRIPTS_EVAL_DIR.parents[1]
8 8  
9 9 # Canonical English labels (must match LLM prompt output in prompts._CLASSIFY_TEMPLATE_EN)
10   -RELEVANCE_EXACT = "Fully Relevant"
11   -RELEVANCE_HIGH = "Mostly Relevant"
12   -RELEVANCE_LOW = "Weakly Relevant"
13   -RELEVANCE_IRRELEVANT = "Irrelevant"
  10 +RELEVANCE_LV3 = "Fully Relevant"
  11 +RELEVANCE_LV2 = "Mostly Relevant"
  12 +RELEVANCE_LV1 = "Weakly Relevant"
  13 +RELEVANCE_LV0 = "Irrelevant"
14 14  
15   -VALID_LABELS = frozenset({RELEVANCE_EXACT, RELEVANCE_HIGH, RELEVANCE_LOW, RELEVANCE_IRRELEVANT})
  15 +VALID_LABELS = frozenset({RELEVANCE_LV3, RELEVANCE_LV2, RELEVANCE_LV1, RELEVANCE_LV0})
16 16  
17 17 # Useful label sets for binary diagnostic slices layered on top of graded ranking metrics.
18   -RELEVANCE_NON_IRRELEVANT = frozenset({RELEVANCE_EXACT, RELEVANCE_HIGH, RELEVANCE_LOW})
19   -RELEVANCE_STRONG = frozenset({RELEVANCE_EXACT, RELEVANCE_HIGH})
  18 +RELEVANCE_NON_IRRELEVANT = frozenset({RELEVANCE_LV3, RELEVANCE_LV2, RELEVANCE_LV1})
  19 +RELEVANCE_STRONG = frozenset({RELEVANCE_LV3, RELEVANCE_LV2})
20 20  
21 21 # Graded relevance for ranking evaluation.
22 22 # We use rel grades 3/2/1/0 and gain = 2^rel - 1, which is standard for NDCG-style metrics.
23 23 RELEVANCE_GRADE_MAP = {
24   - RELEVANCE_EXACT: 3,
25   - RELEVANCE_HIGH: 2,
26   - RELEVANCE_LOW: 1,
27   - RELEVANCE_IRRELEVANT: 0,
  24 + RELEVANCE_LV3: 3,
  25 + RELEVANCE_LV2: 2,
  26 + RELEVANCE_LV1: 1,
  27 + RELEVANCE_LV0: 0,
28 28 }
29 29 # The standard gain formula is 2^rel - 1.
30 30 # But because annotation quality is not very precise, we deliberately soften the exact/high separation.
... ... @@ -35,11 +35,12 @@ RELEVANCE_GAIN_MAP = {
35 35 }
36 36  
37 37 # P(stop | relevance) for ERR (Expected Reciprocal Rank); cascade model (Chapelle et al., 2009).
  38 +# p(t) = (2^t - 1) / 2^{max_grade}
38 39 STOP_PROB_MAP = {
39   - RELEVANCE_EXACT: 0.99,
40   - RELEVANCE_HIGH: 0.8,
41   - RELEVANCE_LOW: 0.1,
42   - RELEVANCE_IRRELEVANT: 0.0,
  40 + RELEVANCE_LV3: 0.875,
  41 + RELEVANCE_LV2: 0.375,
  42 + RELEVANCE_LV1: 0.125,
  43 + RELEVANCE_LV0: 0.0,
43 44 }
44 45  
45 46 DEFAULT_ARTIFACT_ROOT = PROJECT_ROOT / "artifacts" / "search_evaluation"
... ... @@ -78,7 +79,7 @@ DEFAULT_REBUILD_MAX_LLM_BATCHES = 40
78 79 # A batch is "bad" when **both** hold (strict inequalities; see ``framework._annotate_rebuild_batches``):
79 80 # - irrelevant_ratio > DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO (default 93.9%),
80 81 # - (Irrelevant + Weakly Relevant) / n > DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO (default 95.9%).
81   -# ``irrelevant_ratio`` = Irrelevant count / n; weak relevance is ``RELEVANCE_LOW`` ("Weakly Relevant").
  82 +# ``irrelevant_ratio`` = Irrelevant count / n; weak relevance is ``RELEVANCE_LV1`` ("Weakly Relevant").
82 83 # Increment streak on consecutive bad batches; reset on any non-bad batch. Stop when streak
83 84 # reaches ``DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK`` (default 3).
84 85 DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO = 0.799
... ...
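The new `STOP_PROB_MAP` values follow the cascade model noted in the comment: `p(t) = (2^t - 1) / 2^max_grade` with `max_grade = 3` gives 0.875 / 0.375 / 0.125 / 0.0. A minimal ERR sketch under these stop probabilities (function name hypothetical; the project's actual ERR implementation is not shown in this diff):

```python
# Stop probabilities p(t) = (2**t - 1) / 2**max_grade, max_grade = 3.
STOP_PROB = {t: (2**t - 1) / 2**3 for t in range(4)}  # {0: 0.0, 1: 0.125, 2: 0.375, 3: 0.875}

def err(grades):
    # Expected Reciprocal Rank (cascade model, Chapelle et al., 2009):
    # the user scans top-down, stops at a result with probability
    # STOP_PROB[grade], and the credit for stopping at rank r is 1/r.
    score, p_continue = 0.0, 1.0
    for rank, grade in enumerate(grades, start=1):
        p_stop = STOP_PROB[grade]
        score += p_continue * p_stop / rank
        p_continue *= 1.0 - p_stop
    return score
```

Compared with the old hand-picked values (0.99 / 0.8 / 0.1 / 0.0), the formula-derived probabilities make a grade-2 result a much weaker stopping point, so ERR now discriminates the top ranks less sharply between grade 3 and grade 2.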
scripts/evaluation/eval_framework/framework.py
... ... @@ -25,14 +25,14 @@ from .constants import (
25 25 DEFAULT_RERANK_HIGH_SKIP_COUNT,
26 26 DEFAULT_RERANK_HIGH_THRESHOLD,
27 27 DEFAULT_SEARCH_RECALL_TOP_K,
28   - RELEVANCE_EXACT,
29 28 RELEVANCE_GAIN_MAP,
30   - RELEVANCE_HIGH,
31   - STOP_PROB_MAP,
32   - RELEVANCE_IRRELEVANT,
33   - RELEVANCE_LOW,
  29 + RELEVANCE_LV0,
  30 + RELEVANCE_LV1,
  31 + RELEVANCE_LV2,
  32 + RELEVANCE_LV3,
34 33 RELEVANCE_NON_IRRELEVANT,
35 34 VALID_LABELS,
  35 + STOP_PROB_MAP,
36 36 )
37 37 from .metrics import (
38 38 PRIMARY_METRIC_GRADE_NORMALIZER,
... ... @@ -96,6 +96,16 @@ def _zh_titles_from_debug_per_result(debug_info: Any) -&gt; Dict[str, str]:
96 96 return out
97 97  
98 98  
  99 +def _encode_label_sequence(items: Sequence[Dict[str, Any]], limit: int) -> str:
  100 + parts: List[str] = []
  101 + for item in items[:limit]:
  102 + rank = int(item.get("rank") or 0)
  103 + label = str(item.get("label") or "")
  104 + grade = RELEVANCE_GAIN_MAP.get(label)
  105 + parts.append(f"{rank}:L{grade}" if grade is not None else f"{rank}:?")
  106 + return " | ".join(parts)
  107 +
  108 +
99 109 class SearchEvaluationFramework:
100 110 def __init__(
101 111 self,
... ... @@ -168,7 +178,7 @@ class SearchEvaluationFramework:
168 178 ) -> Dict[str, Any]:
169 179 live = self.evaluate_live_query(query=query, top_k=top_k, auto_annotate=auto_annotate, language=language)
170 180 labels = [
171   - item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT
  181 + item["label"] if item["label"] in VALID_LABELS else RELEVANCE_LV0
172 182 for item in live["results"]
173 183 ]
174 184 return {
... ... @@ -432,7 +442,7 @@ class SearchEvaluationFramework:
432 442  
433 443 - ``#(Irrelevant)/n > irrelevant_stop_ratio`` (default 0.939), and
434 444 - ``( #(Irrelevant) + #(Weakly Relevant) ) / n > irrelevant_low_combined_stop_ratio``
435   - (default 0.959; weak relevance = ``RELEVANCE_LOW``).
  445 + (default 0.959; weak relevance = ``RELEVANCE_LV1``).
436 446  
437 447 Maintain a streak of consecutive *bad* batches; any non-bad batch resets the streak to 0.
438 448 Stop labeling when ``streak >= stop_streak`` (default 3) or when ``max_batches`` is reached
... ... @@ -474,9 +484,9 @@ class SearchEvaluationFramework:
474 484 time.sleep(0.1)
475 485  
476 486 n = len(batch_docs)
477   - exact_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_EXACT)
478   - irrel_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_IRRELEVANT)
479   - low_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_LOW)
  487 + exact_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_LV3)
  488 + irrel_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_LV0)
  489 + low_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_LV1)
480 490 exact_ratio = exact_n / n if n else 0.0
481 491 irrelevant_ratio = irrel_n / n if n else 0.0
482 492 low_ratio = low_n / n if n else 0.0
... ... @@ -633,7 +643,7 @@ class SearchEvaluationFramework:
633 643 )
634 644  
635 645 top100_labels = [
636   - item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT
  646 + item["label"] if item["label"] in VALID_LABELS else RELEVANCE_LV0
637 647 for item in search_labeled_results[:100]
638 648 ]
639 649 metrics = compute_query_metrics(top100_labels, ideal_labels=list(labels.values()))
... ... @@ -843,7 +853,7 @@ class SearchEvaluationFramework:
843 853 )
844 854  
845 855 top100_labels = [
846   - item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT
  856 + item["label"] if item["label"] in VALID_LABELS else RELEVANCE_LV0
847 857 for item in search_labeled_results[:100]
848 858 ]
849 859 metrics = compute_query_metrics(top100_labels, ideal_labels=list(labels.values()))
... ... @@ -920,16 +930,17 @@ class SearchEvaluationFramework:
920 930 "title_zh": title_zh if title_zh and title_zh != primary_title else "",
921 931 "image_url": doc.get("image_url"),
922 932 "label": label,
  933 + "relevance_score": doc.get("relevance_score"),
923 934 "option_values": list(compact_option_values(doc.get("skus") or [])),
924 935 "product": compact_product_payload(doc),
925 936 }
926 937 )
927 938 metric_labels = [
928   - item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT
  939 + item["label"] if item["label"] in VALID_LABELS else RELEVANCE_LV0
929 940 for item in labeled
930 941 ]
931 942 ideal_labels = [
932   - label if label in VALID_LABELS else RELEVANCE_IRRELEVANT
  943 + label if label in VALID_LABELS else RELEVANCE_LV0
933 944 for label in labels.values()
934 945 ]
935 946 label_stats = self.store.get_query_label_stats(self.tenant_id, query)
... ... @@ -960,10 +971,10 @@ class SearchEvaluationFramework:
960 971 }
961 972 )
962 973 label_order = {
963   - RELEVANCE_EXACT: 0,
964   - RELEVANCE_HIGH: 1,
965   - RELEVANCE_LOW: 2,
966   - RELEVANCE_IRRELEVANT: 3,
  974 + RELEVANCE_LV3: 0,
  975 + RELEVANCE_LV2: 1,
  976 + RELEVANCE_LV1: 2,
  977 + RELEVANCE_LV0: 3,
967 978 }
968 979 missing_relevant.sort(
969 980 key=lambda item: (
... ... @@ -989,6 +1000,7 @@ class SearchEvaluationFramework:
989 1000 "top_k": top_k,
990 1001 "metrics": compute_query_metrics(metric_labels, ideal_labels=ideal_labels),
991 1002 "metric_context": _metric_context_payload(),
  1003 + "request_id": str(search_payload.get("_eval_request_id") or ""),
992 1004 "results": labeled,
993 1005 "missing_relevant": missing_relevant,
994 1006 "label_stats": {
... ... @@ -996,9 +1008,9 @@ class SearchEvaluationFramework:
996 1008 "unlabeled_hits_treated_irrelevant": unlabeled_hits,
997 1009 "recalled_hits": len(labeled),
998 1010 "missing_relevant_count": len(missing_relevant),
999   - "missing_exact_count": sum(1 for item in missing_relevant if item["label"] == RELEVANCE_EXACT),
1000   - "missing_high_count": sum(1 for item in missing_relevant if item["label"] == RELEVANCE_HIGH),
1001   - "missing_low_count": sum(1 for item in missing_relevant if item["label"] == RELEVANCE_LOW),
  1011 + "missing_exact_count": sum(1 for item in missing_relevant if item["label"] == RELEVANCE_LV3),
  1012 + "missing_high_count": sum(1 for item in missing_relevant if item["label"] == RELEVANCE_LV2),
  1013 + "missing_low_count": sum(1 for item in missing_relevant if item["label"] == RELEVANCE_LV1),
1002 1014 },
1003 1015 "tips": tips,
1004 1016 "total": int(search_payload.get("total") or 0),
... ... @@ -1014,6 +1026,7 @@ class SearchEvaluationFramework:
1014 1026 force_refresh_labels: bool = False,
1015 1027 ) -> Dict[str, Any]:
1016 1028 per_query = []
  1029 + case_snapshot_top_n = min(max(int(top_k), 1), 20)
1017 1030 total_q = len(queries)
1018 1031 _log.info("[batch-eval] starting %s queries top_k=%s auto_annotate=%s", total_q, top_k, auto_annotate)
1019 1032 for q_index, query in enumerate(queries, start=1):
... ... @@ -1025,7 +1038,7 @@ class SearchEvaluationFramework:
1025 1038 force_refresh_labels=force_refresh_labels,
1026 1039 )
1027 1040 labels = [
1028   - item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT
  1041 + item["label"] if item["label"] in VALID_LABELS else RELEVANCE_LV0
1029 1042 for item in live["results"]
1030 1043 ]
1031 1044 per_query.append(
... ... @@ -1036,6 +1049,21 @@ class SearchEvaluationFramework:
1036 1049 "metrics": live["metrics"],
1037 1050 "distribution": label_distribution(labels),
1038 1051 "total": live["total"],
  1052 + "request_id": live.get("request_id") or "",
  1053 + "case_snapshot_top_n": case_snapshot_top_n,
  1054 + "top_label_sequence_top10": _encode_label_sequence(live["results"], 10),
  1055 + "top_label_sequence_top20": _encode_label_sequence(live["results"], case_snapshot_top_n),
  1056 + "top_results": [
  1057 + {
  1058 + "rank": int(item.get("rank") or 0),
  1059 + "spu_id": str(item.get("spu_id") or ""),
  1060 + "label": item.get("label"),
  1061 + "title": item.get("title"),
  1062 + "title_zh": item.get("title_zh"),
  1063 + "relevance_score": item.get("relevance_score"),
  1064 + }
  1065 + for item in live["results"][:case_snapshot_top_n]
  1066 + ],
1039 1067 }
1040 1068 )
1041 1069 m = live["metrics"]
... ... @@ -1055,10 +1083,10 @@ class SearchEvaluationFramework:
1055 1083 )
1056 1084 aggregate = aggregate_metrics([item["metrics"] for item in per_query])
1057 1085 aggregate_distribution = {
1058   - RELEVANCE_EXACT: sum(item["distribution"][RELEVANCE_EXACT] for item in per_query),
1059   - RELEVANCE_HIGH: sum(item["distribution"][RELEVANCE_HIGH] for item in per_query),
1060   - RELEVANCE_LOW: sum(item["distribution"][RELEVANCE_LOW] for item in per_query),
1061   - RELEVANCE_IRRELEVANT: sum(item["distribution"][RELEVANCE_IRRELEVANT] for item in per_query),
  1086 + RELEVANCE_LV3: sum(item["distribution"][RELEVANCE_LV3] for item in per_query),
  1087 + RELEVANCE_LV2: sum(item["distribution"][RELEVANCE_LV2] for item in per_query),
  1088 + RELEVANCE_LV1: sum(item["distribution"][RELEVANCE_LV1] for item in per_query),
  1089 + RELEVANCE_LV0: sum(item["distribution"][RELEVANCE_LV0] for item in per_query),
1062 1090 }
1063 1091 batch_id = f"batch_{utc_timestamp()}_{sha1_text(self.tenant_id + '|' + '|'.join(queries))[:10]}"
1064 1092 report_dir = ensure_dir(self.artifact_root / "batch_reports")
... ...
scripts/evaluation/eval_framework/metrics.py
... ... @@ -6,12 +6,12 @@ import math
6 6 from typing import Dict, Iterable, Sequence
7 7  
8 8 from .constants import (
9   - RELEVANCE_EXACT,
10 9 RELEVANCE_GAIN_MAP,
11 10 RELEVANCE_GRADE_MAP,
12   - RELEVANCE_HIGH,
13   - RELEVANCE_IRRELEVANT,
14   - RELEVANCE_LOW,
  11 + RELEVANCE_LV0,
  12 + RELEVANCE_LV1,
  13 + RELEVANCE_LV2,
  14 + RELEVANCE_LV3,
15 15 RELEVANCE_NON_IRRELEVANT,
16 16 RELEVANCE_STRONG,
17 17 STOP_PROB_MAP,
... ... @@ -33,7 +33,7 @@ PRIMARY_METRIC_GRADE_NORMALIZER = float(max(RELEVANCE_GRADE_MAP.values()) or 1.0
33 33 def _normalize_label(label: str) -> str:
34 34 if label in RELEVANCE_GRADE_MAP:
35 35 return label
36   - return RELEVANCE_IRRELEVANT
  36 + return RELEVANCE_LV0
37 37  
38 38  
39 39 def _gains_for_labels(labels: Sequence[str]) -> list[float]:
... ... @@ -135,7 +135,7 @@ def compute_query_metrics(
135 135 ideal = list(ideal_labels) if ideal_labels is not None else list(labels)
136 136 metrics: Dict[str, float] = {}
137 137  
138   - exact_hits = _binary_hits(labels, [RELEVANCE_EXACT])
  138 + exact_hits = _binary_hits(labels, [RELEVANCE_LV3])
139 139 strong_hits = _binary_hits(labels, RELEVANCE_STRONG)
140 140 useful_hits = _binary_hits(labels, RELEVANCE_NON_IRRELEVANT)
141 141  
... ... @@ -183,8 +183,8 @@ def aggregate_metrics(metric_items: Sequence[Dict[str, float]]) -&gt; Dict[str, flo
183 183  
184 184 def label_distribution(labels: Sequence[str]) -> Dict[str, int]:
185 185 return {
186   - RELEVANCE_EXACT: sum(1 for label in labels if label == RELEVANCE_EXACT),
187   - RELEVANCE_HIGH: sum(1 for label in labels if label == RELEVANCE_HIGH),
188   - RELEVANCE_LOW: sum(1 for label in labels if label == RELEVANCE_LOW),
189   - RELEVANCE_IRRELEVANT: sum(1 for label in labels if label == RELEVANCE_IRRELEVANT),
  186 + RELEVANCE_LV3: sum(1 for label in labels if label == RELEVANCE_LV3),
  187 + RELEVANCE_LV2: sum(1 for label in labels if label == RELEVANCE_LV2),
  188 + RELEVANCE_LV1: sum(1 for label in labels if label == RELEVANCE_LV1),
  189 + RELEVANCE_LV0: sum(1 for label in labels if label == RELEVANCE_LV0),
190 190 }
... ...
scripts/evaluation/eval_framework/reports.py
... ... @@ -4,7 +4,7 @@ from __future__ import annotations
4 4  
5 5 from typing import Any, Dict
6 6  
7   -from .constants import RELEVANCE_EXACT, RELEVANCE_HIGH, RELEVANCE_IRRELEVANT, RELEVANCE_LOW
  7 +from .constants import RELEVANCE_GAIN_MAP, RELEVANCE_LV0, RELEVANCE_LV1, RELEVANCE_LV2, RELEVANCE_LV3
8 8 from .metrics import PRIMARY_METRIC_KEYS
9 9  
10 10  
... ... @@ -25,6 +25,38 @@ def _append_metric_block(lines: list[str], metrics: Dict[str, Any]) -&gt; None:
25 25 lines.append(f"- {key}: {value}")
26 26  
27 27  
  28 +def _label_level_code(label: str) -> str:
  29 + grade = RELEVANCE_GAIN_MAP.get(label)
  30 + return f"L{grade}" if grade is not None else "?"
  31 +
  32 +
  33 +def _append_case_snapshot(lines: list[str], item: Dict[str, Any]) -> None:
  34 + request_id = str(item.get("request_id") or "").strip()
  35 + if request_id:
  36 + lines.append(f"- Request ID: `{request_id}`")
  37 + seq10 = str(item.get("top_label_sequence_top10") or "").strip()
  38 + if seq10:
  39 + lines.append(f"- Top-10 Labels: `{seq10}`")
  40 + seq20 = str(item.get("top_label_sequence_top20") or "").strip()
  41 + if seq20 and seq20 != seq10:
  42 + lines.append(f"- Top-20 Labels: `{seq20}`")
  43 + top_results = item.get("top_results") or []
  44 + if not top_results:
  45 + return
  46 + lines.append("- Case Snapshot:")
  47 + for result in top_results[:5]:
  48 + rank = int(result.get("rank") or 0)
  49 + label = _label_level_code(str(result.get("label") or ""))
  50 + spu_id = str(result.get("spu_id") or "")
  51 + title = str(result.get("title") or "")
  52 + title_zh = str(result.get("title_zh") or "")
  53 + relevance_score = result.get("relevance_score")
  54 + score_suffix = f" (rel={relevance_score})" if relevance_score not in (None, "") else ""
  55 + lines.append(f" - #{rank} [{label}] spu={spu_id} {title}{score_suffix}")
  56 + if title_zh:
  57 + lines.append(f" zh: {title_zh}")
  58 +
  59 +
28 60 def render_batch_report_markdown(payload: Dict[str, Any]) -> str:
29 61 lines = [
30 62 "# Search Batch Evaluation",
... ... @@ -56,10 +88,10 @@ def render_batch_report_markdown(payload: Dict[str, Any]) -&gt; str:
56 88 "",
57 89 "## Label Distribution",
58 90 "",
59   - f"- Fully Relevant: {distribution.get(RELEVANCE_EXACT, 0)}",
60   - f"- Mostly Relevant: {distribution.get(RELEVANCE_HIGH, 0)}",
61   - f"- Weakly Relevant: {distribution.get(RELEVANCE_LOW, 0)}",
62   - f"- Irrelevant: {distribution.get(RELEVANCE_IRRELEVANT, 0)}",
  91 + f"- Fully Relevant: {distribution.get(RELEVANCE_LV3, 0)}",
  92 + f"- Mostly Relevant: {distribution.get(RELEVANCE_LV2, 0)}",
  93 + f"- Weakly Relevant: {distribution.get(RELEVANCE_LV1, 0)}",
  94 + f"- Irrelevant: {distribution.get(RELEVANCE_LV0, 0)}",
63 95 ]
64 96 )
65 97 lines.extend(["", "## Per Query", ""])
... ... @@ -68,9 +100,10 @@ def render_batch_report_markdown(payload: Dict[str, Any]) -&gt; str:
68 100 lines.append("")
69 101 _append_metric_block(lines, item.get("metrics") or {})
70 102 distribution = item.get("distribution") or {}
71   - lines.append(f"- Fully Relevant: {distribution.get(RELEVANCE_EXACT, 0)}")
72   - lines.append(f"- Mostly Relevant: {distribution.get(RELEVANCE_HIGH, 0)}")
73   - lines.append(f"- Weakly Relevant: {distribution.get(RELEVANCE_LOW, 0)}")
74   - lines.append(f"- Irrelevant: {distribution.get(RELEVANCE_IRRELEVANT, 0)}")
  103 + lines.append(f"- Fully Relevant: {distribution.get(RELEVANCE_LV3, 0)}")
  104 + lines.append(f"- Mostly Relevant: {distribution.get(RELEVANCE_LV2, 0)}")
  105 + lines.append(f"- Weakly Relevant: {distribution.get(RELEVANCE_LV1, 0)}")
  106 + lines.append(f"- Irrelevant: {distribution.get(RELEVANCE_LV0, 0)}")
  107 + _append_case_snapshot(lines, item)
75 108 lines.append("")
76 109 return "\n".join(lines)
... ...
scripts/evaluation/eval_framework/static/eval_web.js
... ... @@ -190,7 +190,7 @@ async function loadQueries() {
190 190  
191 191 function historySummaryHtml(meta) {
192 192 const m = meta && meta.aggregate_metrics;
193   - const nq = (meta && meta.queries && meta.queries.length) || (meta && meta.per_query && meta.per_query.length) || null;
  193 + const nq = (meta && meta.query_count) || (meta && meta.queries && meta.queries.length) || (meta && meta.per_query && meta.per_query.length) || null;
194 194 const parts = [];
195 195 if (nq != null) parts.push(`<span>Queries</span> ${nq}`);
196 196 if (m && m["Primary_Metric_Score"] != null) parts.push(`<span>Primary</span> ${fmtNumber(m["Primary_Metric_Score"])}`);
... ...
scripts/evaluation/eval_framework/store.py
... ... @@ -23,6 +23,18 @@ class QueryBuildResult:
23 23 output_json_path: Path
24 24  
25 25  
  26 +def _compact_batch_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
  27 + return {
  28 + "batch_id": metadata.get("batch_id"),
  29 + "created_at": metadata.get("created_at"),
  30 + "tenant_id": metadata.get("tenant_id"),
  31 + "top_k": metadata.get("top_k"),
  32 + "query_count": len(metadata.get("queries") or []),
  33 + "aggregate_metrics": dict(metadata.get("aggregate_metrics") or {}),
  34 + "metric_context": dict(metadata.get("metric_context") or {}),
  35 + }
  36 +
  37 +
26 38 class EvalStore:
27 39 def __init__(self, db_path: Path):
28 40 self.db_path = db_path
... ... @@ -339,6 +351,7 @@ class EvalStore:
339 351 ).fetchall()
340 352 items: List[Dict[str, Any]] = []
341 353 for row in rows:
  354 + metadata = json.loads(row["metadata_json"])
342 355 items.append(
343 356 {
344 357 "batch_id": row["batch_id"],
... ... @@ -346,7 +359,7 @@ class EvalStore:
346 359 "output_json_path": row["output_json_path"],
347 360 "report_markdown_path": row["report_markdown_path"],
348 361 "config_snapshot_path": row["config_snapshot_path"],
349   - "metadata": json.loads(row["metadata_json"]),
  362 + "metadata": _compact_batch_metadata(metadata),
350 363 "created_at": row["created_at"],
351 364 }
352 365 )
... ...
scripts/evaluation/offline_ltr_fit.py
... ... @@ -23,11 +23,11 @@ if str(PROJECT_ROOT) not in sys.path:
23 23  
24 24 from scripts.evaluation.eval_framework.constants import (
25 25 DEFAULT_ARTIFACT_ROOT,
26   - RELEVANCE_EXACT,
27 26 RELEVANCE_GRADE_MAP,
28   - RELEVANCE_HIGH,
29   - RELEVANCE_IRRELEVANT,
30   - RELEVANCE_LOW,
  27 + RELEVANCE_LV0,
  28 + RELEVANCE_LV1,
  29 + RELEVANCE_LV2,
  30 + RELEVANCE_LV3,
31 31 )
32 32 from scripts.evaluation.eval_framework.metrics import aggregate_metrics, compute_query_metrics
33 33 from scripts.evaluation.eval_framework.store import EvalStore
... ... @@ -35,10 +35,10 @@ from scripts.evaluation.eval_framework.utils import ensure_dir, utc_timestamp
35 35  
36 36  
37 37 LABELS_BY_GRADE = {
38   - 3: RELEVANCE_EXACT,
39   - 2: RELEVANCE_HIGH,
40   - 1: RELEVANCE_LOW,
41   - 0: RELEVANCE_IRRELEVANT,
  38 + 3: RELEVANCE_LV3,
  39 + 2: RELEVANCE_LV2,
  40 + 1: RELEVANCE_LV1,
  41 + 0: RELEVANCE_LV0,
42 42 }
43 43  
44 44  
... ...
scripts/frontend/frontend_server.py 0 → 100755
... ... @@ -0,0 +1,278 @@
  1 +#!/usr/bin/env python3
  2 +"""
  3 +Simple HTTP server for saas-search frontend.
  4 +"""
  5 +
  6 +import http.server
  7 +import socketserver
  8 +import os
  9 +import sys
  10 +import logging
  11 +import time
  12 +import urllib.request
  13 +import urllib.error
  14 +from collections import defaultdict, deque
  15 +from pathlib import Path
  16 +from dotenv import load_dotenv
  17 +
  18 +# Load .env file
  19 +project_root = Path(__file__).resolve().parents[2]
  20 +load_dotenv(project_root / '.env')
  21 +
  22 +# Get API_BASE_URL from environment (not injected by default, so a stale .env cannot override the same-origin policy)
  23 +# window.API_BASE_URL is injected only when FRONTEND_INJECT_API_BASE_URL=1 is set explicitly.
  24 +API_BASE_URL = os.getenv('API_BASE_URL') or None
  25 +INJECT_API_BASE_URL = os.getenv('FRONTEND_INJECT_API_BASE_URL', '0') == '1'
  26 +# Backend proxy target for same-origin API forwarding
  27 +BACKEND_PROXY_URL = os.getenv('BACKEND_PROXY_URL', 'http://127.0.0.1:6002').rstrip('/')
  28 +
  29 +# Change to frontend directory
  30 +frontend_dir = os.path.join(project_root, 'frontend')
  31 +os.chdir(frontend_dir)
  32 +
  33 +# FRONTEND_PORT is the canonical config; keep PORT as a secondary fallback.
  34 +PORT = int(os.getenv('FRONTEND_PORT', os.getenv('PORT', 6003)))
  35 +
  36 +# Configure logging to suppress scanner noise
  37 +logging.basicConfig(level=logging.ERROR, format='%(asctime)s - %(levelname)s - %(message)s')
  38 +
  39 +class RateLimitingMixin:
  40 + """Mixin for rate limiting requests by IP address."""
  41 + request_counts = defaultdict(deque)
  42 + rate_limit = 100 # requests per minute
  43 + window = 60 # seconds
  44 +
  45 + @classmethod
  46 + def is_rate_limited(cls, ip):
  47 + now = time.time()
  48 +
  49 + # Clean old requests
  50 + while cls.request_counts[ip] and cls.request_counts[ip][0] < now - cls.window:
  51 + cls.request_counts[ip].popleft()
  52 +
  53 + # Check rate limit
  54 + if len(cls.request_counts[ip]) > cls.rate_limit:
  55 + return True
  56 +
  57 + cls.request_counts[ip].append(now)
  58 + return False
  59 +
  60 +class MyHTTPRequestHandler(http.server.SimpleHTTPRequestHandler, RateLimitingMixin):
  61 + """Custom request handler with CORS support and robust error handling."""
  62 +
  63 + _ALLOWED_CORS_HEADERS = "Content-Type, X-Tenant-ID, X-Request-ID, Referer"
  64 +
  65 + def _is_proxy_path(self, path: str) -> bool:
  66 + """Return True for API paths that should be forwarded to backend service."""
  67 + return path.startswith('/search/') or path.startswith('/admin/') or path.startswith('/indexer/')
  68 +
  69 + def _proxy_to_backend(self):
  70 + """Proxy current request to backend service on the GPU server."""
  71 + target_url = f"{BACKEND_PROXY_URL}{self.path}"
  72 + method = self.command.upper()
  73 +
  74 + try:
  75 + content_length = int(self.headers.get('Content-Length', '0'))
  76 + except ValueError:
  77 + content_length = 0
  78 + body = self.rfile.read(content_length) if content_length > 0 else None
  79 +
  80 + forward_headers = {}
  81 + for key, value in self.headers.items():
  82 + lk = key.lower()
  83 + if lk in ('host', 'content-length', 'connection'):
  84 + continue
  85 + forward_headers[key] = value
  86 +
  87 + req = urllib.request.Request(
  88 + target_url,
  89 + data=body,
  90 + headers=forward_headers,
  91 + method=method,
  92 + )
  93 +
  94 + try:
  95 + with urllib.request.urlopen(req, timeout=30) as resp:
  96 + resp_body = resp.read()
  97 + self.send_response(resp.getcode())
  98 + for header, value in resp.getheaders():
  99 + lh = header.lower()
  100 + if lh in ('transfer-encoding', 'connection', 'content-length'):
  101 + continue
  102 + self.send_header(header, value)
  103 + self.end_headers()
  104 + self.wfile.write(resp_body)
  105 + except urllib.error.HTTPError as e:
  106 + err_body = e.read() if hasattr(e, 'read') else b''
  107 + self.send_response(e.code)
  108 + if e.headers:
  109 + for header, value in e.headers.items():
  110 + lh = header.lower()
  111 + if lh in ('transfer-encoding', 'connection', 'content-length'):
  112 + continue
  113 + self.send_header(header, value)
  114 + self.end_headers()
  115 + if err_body:
  116 + self.wfile.write(err_body)
  117 + except Exception as e:
  118 + logging.error(f"Backend proxy error for {method} {self.path}: {e}")
  119 + self.send_response(502)
  120 + self.send_header('Content-Type', 'application/json; charset=utf-8')
  121 + self.end_headers()
  122 + self.wfile.write(b'{"error":"Bad Gateway: backend proxy failed"}')
  123 +
  124 + def do_GET(self):
  125 + """Handle GET requests with API config injection."""
  126 + path = self.path.split('?')[0]
  127 +
  128 + # Proxy API paths to backend first
  129 + if self._is_proxy_path(path):
  130 + self._proxy_to_backend()
  131 + return
  132 +
  133 + # Route / to index.html
  134 + if path == '/' or path == '':
  135 + self.path = '/index.html' + (self.path.split('?', 1)[1] if '?' in self.path else '')
  136 +
  137 + # Inject API config for HTML files
  138 + if self.path.endswith('.html'):
  139 + self._serve_html_with_config()
  140 + else:
  141 + super().do_GET()
  142 +
  143 + def _serve_html_with_config(self):
  144 + """Serve HTML with optional API_BASE_URL injected."""
  145 + try:
  146 + file_path = self.path.lstrip('/')
  147 + if not os.path.exists(file_path):
  148 + self.send_error(404)
  149 + return
  150 +
  151 + with open(file_path, 'r', encoding='utf-8') as f:
  152 + html = f.read()
  153 +
  154 +        # API_BASE_URL is not injected by default, so a legacy .env (e.g. http://xx:6002) cannot override same-origin calls.
  155 +        # Inject only when FRONTEND_INJECT_API_BASE_URL=1 and API_BASE_URL is non-empty.
  156 + if INJECT_API_BASE_URL and API_BASE_URL:
  157 + config_script = f'<script>window.API_BASE_URL="{API_BASE_URL}";</script>\n '
  158 + html = html.replace('<script src="/static/js/app.js', config_script + '<script src="/static/js/app.js', 1)
  159 +
  160 + self.send_response(200)
  161 + self.send_header('Content-Type', 'text/html; charset=utf-8')
  162 + self.end_headers()
  163 + self.wfile.write(html.encode('utf-8'))
  164 + except Exception as e:
  165 + logging.error(f"Error serving HTML: {e}")
  166 + self.send_error(500)
  167 +
  168 + def do_POST(self):
  169 + """Handle POST requests. Proxy API requests to backend."""
  170 + path = self.path.split('?')[0]
  171 + if self._is_proxy_path(path):
  172 + self._proxy_to_backend()
  173 + return
  174 + self.send_error(405, "Method Not Allowed")
  175 +
  176 + def setup(self):
  177 + """Setup with error handling."""
  178 + try:
  179 + super().setup()
  180 + except Exception:
  181 + pass # Silently handle setup errors from scanners
  182 +
  183 + def handle_one_request(self):
  184 + """Handle single request with error catching."""
  185 + try:
  186 + # Check rate limiting
  187 + client_ip = self.client_address[0]
  188 + if self.is_rate_limited(client_ip):
  189 + logging.warning(f"Rate limiting IP: {client_ip}")
  190 + self.send_error(429, "Too Many Requests")
  191 + return
  192 +
  193 + super().handle_one_request()
  194 + except (ConnectionResetError, BrokenPipeError):
  195 + # Client disconnected prematurely - common with scanners
  196 + pass
  197 + except UnicodeDecodeError:
  198 + # Binary data received - not HTTP
  199 + pass
  200 + except Exception as e:
  201 + # Log unexpected errors but don't crash
  202 + logging.debug(f"Request handling error: {e}")
  203 +
  204 + def log_message(self, format, *args):
  205 + """Suppress logging for malformed requests from scanners."""
  206 + message = format % args
  207 + # Filter out scanner noise
  208 + noise_patterns = [
  209 + "code 400",
  210 + "Bad request",
  211 + "Bad request version",
  212 + "Bad HTTP/0.9 request type",
  213 + "Bad request syntax"
  214 + ]
  215 + if any(pattern in message for pattern in noise_patterns):
  216 + return
  217 + # Only log legitimate requests
  218 + if message and not message.startswith(" ") and len(message) > 10:
  219 + super().log_message(format, *args)
  220 +
  221 + def end_headers(self):
  222 + # Add CORS headers
  223 + self.send_header('Access-Control-Allow-Origin', '*')
  224 + self.send_header('Access-Control-Allow-Methods', 'GET, POST, OPTIONS')
  225 + self.send_header('Access-Control-Allow-Headers', self._ALLOWED_CORS_HEADERS)
  226 + # Add security headers
  227 + self.send_header('X-Content-Type-Options', 'nosniff')
  228 + self.send_header('X-Frame-Options', 'DENY')
  229 + self.send_header('X-XSS-Protection', '1; mode=block')
  230 + super().end_headers()
  231 +
  232 + def do_OPTIONS(self):
  233 + """Handle OPTIONS requests."""
  234 + try:
  235 + path = self.path.split('?')[0]
  236 + if self._is_proxy_path(path):
  237 + self.send_response(204)
  238 + self.end_headers()
  239 + return
  240 + self.send_response(200)
  241 + self.end_headers()
  242 + except Exception:
  243 + pass
  244 +
  245 +class ThreadedTCPServer(socketserver.ThreadingMixIn, socketserver.TCPServer):
  246 + """Threaded TCP server with better error handling."""
  247 + allow_reuse_address = True
  248 + daemon_threads = True
  249 +
  250 +if __name__ == '__main__':
  251 + # Check if port is already in use
  252 + import socket
  253 + sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  254 + try:
  255 + sock.bind(("", PORT))
  256 + sock.close()
  257 + except OSError:
  258 + print(f"ERROR: Port {PORT} is already in use.")
  259 +        print("Please stop the existing server or use a different port.")
  260 + print(f"To stop existing server: kill $(lsof -t -i:{PORT})")
  261 + sys.exit(1)
  262 +
  263 + # Create threaded server for better concurrency
  264 + with ThreadedTCPServer(("", PORT), MyHTTPRequestHandler) as httpd:
  265 + print(f"Frontend server started at http://localhost:{PORT}")
  266 + print(f"Serving files from: {os.getcwd()}")
  267 + print("\nPress Ctrl+C to stop the server")
  268 +
  269 + try:
  270 + httpd.serve_forever()
  271 + except KeyboardInterrupt:
  272 + print("\nShutting down server...")
  273 + httpd.shutdown()
  274 + print("Server stopped")
  275 + sys.exit(0)
  276 + except Exception as e:
  277 + print(f"Server error: {e}")
  278 + sys.exit(1)
... ...
scripts/frontend_server.py 100755 → 100644
1 1 #!/usr/bin/env python3
2   -"""
3   -Simple HTTP server for saas-search frontend.
4   -"""
  2 +"""Backward-compatible frontend server entrypoint."""
5 3  
6   -import http.server
7   -import socketserver
8   -import os
9   -import sys
10   -import logging
11   -import time
12   -import urllib.request
13   -import urllib.error
14   -from collections import defaultdict, deque
15   -from pathlib import Path
16   -from dotenv import load_dotenv
17   -
18   -# Load .env file
19   -project_root = Path(__file__).parent.parent
20   -load_dotenv(project_root / '.env')
21   -
22   -# Get API_BASE_URL from environment (not injected by default, so a stale .env cannot override the same-origin policy)
23   -# window.API_BASE_URL is injected only when FRONTEND_INJECT_API_BASE_URL=1 is set explicitly.
24   -API_BASE_URL = os.getenv('API_BASE_URL') or None
25   -INJECT_API_BASE_URL = os.getenv('FRONTEND_INJECT_API_BASE_URL', '0') == '1'
26   -# Backend proxy target for same-origin API forwarding
27   -BACKEND_PROXY_URL = os.getenv('BACKEND_PROXY_URL', 'http://127.0.0.1:6002').rstrip('/')
28   -
29   -# Change to frontend directory
30   -frontend_dir = os.path.join(os.path.dirname(__file__), '../frontend')
31   -os.chdir(frontend_dir)
32   -
33   -# FRONTEND_PORT is the canonical config; keep PORT as a secondary fallback.
34   -PORT = int(os.getenv('FRONTEND_PORT', os.getenv('PORT', 6003)))
35   -
36   -# Configure logging to suppress scanner noise
37   -logging.basicConfig(level=logging.ERROR, format='%(asctime)s - %(levelname)s - %(message)s')
38   -
39   -class RateLimitingMixin:
40   - """Mixin for rate limiting requests by IP address."""
41   - request_counts = defaultdict(deque)
42   - rate_limit = 100 # requests per minute
43   - window = 60 # seconds
44   -
45   - @classmethod
46   - def is_rate_limited(cls, ip):
47   - now = time.time()
48   -
49   - # Clean old requests
50   - while cls.request_counts[ip] and cls.request_counts[ip][0] < now - cls.window:
51   - cls.request_counts[ip].popleft()
52   -
53   - # Check rate limit
54   - if len(cls.request_counts[ip]) > cls.rate_limit:
55   - return True
56   -
57   - cls.request_counts[ip].append(now)
58   - return False
59   -
60   -class MyHTTPRequestHandler(http.server.SimpleHTTPRequestHandler, RateLimitingMixin):
61   - """Custom request handler with CORS support and robust error handling."""
62   -
63   - def _is_proxy_path(self, path: str) -> bool:
64   - """Return True for API paths that should be forwarded to backend service."""
65   - return path.startswith('/search/') or path.startswith('/admin/') or path.startswith('/indexer/')
66   -
67   - def _proxy_to_backend(self):
68   - """Proxy current request to backend service on the GPU server."""
69   - target_url = f"{BACKEND_PROXY_URL}{self.path}"
70   - method = self.command.upper()
71   -
72   - try:
73   - content_length = int(self.headers.get('Content-Length', '0'))
74   - except ValueError:
75   - content_length = 0
76   - body = self.rfile.read(content_length) if content_length > 0 else None
  4 +from __future__ import annotations
77 5  
78   - forward_headers = {}
79   - for key, value in self.headers.items():
80   - lk = key.lower()
81   - if lk in ('host', 'content-length', 'connection'):
82   - continue
83   - forward_headers[key] = value
84   -
85   - req = urllib.request.Request(
86   - target_url,
87   - data=body,
88   - headers=forward_headers,
89   - method=method,
90   - )
91   -
92   - try:
93   - with urllib.request.urlopen(req, timeout=30) as resp:
94   - resp_body = resp.read()
95   - self.send_response(resp.getcode())
96   - for header, value in resp.getheaders():
97   - lh = header.lower()
98   - if lh in ('transfer-encoding', 'connection', 'content-length'):
99   - continue
100   - self.send_header(header, value)
101   - self.end_headers()
102   - self.wfile.write(resp_body)
103   - except urllib.error.HTTPError as e:
104   - err_body = e.read() if hasattr(e, 'read') else b''
105   - self.send_response(e.code)
106   - if e.headers:
107   - for header, value in e.headers.items():
108   - lh = header.lower()
109   - if lh in ('transfer-encoding', 'connection', 'content-length'):
110   - continue
111   - self.send_header(header, value)
112   - self.end_headers()
113   - if err_body:
114   - self.wfile.write(err_body)
115   - except Exception as e:
116   - logging.error(f"Backend proxy error for {method} {self.path}: {e}")
117   - self.send_response(502)
118   - self.send_header('Content-Type', 'application/json; charset=utf-8')
119   - self.end_headers()
120   - self.wfile.write(b'{"error":"Bad Gateway: backend proxy failed"}')
121   -
122   - def do_GET(self):
123   - """Handle GET requests with API config injection."""
124   - path = self.path.split('?')[0]
125   -
126   - # Proxy API paths to backend first
127   - if self._is_proxy_path(path):
128   - self._proxy_to_backend()
129   - return
130   -
131   - # Route / to index.html
132   - if path == '/' or path == '':
133   - self.path = '/index.html' + (self.path.split('?', 1)[1] if '?' in self.path else '')
134   -
135   - # Inject API config for HTML files
136   - if self.path.endswith('.html'):
137   - self._serve_html_with_config()
138   - else:
139   - super().do_GET()
140   -
141   - def _serve_html_with_config(self):
142   - """Serve HTML with optional API_BASE_URL injected."""
143   - try:
144   - file_path = self.path.lstrip('/')
145   - if not os.path.exists(file_path):
146   - self.send_error(404)
147   - return
148   -
149   - with open(file_path, 'r', encoding='utf-8') as f:
150   - html = f.read()
151   -
152   -        # API_BASE_URL is not injected by default, so a legacy .env (e.g. http://xx:6002) cannot override same-origin calls.
153   -        # Inject only when FRONTEND_INJECT_API_BASE_URL=1 and API_BASE_URL is non-empty.
154   - if INJECT_API_BASE_URL and API_BASE_URL:
155   - config_script = f'<script>window.API_BASE_URL="{API_BASE_URL}";</script>\n '
156   - html = html.replace('<script src="/static/js/app.js', config_script + '<script src="/static/js/app.js', 1)
157   -
158   - self.send_response(200)
159   - self.send_header('Content-Type', 'text/html; charset=utf-8')
160   - self.end_headers()
161   - self.wfile.write(html.encode('utf-8'))
162   - except Exception as e:
163   - logging.error(f"Error serving HTML: {e}")
164   - self.send_error(500)
165   -
166   - def do_POST(self):
167   - """Handle POST requests. Proxy API requests to backend."""
168   - path = self.path.split('?')[0]
169   - if self._is_proxy_path(path):
170   - self._proxy_to_backend()
171   - return
172   - self.send_error(405, "Method Not Allowed")
173   -
174   - def setup(self):
175   - """Setup with error handling."""
176   - try:
177   - super().setup()
178   - except Exception:
179   - pass # Silently handle setup errors from scanners
180   -
181   - def handle_one_request(self):
182   - """Handle single request with error catching."""
183   - try:
184   - # Check rate limiting
185   - client_ip = self.client_address[0]
186   - if self.is_rate_limited(client_ip):
187   - logging.warning(f"Rate limiting IP: {client_ip}")
188   - self.send_error(429, "Too Many Requests")
189   - return
190   -
191   - super().handle_one_request()
192   - except (ConnectionResetError, BrokenPipeError):
193   - # Client disconnected prematurely - common with scanners
194   - pass
195   - except UnicodeDecodeError:
196   - # Binary data received - not HTTP
197   - pass
198   - except Exception as e:
199   - # Log unexpected errors but don't crash
200   - logging.debug(f"Request handling error: {e}")
201   -
202   - def log_message(self, format, *args):
203   - """Suppress logging for malformed requests from scanners."""
204   - message = format % args
205   - # Filter out scanner noise
206   - noise_patterns = [
207   - "code 400",
208   - "Bad request",
209   - "Bad request version",
210   - "Bad HTTP/0.9 request type",
211   - "Bad request syntax"
212   - ]
213   - if any(pattern in message for pattern in noise_patterns):
214   - return
215   - # Only log legitimate requests
216   - if message and not message.startswith(" ") and len(message) > 10:
217   - super().log_message(format, *args)
218   -
219   - def end_headers(self):
220   - # Add CORS headers
221   - self.send_header('Access-Control-Allow-Origin', '*')
222   - self.send_header('Access-Control-Allow-Methods', 'GET, POST, OPTIONS')
223   - self.send_header('Access-Control-Allow-Headers', 'Content-Type')
224   - # Add security headers
225   - self.send_header('X-Content-Type-Options', 'nosniff')
226   - self.send_header('X-Frame-Options', 'DENY')
227   - self.send_header('X-XSS-Protection', '1; mode=block')
228   - super().end_headers()
229   -
230   - def do_OPTIONS(self):
231   - """Handle OPTIONS requests."""
232   - try:
233   - path = self.path.split('?')[0]
234   - if self._is_proxy_path(path):
235   - self.send_response(204)
236   - self.end_headers()
237   - return
238   - self.send_response(200)
239   - self.end_headers()
240   - except Exception:
241   - pass
242   -
243   -class ThreadedTCPServer(socketserver.ThreadingMixIn, socketserver.TCPServer):
244   - """Threaded TCP server with better error handling."""
245   - allow_reuse_address = True
246   - daemon_threads = True
  6 +import runpy
  7 +from pathlib import Path
247 8  
248   -if __name__ == '__main__':
249   - # Check if port is already in use
250   - import socket
251   - sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
252   - try:
253   - sock.bind(("", PORT))
254   - sock.close()
255   - except OSError:
256   - print(f"ERROR: Port {PORT} is already in use.")
257   - print(f"Please stop the existing server or use a different port.")
258   - print(f"To stop existing server: kill $(lsof -t -i:{PORT})")
259   - sys.exit(1)
260   -
261   - # Create threaded server for better concurrency
262   - with ThreadedTCPServer(("", PORT), MyHTTPRequestHandler) as httpd:
263   - print(f"Frontend server started at http://localhost:{PORT}")
264   - print(f"Serving files from: {os.getcwd()}")
265   - print("\nPress Ctrl+C to stop the server")
266 9  
267   - try:
268   - httpd.serve_forever()
269   - except KeyboardInterrupt:
270   - print("\nShutting down server...")
271   - httpd.shutdown()
272   - print("Server stopped")
273   - sys.exit(0)
274   - except Exception as e:
275   - print(f"Server error: {e}")
276   - sys.exit(1)
  10 +if __name__ == "__main__":
  11 + target = Path(__file__).resolve().parent / "frontend" / "frontend_server.py"
  12 + runpy.run_path(str(target), run_name="__main__")
... ...
scripts/inspect/README.md 0 → 100644
... ... @@ -0,0 +1,10 @@
  1 +# Inspect Scripts
  2 +
  3 +These scripts are used for one-off diagnostics, index inspection, and data verification:
  4 +
  5 +- `check_data_source.py`
  6 +- `check_es_data.py`
  7 +- `check_index_mapping.py`
  8 +- `compare_index_mappings.py`
  9 +
  10 +They depend on a live DB / ES environment and are not part of CI tests or benchmarks.
... ...
scripts/check_data_source.py renamed to scripts/inspect/check_data_source.py
... ... @@ -14,8 +14,8 @@ import argparse
14 14 from pathlib import Path
15 15 from sqlalchemy import create_engine, text
16 16  
17   -# Add parent directory to path
18   -sys.path.insert(0, str(Path(__file__).parent.parent))
  17 +# Add repo root to path
  18 +sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
19 19  
20 20 from utils.db_connector import create_db_connection
21 21  
... ... @@ -298,4 +298,3 @@ def main():
298 298  
299 299 if __name__ == '__main__':
300 300 sys.exit(main())
301   -
... ...
scripts/check_es_data.py renamed to scripts/inspect/check_es_data.py
... ... @@ -8,7 +8,7 @@ import os
8 8 import argparse
9 9 from pathlib import Path
10 10  
11   -sys.path.insert(0, str(Path(__file__).parent.parent))
  11 +sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
12 12  
13 13 from utils.es_client import ESClient
14 14  
... ... @@ -265,4 +265,3 @@ def main():
265 265  
266 266 if __name__ == '__main__':
267 267 sys.exit(main())
268   -
... ...
scripts/check_index_mapping.py renamed to scripts/inspect/check_index_mapping.py
... ... @@ -8,7 +8,7 @@ import sys
8 8 import json
9 9 from pathlib import Path
10 10  
11   -sys.path.insert(0, str(Path(__file__).parent.parent))
  11 +sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
12 12  
13 13 from utils.es_client import get_es_client_from_env
14 14 from indexer.mapping_generator import get_tenant_index_name
... ...