Compare View
Commits (19)
-
Previously, both `b` and `k1` were set to `0.0`. The original intention was to avoid two common issues in e-commerce search relevance:

1. **Over-penalizing longer product titles.** In product search, a shorter title should not automatically rank higher just because BM25 favors shorter fields. For a query like "遥控车", a product whose title is simply "遥控车" is not necessarily a better candidate than a product with a slightly longer but more descriptive title. In practice, extremely short titles may even indicate lower-quality catalog data.

2. **Over-rewarding repeated occurrences of the same term.** For longer queries such as "遥控喷雾翻滚多功能车玩具车", the default BM25 behavior may give too much weight to a term that appears multiple times (for example "遥控"), even when other important query terms such as "喷雾" or "翻滚" are missing. This can cause products with repeated partial matches to outrank products that actually cover more of the user intent.

Setting both parameters to zero was an intentional way to suppress length normalization and term-frequency amplification. However, after introducing a `combined_fields` query, this configuration becomes too aggressive. Since `combined_fields` scores multiple fields as a unified relevance signal, completely disabling both effects may also remove useful ranking information, especially when we still want documents matching more query terms across fields to be distinguishable from weaker matches.

This update therefore relaxes the previous setting and reintroduces a controlled amount of BM25 normalization/scoring behavior. The goal is to keep the original intent (avoiding short-title bias and excessive repeated-term gain) while allowing the combined query to better preserve meaningful relevance differences across candidates.
Expected effect:
- reduce the bias toward unnaturally short product titles
- limit score inflation caused by repeated occurrences of the same term
- improve ranking stability for `combined_fields` queries
- better reward candidates that cover more of the overall query intent, instead of those that only repeat a subset of terms
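As a sketch of how such partially re-enabled BM25 behavior is typically declared, the index settings can define a custom `BM25` similarity with small but nonzero `b` and `k1`. The similarity name `title_bm25` and the `0.3`/`0.5` values below are illustrative assumptions, not the values shipped in this change:

```python
# Sketch: index settings re-enabling a controlled amount of BM25
# length normalization (b) and term-frequency saturation (k1).
# The similarity name and parameter values are illustrative only.
def build_index_settings(b: float = 0.3, k1: float = 0.5) -> dict:
    return {
        "settings": {
            "index": {
                "similarity": {
                    "title_bm25": {"type": "BM25", "b": b, "k1": k1}
                }
            }
        }
    }

settings = build_index_settings()
print(settings["settings"]["index"]["similarity"]["title_bm25"])
```

Fields such as the product title would then reference `title_bm25` in the mapping, so the relaxed normalization applies only where intended.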
-
Field generation
- Added taxonomy attribute enrichment, following the same field structure and processing logic as `enriched_attributes`; only the prompt and the parsed dimensions differ
- Introduced an `AnalysisSchema` abstraction so content enrichment and taxonomy enrichment share batching, caching, prompt construction, Markdown parsing, and normalization
- Refactored the existing enrichment pipeline in `product_enrich.py`, extracting shared logic into functions such as `_process_batch_for_schema` and `_parse_markdown_to_attributes` to eliminate duplication
- Added the taxonomy prompt template (`TAXONOMY_ANALYSIS_PROMPT`) and Markdown header definitions (`TAXONOMY_HEADERS`) to `product_enrich_prompts.py`
- Fixed the Markdown parser's handling of empty cells: the previous implementation skipped empty cells and misaligned columns; empty values are now preserved so sparse taxonomy attribute columns align correctly
- Updated `build_index_content_fields` in `document_transformer.py` to write `enriched_taxonomy_attributes` (zh/en) into the final index document
- Adjusted the related unit tests (`test_product_enrich_partial_mode.py`, etc.) to cover the new field paths; all pass (14 passed)

Technical details:
- `AnalysisSchema` carries metadata such as `schema_name`, `prompt_template`, `headers`, and `field_name_prefix`
- Cache keys distinguish content vs. taxonomy: `enrich:{schema_name}:{product_id}`, avoiding cache pollution
- Taxonomy parsing uses the same nested structure as `enriched_attributes`: `{"attribute_key": "value"}`, with multi-row table support
- Batch size and retry logic match the existing content enrichment
-
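A minimal sketch of what such a shared schema abstraction could look like; the field names follow the commit message, but the dataclass body and example values are assumptions:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AnalysisSchema:
    """Metadata shared by the content and taxonomy enrichment pipelines."""
    schema_name: str        # e.g. "content" or "taxonomy"
    prompt_template: str    # prompt sent to the LLM, with placeholders
    headers: List[str]      # expected Markdown table headers
    field_name_prefix: str  # prefix of the output field in the ES doc

    def cache_key(self, product_id: str) -> str:
        # Keys are namespaced per schema so content/taxonomy never collide.
        return f"enrich:{self.schema_name}:{product_id}"


taxonomy_schema = AnalysisSchema(
    schema_name="taxonomy",
    prompt_template="Classify this product: {title}",
    headers=["attribute_key", "value"],
    field_name_prefix="enriched_taxonomy",
)
print(taxonomy_schema.cache_key("SPU-1"))  # enrich:taxonomy:SPU-1
```

With this shape, `_process_batch_for_schema` can treat both pipelines identically and only consult the schema for prompts, headers, and key namespacing.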
- The `/indexer/enrich-content` route now returns `enriched_taxonomy_attributes` alongside `enriched_attributes`
- Added an optional request parameter `analysis_kinds` (default `["content", "taxonomy"]`) so callers can select analysis types on demand, leaving room for future extension and cost control
- Reworked the caching strategy: the `content` and `taxonomy` analyses are now fully isolated; the cache key includes the prompt template, headers, and output field definitions (i.e. the schema fingerprint), so it invalidates automatically when the prompt or parsing rules change
- Cache keys depend only on the fields that actually feed the LLM (`title`, `brief`, `description`); `image_url`, `tenant_id`, and `spu_id` no longer pollute the key, improving hit rates
- Updated the API docs (`docs/搜索API对接指南-05-索引接口(Indexer).md`) to describe the new parameter and response fields

Technical details:
- Route layer: the enrich-content endpoint in `api/routes/indexer.py` now explicitly includes the `enriched_taxonomy_attributes` field returned by `product_enrich.enrich_products_batch` in the HTTP response body
- `analysis_kinds` is passed through to the underlying `enrich_products_batch`, so one analysis type can be skipped (e.g. fewer LLM calls when only taxonomy is needed)
- Cache fingerprints are computed in `_get_cache_key` in `product_enrich.py`, generated independently per `AnalysisSchema`; versioning is included implicitly via `schema.version` or a hash of the prompt content
- Test coverage: added `analysis_kinds` combination scenarios and cache isolation tests
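The cache-key idea above can be sketched as hashing only the LLM-facing fields together with a schema fingerprint. The helper name and hash layout are assumptions; the real `_get_cache_key` may differ:

```python
import hashlib
import json


def make_cache_key(schema_name: str, prompt_template: str,
                   headers: list, item: dict) -> str:
    # Only fields that actually feed the LLM participate in the key;
    # image_url / tenant_id / spu_id are deliberately excluded.
    llm_input = {k: item.get(k, "") for k in ("title", "brief", "description")}
    payload = {"prompt": prompt_template, "headers": headers, "input": llm_input}
    fingerprint = hashlib.sha256(
        json.dumps(payload, sort_keys=True, ensure_ascii=False).encode("utf-8")
    ).hexdigest()[:16]
    return f"enrich:{schema_name}:{fingerprint}"


a = make_cache_key("taxonomy", "P1", ["k", "v"],
                   {"title": "遥控车", "tenant_id": "1"})
b = make_cache_key("taxonomy", "P1", ["k", "v"],
                   {"title": "遥控车", "tenant_id": "2"})
print(a == b)  # True: tenant_id does not affect the key
```

Because the prompt template and headers are hashed into the key, editing either invalidates all cached entries for that schema without any manual version bump.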
-
category_taxonomy_profile
- The old `analysis_kinds` conflated "enrichment type" (content/taxonomy) with "category-specific configuration", which made it hard to extend taxonomy analysis to other categories (3C, home, etc.)
- Added an `enrichment_scopes` parameter supporting `generic` (general enrichment, producing qanchors/enriched_tags/enriched_attributes) and `category_taxonomy` (category enrichment, producing enriched_taxonomy_attributes)
- Added a `category_taxonomy_profile` parameter specifying which profile category enrichment uses (currently `apparel` is built in); each profile has its own prompt, output column definitions, parsing rules, and cache version
- Kept `analysis_kinds` as a compatibility alias to avoid breaking existing callers
- Refactored the internal taxonomy analysis into a profile registry pattern: a new `_get_taxonomy_schema(profile_name)` function returns the matching `AnalysisSchema` for a given profile
- Cache keys are now isolated by "analysis type + profile + schema fingerprint + input field hash", so different categories and different prompt versions invalidate automatically
- Updated the API docs and microservice interface docs with the new parameter semantics and usage examples

Technical details:
- Entry point: the enrich-content endpoint in `api/routes/indexer.py` parses the new parameters and passes them down
- Core logic: `enrich_products_batch` in `indexer/product_enrich.py` gains a profile parameter; `_process_batch_for_schema` resolves the schema dynamically from scope and profile
- Compatibility layer: if a request supplies `analysis_kinds`, it is mapped to `enrichment_scopes` (content→generic, taxonomy→category_taxonomy); `category_taxonomy_profile` defaults to "apparel"
- Test coverage: added tests for `enrichment_scopes` combinations, profile switching, and compatibility mode
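The compatibility mapping above can be sketched as follows; the helper name is hypothetical, while the mapping and defaults follow the commit message:

```python
from typing import List, Optional

# Legacy analysis_kinds values map onto the new enrichment_scopes values.
_KIND_TO_SCOPE = {"content": "generic", "taxonomy": "category_taxonomy"}


def resolve_enrichment_scopes(
    enrichment_scopes: Optional[List[str]] = None,
    analysis_kinds: Optional[List[str]] = None,
) -> List[str]:
    # The new parameter wins; the deprecated alias is mapped;
    # when neither is given, both scopes run.
    if enrichment_scopes:
        return list(enrichment_scopes)
    if analysis_kinds:
        return [_KIND_TO_SCOPE[k] for k in analysis_kinds]
    return ["generic", "category_taxonomy"]


print(resolve_enrichment_scopes(analysis_kinds=["taxonomy"]))  # ['category_taxonomy']
```

This matches the precedence implemented by `resolved_enrichment_scopes()` in the request model: explicit `enrichment_scopes` first, then the alias, then the run-everything default.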
-
This iteration substantially refactors the content enrichment module of the retrieval system, expanding the previously hard-coded apparel-only category support to every category defined in taxonomy.md, while restructuring the code to lower the cost of adding new categories. The core design uses a registry pattern (profile registry), batches by category profile, and clearly separates bilingual (zh+en) from English-only (en) output strategies.

Changes

1. Expanded category support
- New categories: 3c, bags, pet_supplies, electronics, outdoor, home_appliances, home_living, wigs, beauty, accessories, toys, shoes, sports, others
- All new categories return only `en` fields in the taxonomy output, avoiding multilingual field bloat
- The apparel category keeps bilingual output (zh + en) for backward business compatibility

2. Core refactoring
- `indexer/product_enrich.py`
  - Added the `TAXONOMY_PROFILES` registry, defining each category's output languages, prompt mapping, and taxonomy field set in a data-driven way
  - Rewrote `_enrich_taxonomy_batch` to batch LLM calls grouped by profile, avoiding a separate branch per category
  - Added `_infer_profile_from_category()`, which infers the profile from an SPU's category field (used on the internal indexing path, fixing mixed catalogs defaulting to apparel)
- `indexer/product_enrich_prompts.py`
  - Refactored the single apparel prompt into a `PROMPT_TEMPLATES` dict keyed by profile
  - All non-apparel categories share one slimmed-down prompt template that requests only `en` fields
- `indexer/document_transformer.py`
  - Passes category information when building enrichment requests so downstream can route by profile
  - Adjusted `_build_enrich_batch` so batch requests support mixed categories with correct grouping
- `indexer/indexer.py` (API layer)
  - The `/indexer/enrich-content` request model gains an optional `category_profile` field so callers can specify the category explicitly; when absent, the server infers it
  - Updated parameter validation and error handling, including support for fallback categories such as `others`

3. Documentation updates
- `docs/搜索API对接指南-05-索引接口(Indexer).md`: documented the category profile parameter and noted that non-apparel taxonomy returns only `en` fields
- `docs/搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md`: updated the enrichment microservice call examples to reflect multi-category grouped batching
- `taxonomy.md`: added per-category field lists, making `en` the sole output for all non-apparel categories

Technical details
- **Registry design**:
```python
TAXONOMY_PROFILES = {
    "apparel": {"lang": ["zh", "en"], "prompt_key": "apparel", "fields": [...]},
    "3c": {"lang": ["en"], "prompt_key": "default", "fields": [...]},
    # ...
}
```
Adding a category only requires a new registry entry plus a matching `prompt_key` in `PROMPT_TEMPLATES`; no control-flow changes are needed.
- **Batching by profile**:
  - Before: all products shared the single apparel prompt, so non-apparel products were filled incorrectly.
  - After: `_enrich_taxonomy_batch` first groups products by profile, builds one LLM request per group, then merges responses back into the original order. Group granularity is configurable to avoid the request overhead of many tiny groups.
- **Automatic category inference**:
  - For internal indexing (when the enrichment endpoint is not called explicitly), `_infer_profile_from_category` parses the SPU's `category_l1/l2/l3` fields and maps them to the best-matching profile. The mapping is keyword-based (e.g. "手机" -> "3c", "狗粮" -> "pet_supplies"); unmatched categories fall back to `apparel` for a smooth transition.
- **Output field trimming**:
  - Because the Elasticsearch mapping stores a single value in `enriched_taxonomy_attributes.value` (no per-language split), non-apparel LLM output is written there directly; apparel uses the dynamic templates `value.zh` and `value.en`. A single `_apply_lang_output` function handles both cases.
- **Code size and maintainability**:
  - Total line count grows slightly (~+180 lines) due to the new category definitions, but conditional branches drop from 5 to 1 (the profile lookup). Adding a category now costs about 3 registry lines plus a ~10-line prompt template, with no changes to the core enrichment loop.

Affected files
- `indexer/product_enrich.py`
- `indexer/product_enrich_prompts.py`
- `indexer/document_transformer.py`
- `indexer/indexer.py`
- `docs/搜索API对接指南-05-索引接口(Indexer).md`
- `docs/搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md`
- `taxonomy.md`
- `tests/test_product_enrich_partial_mode.py` (adapted for multi-profile test cases)
- `tests/test_llm_enrichment_batch_fill.py`
- `tests/test_process_products_batching.py`

Testing
- Ran unit and integration tests: `pytest tests/test_product_enrich_partial_mode.py tests/test_llm_enrichment_batch_fill.py tests/test_process_products_batching.py tests/ci/test_service_api_contracts.py`; all pass (52 passed)
- Manually verified the mixed-catalog scenario: submitting apparel and 3c products together, the enrichment response returns bilingual fields for apparel and `en` only for 3c, with taxonomy fields filled correctly
- Compile check: `py_compile` reports no syntax errors in the modified modules

Notes
- This refactor does not change existing apparel behavior; the API remains backward compatible (requests without a profile are still treated as apparel)
- Adding bilingual support for another category later only requires changing that registry entry's `lang` list and adding a prompt template; no other logic changes
-
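Grouping a mixed batch by profile before building per-group LLM requests can be sketched as below. Registry entries are abbreviated, and the grouping helper is a hypothetical stand-in for the logic inside `_enrich_taxonomy_batch`; indices are kept so responses can be merged back in the original order:

```python
from collections import defaultdict

# Abbreviated registry: each profile declares output languages and a prompt key.
TAXONOMY_PROFILES = {
    "apparel": {"lang": ["zh", "en"], "prompt_key": "apparel"},
    "3c": {"lang": ["en"], "prompt_key": "default"},
}


def group_by_profile(products: list) -> dict:
    """Group (original_index, product) pairs by profile.

    Each group later gets one LLM request with its own prompt; keeping the
    original index lets the caller merge responses back into input order.
    Unknown profiles fall back to "apparel", mirroring the commit's policy.
    """
    groups = defaultdict(list)
    for i, product in enumerate(products):
        profile = product.get("profile")
        if profile not in TAXONOMY_PROFILES:
            profile = "apparel"
        groups[profile].append((i, product))
    return dict(groups)


batch = [
    {"title": "t-shirt", "profile": "apparel"},
    {"title": "phone case", "profile": "3c"},
    {"title": "unknown item", "profile": None},
]
groups = group_by_profile(batch)
print(sorted(groups))  # ['3c', 'apparel']
```

One LLM call per group keeps prompts category-correct while still amortizing request overhead across products of the same profile.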
2. Removed the automatic taxonomy-profile inference logic from build_index_content_fields()
3. All taxonomy profiles now output zh/en, and the per-industry language-switching logic is removed; only an explicitly passed category_taxonomy_profile is accepted
-
Background
- The scripts/ directory mixed service startup, data transformation, performance/load testing, one-off scripts, and historical backup directories
- It carried a lot of leftover intermediate-iteration material, hurting maintainability and onboarding
- Service orchestration has stabilized around the `service_ctl up all` set: tei / cnclip / embedding / embedding-image / translator / reranker / backend / indexer / frontend / eval-web; the default reranker-fine slot is no longer kept

Changes
1. The root scripts/ directory now holds only runtime, ops, environment, and data-processing scripts, with a new scripts/README.md
2. Performance/load/tuning scripts moved wholesale to benchmarks/, with benchmarks/README.md updated accordingly
3. Manual trial scripts moved to tests/manual/, with tests/manual/README.md updated accordingly
4. Removed clearly obsolete content:
   - scripts/indexer__old_2025_11/
   - scripts/start.sh
   - scripts/install_server_deps.sh
5. Fixed paths and stale descriptions in:
   - the root README.md
   - the performance report documents
   - the reranker/translation module docs

Technical details
- Why performance tests do not live under the regular tests/: these scripts depend on live services, GPUs, models, and environmental noise, so they are unsuitable as a stable regression gate; benchmarks/ better fits their role
- tests/manual/ holds only endpoint trial scripts that need manually started dependencies and human inspection of results
- All moved Python scripts pass py_compile syntax checks
- All moved shell scripts pass bash -n syntax checks

Verification
- py_compile: passed
- bash -n: passed
-
- Data transformation moved under scripts/data_import/README.md
- Diagnostics/inspection moved under scripts/inspect/README.md
- Ops helpers moved under scripts/ops/README.md
- The frontend helper service moved to scripts/frontend/frontend_server.py
- Translation model download moved to scripts/translation/download_translation_models.py
- The ad-hoc image embedding backfill script consolidated into scripts/maintenance/embed_tenant_image_urls.py
- The Redis monitoring script merged under redis/, now scripts/redis/monitor_eviction.py

All real call sites were updated to the new locations:
- scripts/start_frontend.sh
- scripts/start_cnclip_service.sh
- scripts/service_ctl.sh
- scripts/setup_translator_venv.sh
- scripts/README.md

Documentation paths referencing these scripts were fixed as well, mainly docs/QUICKSTART.md and translation/README.md.
-
2. Added service_enabled_by_config(): if reranker, reranker-fine, or translator is disabled in the config, `run.sh all` does not start that service
-
Background and problem
- The current coarse/fine ranking relies on `knn_query` and `image_knn_query` scores, but these come from ANN recall; not every document entering rerank_window (160) is recalled by both the text and image vector paths, so some documents score 0, destabilizing the fusion formula.
- Simply enlarging ANN's k cannot guarantee that documents brought in by lexical recall also carry both vector scores; issuing a second query, or pulling vectors back for local computation, adds overhead and implementation complexity.

Solution
Use the ES rescore mechanism: within the `window_size` of the first search, compute an exact vector script_score for each document and attach the scores to `matched_queries` as named queries, which the subsequent coarse/rerank stages prefer.

**Design decisions**:
- **Fill scores only, do not change ordering**: rescore uses `score_mode: total` with `rescore_query_weight: 0.0`, so the original `_score` is unchanged; this avoids disturbing the existing ranking logic and minimizes risk.
- **Named exact scores**: `exact_text_knn_query` and `exact_image_knn_query`, making them easy for clients to identify and fall back from.
- **Configurable**: the `exact_knn_rescore_enabled` switch and `exact_knn_rescore_window` control the window size, default 160.

Implementation details
1. Config extension (`config/config.yaml`, `config/loader.py`)
```yaml
exact_knn_rescore_enabled: true
exact_knn_rescore_window: 160
```
The new settings are injected into `RerankConfig`.
2. Searcher builds the rescore query (`search/searcher.py`)
- In `_build_es_search_request`, when `enable_rerank=True` and the switch is on, build a rescore object:
  - `window_size` = `exact_knn_rescore_window`
  - `query` is a `bool` query embedding two `script_score` subqueries that compute dot-product similarity for the text and image vectors:
```painless
// exact_text_knn_query
(dotProduct(params.query_vector, 'title_embedding') + 1.0) / 2.0
// exact_image_knn_query
(dotProduct(params.image_query_vector, 'image_embedding.vector') + 1.0) / 2.0
```
- Each `script_score` sets `_name` to its named query.
- Note: the current script scores are **not yet multiplied by knn_text_boost / knn_image_boost**; aligning their scale with the original ANN scores is a follow-up item.
3. RerankClient prefers exact scores (`search/rerank_client.py`)
- `_extract_coarse_signals` reads `exact_text_knn_query` and `exact_image_knn_query` from the document's `matched_queries`.
- When present and valid, they are used as `text_knn_score` / `image_knn_score`, marking `text_knn_source='exact_text_knn_query'`.
- Otherwise, fall back to the existing `knn_query` / `image_knn_query` (ANN scores).
- The raw ANN scores are kept as `approx_text_knn_score` / `approx_image_knn_score` for debugging comparisons.
4. Debug info
- `debug_info.per_result[*].ranking_funnel.coarse_rank.signals` now outputs the exact scores, fallback scores, and source markers, making coverage and value distribution observable in production.

Verification
- Unit tests `tests/test_rerank_client.py` and `tests/test_search_rerank_window.py` pass, covering exact-score priority, config parsing, and the ES request body structure.
- Sampling real production queries (6 queries, top160) shows:
  - **exact coverage reaches 100%** (both text and image scores present), fixing the partial ANN coverage problem.
  - However, exact scores differ in magnitude from the raw ANN scores (median ANN/exact ratio ~4.1x) because the exact script omits the boost factors.
  - Current ranking impact: coarse top10 overlap drops to as low as 1/10, with maximum rank drift over 100.

Next steps
1. Align exact and ANN score scales: multiply by `knn_text_boost` / `knn_image_boost` in the script_score, plus an extra 1.4x for long queries.
2. Re-evaluate top10 overlap and drift; if it converges, the coarse fusion formula can move entirely into the ES rescore stage.
3. The current version keeps the safe "fill scores only, do not change ordering" policy and already fixes the core missing-score problem.

Files
- `config/config.yaml`
- `config/loader.py`
- `search/searcher.py`
- `search/rerank_client.py`
- `tests/test_rerank_client.py`
- `tests/test_search_rerank_window.py`
-
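A sketch of the rescore object described above, built as a plain request dict. The overall shape follows the ES rescore API and the commit's design (named script_score clauses, `score_mode: total`, weight 0.0); the helper name and exact nesting inside `_build_es_search_request` are assumptions:

```python
def build_exact_knn_rescore(query_vector, image_query_vector, window_size=160):
    """Rescore stage that only attaches named exact scores (weight 0.0)."""
    def exact_clause(name, field, vec_param, vec):
        # Each clause is a named script_score computing a normalized dot product.
        return {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": f"(dotProduct(params.{vec_param}, '{field}') + 1.0) / 2.0",
                    "params": {vec_param: vec},
                },
                "_name": name,
            }
        }

    return {
        "window_size": window_size,
        "query": {
            "rescore_query": {
                "bool": {
                    "should": [
                        exact_clause("exact_text_knn_query", "title_embedding",
                                     "query_vector", query_vector),
                        exact_clause("exact_image_knn_query", "image_embedding.vector",
                                     "image_query_vector", image_query_vector),
                    ]
                }
            },
            # Fill scores only: original _score is preserved.
            "score_mode": "total",
            "rescore_query_weight": 0.0,
        },
    }


rescore = build_exact_knn_rescore([0.1, 0.2], [0.3, 0.4])
```

With `rescore_query_weight: 0.0`, the rescore phase contributes nothing to `_score`; the exact values surface only through the named queries in each hit's `matched_queries`.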
Changes
1. **New config settings** (`config/config.yaml`)
   - `exact_knn_rescore_enabled`: whether exact vector rescoring is on, default true
   - `exact_knn_rescore_window`: rescore window size, default 160 (decoupled from rerank_window, independently configurable)
2. **ES query layer** (`search/searcher.py`)
   - The first ES search now injects a rescore phase over the window_size documents when the config enables it
   - The rescore_query contains two named script_score clauses:
     - `exact_text_knn_query`: exact dot product over the text vector
     - `exact_image_knn_query`: exact dot product over the image vector
   - Currently `score_mode=total` with `rescore_query_weight=0.0`: **scores are filled in without changing the order**, and the exact scores appear only in `matched_queries`
3. **Unified vector-score boost logic** (`search/es_query_builder.py`)
   - Added `_get_knn_plan()` to centralize the text/image KNN boost rules
   - Long queries (token count above threshold) multiply the text boost by an extra 1.4x
   - Exact rescore and ANN recall **share the same boost rules**, keeping score scales consistent
   - The existing ANN query-building logic migrated to this unified entry point
4. **Fusion-stage score priority** (`search/rerank_client.py`)
   - `_build_hit_signal_bundle()` centralizes vector-score reads
   - Prefers `exact_text_knn_query` / `exact_image_knn_query` from `matched_queries`
   - Falls back to `knn_query` / `image_knn_query` (ANN scores) when absent
   - Covers the coarse_rank, fine_rank, and rerank stages, avoiding repeated patches
5. **Test coverage**
   - `tests/test_es_query_builder.py`: ANN and exact share the boost rules
   - `tests/test_search_rerank_window.py`: rescore window and named query injection
   - `tests/test_rerank_client.py`: exact-first, ANN-fallback logic

Technical details
- **Exact vector script** (Painless)
```painless
// text: (dotProduct + 1.0) / 2.0
(dotProduct(params.query_vector, 'title_embedding') + 1.0) / 2.0
// image: same form, over 'image_embedding.vector'
```
  multiplied by the unified boost (from `knn_text_boost` / `knn_image_boost` plus the long-query factor).
- **Named query score retention**
  - The main query already sets `include_named_queries_score: true`
  - Scores from the named rescore scripts merge into each hit's `matched_queries`
  - `_extract_named_score()` extracts them by name, identical to how the raw ANN scores are accessed
- **Performance impact** (top160, 6 real queries, 3 rounds averaged after warm-up)
  - `elasticsearch_search_primary` latency: 124.71ms → 136.60ms (+11.89ms, +9.53%)
  - `total_search` is dominated by other components' jitter and is not a primary reference
  - The overhead is acceptable; no timeouts or resource bottlenecks observed

Config example
```yaml
search:
  exact_knn_rescore_enabled: true
  exact_knn_rescore_window: 160
  knn_text_boost: 4.0
  knn_image_boost: 4.0
  long_query_token_threshold: 8
  long_query_text_boost_factor: 1.4
```

Known issues and next steps
- Tuning experiments show that with exact rescore enabled, some queries (strong type constraints plus many similar styles/colors) lose about 0.031 on the main metric versus the baseline (exact=false): 0.6009 → 0.5697
- Root cause: exact turns KNN from a sparse auxiliary signal into a dense ranking factor, changing the coarse-stage ranking semantics; tuning the existing `knn_bias/exponent` alone cannot fully recover it
- Next iteration: **do not force exact in the coarse stage**; prefer exact only in fine/rerank, or have coarse use an "ANN first, exact only fills gaps" strategy, then re-evaluate

Files
- `config/config.yaml`
- `search/searcher.py`
- `search/es_query_builder.py`
- `search/rerank_client.py`
- `tests/test_es_query_builder.py`
- `tests/test_search_rerank_window.py`
- `tests/test_rerank_client.py`
- `scripts/evaluation/exact_rescore_coarse_tuning_round2.json` (tuning experiment log)
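The exact-first, ANN-fallback read described in item 4 can be sketched as below. The hit structure is simplified to a flat `matched_queries` dict, and the helper name is a stand-in for `_build_hit_signal_bundle()`; the field names follow the commit message:

```python
def extract_vector_scores(matched_queries: dict) -> dict:
    """Prefer exact rescore scores; fall back to ANN scores when missing."""
    def pick(exact_name, ann_name):
        exact = matched_queries.get(exact_name)
        if exact is not None:
            return exact, exact_name
        return matched_queries.get(ann_name, 0.0), ann_name

    text_score, text_src = pick("exact_text_knn_query", "knn_query")
    image_score, image_src = pick("exact_image_knn_query", "image_knn_query")
    return {
        "text_knn_score": text_score,
        "text_knn_source": text_src,
        "image_knn_score": image_score,
        "image_knn_source": image_src,
        # Raw ANN scores are kept for debugging comparisons.
        "approx_text_knn_score": matched_queries.get("knn_query", 0.0),
        "approx_image_knn_score": matched_queries.get("image_knn_query", 0.0),
    }


signals = extract_vector_scores({"exact_text_knn_query": 0.92, "knn_query": 0.88})
print(signals["text_knn_source"])  # exact_text_knn_query
```

Recording the source name alongside each score is what makes the coverage and fallback behavior observable in `debug_info`, and it is also the hook for the planned "ANN first, exact only fills gaps" coarse-stage variant.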
137 changed files (to preserve performance, only 100 of 137 files are displayed)
.env.example
CLAUDE.md
| ... | ... | @@ -77,9 +77,11 @@ source activate.sh |
| 77 | 77 | # Generate test data (Tenant1 Mock + Tenant2 CSV) |
| 78 | 78 | ./scripts/mock_data.sh |
| 79 | 79 | |
| 80 | -# Ingest data to Elasticsearch | |
| 81 | -./scripts/ingest.sh <tenant_id> [recreate] # e.g., ./scripts/ingest.sh 1 true | |
| 82 | -python main.py ingest data.csv --limit 1000 --batch-size 50 | |
| 80 | +# Create tenant index structure | |
| 81 | +./scripts/create_tenant_index.sh <tenant_id> | |
| 82 | + | |
| 83 | +# Build / refresh suggestion index | |
| 84 | +./scripts/build_suggestions.sh <tenant_id> --mode incremental | |
| 83 | 85 | ``` |
| 84 | 86 | |
| 85 | 87 | ### Running Services |
| ... | ... | @@ -100,10 +102,10 @@ python main.py serve --host 0.0.0.0 --port 6002 --reload |
| 100 | 102 | # Run all tests |
| 101 | 103 | pytest tests/ |
| 102 | 104 | |
| 103 | -# Run specific test types | |
| 104 | -pytest tests/unit/ # Unit tests | |
| 105 | -pytest tests/integration/ # Integration tests | |
| 106 | -pytest -m "api" # API tests only | |
| 105 | +# Run focused regression sets | |
| 106 | +python -m pytest tests/ci -q | |
| 107 | +pytest tests/test_rerank_client.py | |
| 108 | +pytest tests/test_query_parser_mixed_language.py | |
| 107 | 109 | |
| 108 | 110 | # Test search from command line |
| 109 | 111 | python main.py search "query" --tenant-id 1 --size 10 |
| ... | ... | @@ -114,12 +116,8 @@ python main.py search "query" --tenant-id 1 --size 10 |
| 114 | 116 | # Stop all services |
| 115 | 117 | ./scripts/stop.sh |
| 116 | 118 | |
| 117 | -# Test environment (for CI/development) | |
| 118 | -./scripts/start_test_environment.sh | |
| 119 | -./scripts/stop_test_environment.sh | |
| 120 | - | |
| 121 | -# Install server dependencies | |
| 122 | -./scripts/install_server_deps.sh | |
| 119 | +# Run CI contract tests | |
| 120 | +./scripts/run_ci_tests.sh | |
| 123 | 121 | ``` |
| 124 | 122 | |
| 125 | 123 | ## Architecture Overview |
| ... | ... | @@ -585,7 +583,7 @@ GET /admin/stats # Index statistics |
| 585 | 583 | ./scripts/start_frontend.sh # Frontend UI (port 6003) |
| 586 | 584 | |
| 587 | 585 | # Data Operations |
| 588 | -./scripts/ingest.sh <tenant_id> [recreate] # Index data | |
| 586 | +./scripts/create_tenant_index.sh <tenant_id> # Create tenant index | |
| 589 | 587 | ./scripts/mock_data.sh # Generate test data |
| 590 | 588 | |
| 591 | 589 | # Testing | ... | ... |
api/models.py
| ... | ... | @@ -154,7 +154,8 @@ class SearchRequest(BaseModel): |
| 154 | 154 | enable_rerank: Optional[bool] = Field( |
| 155 | 155 | None, |
| 156 | 156 | description=( |
| 157 | - "是否开启重排(调用外部重排服务对 ES 结果进行二次排序)。" | |
| 157 | + "是否开启最终重排(调用外部 rerank 服务改写上一阶段顺序)。" | |
| 158 | + "关闭时仍保留 coarse/fine 流程,仅在 rerank 阶段保序透传。" | |
| 158 | 159 | "不传则使用服务端配置 rerank.enabled(默认开启)。" |
| 159 | 160 | ) |
| 160 | 161 | ) | ... | ... |
api/routes/indexer.py
| ... | ... | @@ -7,7 +7,7 @@ |
| 7 | 7 | import asyncio |
| 8 | 8 | import re |
| 9 | 9 | from fastapi import APIRouter, HTTPException |
| 10 | -from typing import Any, Dict, List, Optional | |
| 10 | +from typing import Any, Dict, List, Literal, Optional | |
| 11 | 11 | from pydantic import BaseModel, Field |
| 12 | 12 | import logging |
| 13 | 13 | from sqlalchemy import text |
| ... | ... | @@ -19,6 +19,11 @@ logger = logging.getLogger(__name__) |
| 19 | 19 | |
| 20 | 20 | router = APIRouter(prefix="/indexer", tags=["indexer"]) |
| 21 | 21 | |
| 22 | +SUPPORTED_CATEGORY_TAXONOMY_PROFILES = ( | |
| 23 | + "apparel, 3c, bags, pet_supplies, electronics, outdoor, " | |
| 24 | + "home_appliances, home_living, wigs, beauty, accessories, toys, shoes, sports, others" | |
| 25 | +) | |
| 26 | + | |
| 22 | 27 | |
| 23 | 28 | class ReindexRequest(BaseModel): |
| 24 | 29 | """全量重建索引请求""" |
| ... | ... | @@ -88,11 +93,42 @@ class EnrichContentItem(BaseModel): |
| 88 | 93 | |
| 89 | 94 | class EnrichContentRequest(BaseModel): |
| 90 | 95 | """ |
| 91 | - 内容理解字段生成请求:根据商品标题批量生成 qanchors、enriched_attributes、tags。 | |
| 96 | + 内容理解字段生成请求:根据商品标题批量生成通用增强字段与品类 taxonomy 字段。 | |
| 92 | 97 | 供外部 indexer 在自行组织 doc 时调用,与翻译、向量化等微服务并列。 |
| 93 | 98 | """ |
| 94 | 99 | tenant_id: str = Field(..., description="租户 ID,用于请求路由与结果归属,不参与缓存键") |
| 95 | 100 | items: List[EnrichContentItem] = Field(..., description="待分析的 SPU 列表(spu_id + title,可附带 brief/description/image_url)") |
| 101 | + enrichment_scopes: Optional[List[Literal["generic", "category_taxonomy"]]] = Field( | |
| 102 | + default=None, | |
| 103 | + description=( | |
| 104 | + "要执行的增强范围。" | |
| 105 | + "`generic` 返回 qanchors/enriched_tags/enriched_attributes;" | |
| 106 | + "`category_taxonomy` 返回 enriched_taxonomy_attributes。" | |
| 107 | + "默认两者都执行。" | |
| 108 | + ), | |
| 109 | + ) | |
| 110 | + category_taxonomy_profile: str = Field( | |
| 111 | + "apparel", | |
| 112 | + description=( | |
| 113 | + "品类 taxonomy profile。默认 `apparel`。" | |
| 114 | + f"当前支持:{SUPPORTED_CATEGORY_TAXONOMY_PROFILES}。" | |
| 115 | + "其中除 `apparel` 外,其余 profile 的 taxonomy 输出仅返回 `en`。" | |
| 116 | + ), | |
| 117 | + ) | |
| 118 | + analysis_kinds: Optional[List[Literal["content", "taxonomy"]]] = Field( | |
| 119 | + default=None, | |
| 120 | + description="Deprecated alias of enrichment_scopes. `content` -> `generic`, `taxonomy` -> `category_taxonomy`.", | |
| 121 | + ) | |
| 122 | + | |
| 123 | + def resolved_enrichment_scopes(self) -> List[str]: | |
| 124 | + if self.enrichment_scopes: | |
| 125 | + return list(self.enrichment_scopes) | |
| 126 | + if self.analysis_kinds: | |
| 127 | + mapped = [] | |
| 128 | + for item in self.analysis_kinds: | |
| 129 | + mapped.append("generic" if item == "content" else "category_taxonomy") | |
| 130 | + return mapped | |
| 131 | + return ["generic", "category_taxonomy"] | |
| 96 | 132 | |
| 97 | 133 | |
| 98 | 134 | @router.post("/reindex") |
| ... | ... | @@ -440,20 +476,31 @@ async def build_docs_from_db(request: BuildDocsFromDbRequest): |
| 440 | 476 | raise HTTPException(status_code=500, detail=f"Internal server error: {str(e)}") |
| 441 | 477 | |
| 442 | 478 | |
| 443 | -def _run_enrich_content(tenant_id: str, items: List[Dict[str, str]]) -> List[Dict[str, Any]]: | |
| 479 | +def _run_enrich_content( | |
| 480 | + tenant_id: str, | |
| 481 | + items: List[Dict[str, str]], | |
| 482 | + enrichment_scopes: Optional[List[str]] = None, | |
| 483 | + category_taxonomy_profile: str = "apparel", | |
| 484 | +) -> List[Dict[str, Any]]: | |
| 444 | 485 | """ |
| 445 | 486 | 同步执行内容理解,返回与 ES mapping 对齐的字段结构。 |
| 446 | 487 | 语言策略由 product_enrich 内部统一决定,路由层不参与。 |
| 447 | 488 | """ |
| 448 | 489 | from indexer.product_enrich import build_index_content_fields |
| 449 | 490 | |
| 450 | - results = build_index_content_fields(items=items, tenant_id=tenant_id) | |
| 491 | + results = build_index_content_fields( | |
| 492 | + items=items, | |
| 493 | + tenant_id=tenant_id, | |
| 494 | + enrichment_scopes=enrichment_scopes, | |
| 495 | + category_taxonomy_profile=category_taxonomy_profile, | |
| 496 | + ) | |
| 451 | 497 | return [ |
| 452 | 498 | { |
| 453 | 499 | "spu_id": item["id"], |
| 454 | 500 | "qanchors": item["qanchors"], |
| 455 | 501 | "enriched_attributes": item["enriched_attributes"], |
| 456 | 502 | "enriched_tags": item["enriched_tags"], |
| 503 | + "enriched_taxonomy_attributes": item["enriched_taxonomy_attributes"], | |
| 457 | 504 | **({"error": item["error"]} if item.get("error") else {}), |
| 458 | 505 | } |
| 459 | 506 | for item in results |
| ... | ... | @@ -463,15 +510,15 @@ def _run_enrich_content(tenant_id: str, items: List[Dict[str, str]]) -> List[Dic |
| 463 | 510 | @router.post("/enrich-content") |
| 464 | 511 | async def enrich_content(request: EnrichContentRequest): |
| 465 | 512 | """ |
| 466 | - 内容理解字段生成接口:根据商品标题批量生成 qanchors、enriched_attributes、tags。 | |
| 513 | + 内容理解字段生成接口:根据商品标题批量生成通用增强字段与品类 taxonomy 字段。 | |
| 467 | 514 | |
| 468 | 515 | 使用场景: |
| 469 | 516 | - 外部 indexer 采用「微服务组合」方式自己组织 doc 时,可调用本接口获取 LLM 生成的 |
| 470 | 517 | 锚文本与语义属性,再与翻译、向量化结果合并写入 ES。 |
| 471 | 518 | - 与 /indexer/build-docs 解耦,避免 build-docs 因 LLM 耗时过长而阻塞;调用方可 |
| 472 | - 先拿不含 qanchors/enriched_tags 的 doc,再异步或离线补齐本接口结果后更新 ES。 | |
| 519 | + 先拿不含 qanchors/enriched_tags/taxonomy attributes 的 doc,再异步或离线补齐本接口结果后更新 ES。 | |
| 473 | 520 | |
| 474 | - 实现逻辑与 indexer.product_enrich.analyze_products 一致,支持多语言与 Redis 缓存。 | |
| 521 | + 实现逻辑与 indexer.product_enrich.build_index_content_fields 一致,支持多语言与 Redis 缓存。 | |
| 475 | 522 | """ |
| 476 | 523 | try: |
| 477 | 524 | if not request.items: |
| ... | ... | @@ -493,15 +540,20 @@ async def enrich_content(request: EnrichContentRequest): |
| 493 | 540 | for it in request.items |
| 494 | 541 | ] |
| 495 | 542 | loop = asyncio.get_event_loop() |
| 543 | + enrichment_scopes = request.resolved_enrichment_scopes() | |
| 496 | 544 | result = await loop.run_in_executor( |
| 497 | 545 | None, |
| 498 | 546 | lambda: _run_enrich_content( |
| 499 | 547 | tenant_id=request.tenant_id, |
| 500 | - items=items_payload | |
| 548 | + items=items_payload, | |
| 549 | + enrichment_scopes=enrichment_scopes, | |
| 550 | + category_taxonomy_profile=request.category_taxonomy_profile, | |
| 501 | 551 | ), |
| 502 | 552 | ) |
| 503 | 553 | return { |
| 504 | 554 | "tenant_id": request.tenant_id, |
| 555 | + "enrichment_scopes": enrichment_scopes, | |
| 556 | + "category_taxonomy_profile": request.category_taxonomy_profile, | |
| 505 | 557 | "results": result, |
| 506 | 558 | "total": len(result), |
| 507 | 559 | } | ... | ... |
api/translator_app.py
| ... | ... | @@ -271,16 +271,20 @@ async def lifespan(_: FastAPI): |
| 271 | 271 | """Initialize all enabled translation backends on process startup.""" |
| 272 | 272 | logger.info("Starting Translation Service API") |
| 273 | 273 | service = get_translation_service() |
| 274 | + failed_models = list(getattr(service, "failed_models", [])) | |
| 275 | + backend_errors = dict(getattr(service, "backend_errors", {})) | |
| 274 | 276 | logger.info( |
| 275 | - "Translation service ready | default_model=%s default_scene=%s available_models=%s loaded_models=%s", | |
| 277 | + "Translation service ready | default_model=%s default_scene=%s available_models=%s loaded_models=%s failed_models=%s", | |
| 276 | 278 | service.config["default_model"], |
| 277 | 279 | service.config["default_scene"], |
| 278 | 280 | service.available_models, |
| 279 | 281 | service.loaded_models, |
| 282 | + failed_models, | |
| 280 | 283 | ) |
| 281 | 284 | logger.info( |
| 282 | - "Translation backends initialized on startup | models=%s", | |
| 285 | + "Translation backends initialized on startup | loaded=%s failed=%s", | |
| 283 | 286 | service.loaded_models, |
| 287 | + backend_errors, | |
| 284 | 288 | ) |
| 285 | 289 | verbose_logger.info( |
| 286 | 290 | "Translation startup detail | capabilities=%s cache_ttl_seconds=%s cache_sliding_expiration=%s", |
| ... | ... | @@ -316,11 +320,14 @@ async def health_check(): |
| 316 | 320 | """Health check endpoint.""" |
| 317 | 321 | try: |
| 318 | 322 | service = get_translation_service() |
| 323 | + failed_models = list(getattr(service, "failed_models", [])) | |
| 324 | + backend_errors = dict(getattr(service, "backend_errors", {})) | |
| 319 | 325 | logger.info( |
| 320 | - "Health check | default_model=%s default_scene=%s loaded_models=%s", | |
| 326 | + "Health check | default_model=%s default_scene=%s loaded_models=%s failed_models=%s", | |
| 321 | 327 | service.config["default_model"], |
| 322 | 328 | service.config["default_scene"], |
| 323 | 329 | service.loaded_models, |
| 330 | + failed_models, | |
| 324 | 331 | ) |
| 325 | 332 | return { |
| 326 | 333 | "status": "healthy", |
| ... | ... | @@ -330,6 +337,8 @@ async def health_check(): |
| 330 | 337 | "available_models": service.available_models, |
| 331 | 338 | "enabled_capabilities": get_enabled_translation_models(service.config), |
| 332 | 339 | "loaded_models": service.loaded_models, |
| 340 | + "failed_models": failed_models, | |
| 341 | + "backend_errors": backend_errors, | |
| 333 | 342 | } |
| 334 | 343 | except Exception as e: |
| 335 | 344 | logger.error(f"Health check failed: {e}") |
| ... | ... | @@ -463,6 +472,10 @@ async def translate(request: TranslationRequest, http_request: Request): |
| 463 | 472 | latency_ms = (time.perf_counter() - request_started) * 1000 |
| 464 | 473 | logger.warning("Translation validation error | error=%s latency_ms=%.2f", e, latency_ms) |
| 465 | 474 | raise HTTPException(status_code=400, detail=str(e)) from e |
| 475 | + except RuntimeError as e: | |
| 476 | + latency_ms = (time.perf_counter() - request_started) * 1000 | |
| 477 | + logger.warning("Translation backend unavailable | error=%s latency_ms=%.2f", e, latency_ms) | |
| 478 | + raise HTTPException(status_code=503, detail=str(e)) from e | |
| 466 | 479 | except Exception as e: |
| 467 | 480 | latency_ms = (time.perf_counter() - request_started) * 1000 |
| 468 | 481 | logger.error("Translation error | error=%s latency_ms=%.2f", e, latency_ms, exc_info=True) | ... | ... |
| ... | ... | @@ -0,0 +1,17 @@ |
| 1 | +# Benchmarks | |
| 2 | + | |
| 3 | +基准压测脚本统一放在 `benchmarks/`,不再和 `scripts/` 里的服务启动/运维脚本混放。 | |
| 4 | + | |
| 5 | +目录约定: | |
| 6 | + | |
| 7 | +- `benchmarks/perf_api_benchmark.py`:通用 HTTP 接口压测入口 | |
| 8 | +- `benchmarks/reranker/`:reranker 定向 benchmark、smoke、手工对比脚本 | |
| 9 | +- `benchmarks/translation/`:translation 本地模型 benchmark | |
| 10 | + | |
| 11 | +这些脚本默认不是 CI 测试的一部分,因为它们通常具备以下特征: | |
| 12 | + | |
| 13 | +- 依赖真实服务、GPU、模型或特定数据集 | |
| 14 | +- 结果受机器配置和运行时负载影响,不适合作为稳定回归门禁 | |
| 15 | +- 更多用于容量评估、调参和问题复现,而不是功能正确性判定 | |
| 16 | + | |
| 17 | +如果某个性能场景需要进入自动化回归,应新增到 `tests/` 下并明确收敛输入、环境和判定阈值,而不是直接复用这里的基准脚本。 | ... | ... |
scripts/perf_api_benchmark.py renamed to benchmarks/perf_api_benchmark.py
| ... | ... | @@ -11,13 +11,13 @@ Default scenarios (aligned with docs/搜索API对接指南 分册,如 -01 / -0 |
| 11 | 11 | - rerank POST /rerank |
| 12 | 12 | |
| 13 | 13 | Examples: |
| 14 | - python scripts/perf_api_benchmark.py --scenario backend_search --duration 30 --concurrency 20 --tenant-id 162 | |
| 15 | - python scripts/perf_api_benchmark.py --scenario backend_suggest --duration 30 --concurrency 50 --tenant-id 162 | |
| 16 | - python scripts/perf_api_benchmark.py --scenario all --duration 60 --concurrency 80 --tenant-id 162 | |
| 17 | - python scripts/perf_api_benchmark.py --scenario all --cases-file scripts/perf_cases.json.example --output perf_result.json | |
| 14 | + python benchmarks/perf_api_benchmark.py --scenario backend_search --duration 30 --concurrency 20 --tenant-id 162 | |
| 15 | + python benchmarks/perf_api_benchmark.py --scenario backend_suggest --duration 30 --concurrency 50 --tenant-id 162 | |
| 16 | + python benchmarks/perf_api_benchmark.py --scenario all --duration 60 --concurrency 80 --tenant-id 162 | |
| 17 | + python benchmarks/perf_api_benchmark.py --scenario all --cases-file benchmarks/perf_cases.json.example --output perf_result.json | |
| 18 | 18 | # Embedding admission / priority (query param `priority`; same semantics as embedding service): |
| 19 | - python scripts/perf_api_benchmark.py --scenario embed_text --embed-text-priority 1 --duration 30 --concurrency 20 | |
| 20 | - python scripts/perf_api_benchmark.py --scenario embed_image --embed-image-priority 1 --duration 30 --concurrency 10 | |
| 19 | + python benchmarks/perf_api_benchmark.py --scenario embed_text --embed-text-priority 1 --duration 30 --concurrency 20 | |
| 20 | + python benchmarks/perf_api_benchmark.py --scenario embed_image --embed-image-priority 1 --duration 30 --concurrency 10 | |
| 21 | 21 | """ |
| 22 | 22 | |
| 23 | 23 | from __future__ import annotations |
| ... | ... | @@ -229,7 +229,7 @@ def apply_embed_priority_params( |
| 229 | 229 | ) -> None: |
| 230 | 230 | """ |
| 231 | 231 | Merge default `priority` query param into embed templates when absent. |
| 232 | - `scripts/perf_cases.json` may set per-request `params.priority` to override. | |
| 232 | + `benchmarks/perf_cases.json` may set per-request `params.priority` to override. | |
| 233 | 233 | """ |
| 234 | 234 | mapping = { |
| 235 | 235 | "embed_text": max(0, int(embed_text_priority)), | ... | ... |
scripts/perf_cases.json.example renamed to benchmarks/perf_cases.json.example
scripts/benchmark_reranker_1000docs.sh renamed to benchmarks/reranker/benchmark_reranker_1000docs.sh
| ... | ... | @@ -8,7 +8,7 @@ |
| 8 | 8 | # Outputs JSON reports under perf_reports/<date>/reranker_1000docs/ |
| 9 | 9 | # |
| 10 | 10 | # Usage: |
| 11 | -# ./scripts/benchmark_reranker_1000docs.sh | |
| 11 | +# ./benchmarks/reranker/benchmark_reranker_1000docs.sh | |
| 12 | 12 | # Optional env: |
| 13 | 13 | # BATCH_SIZES="24 32 48 64" |
| 14 | 14 | # C1_REQUESTS=4 |
| ... | ... | @@ -85,7 +85,7 @@ run_bench() { |
| 85 | 85 | local c="$2" |
| 86 | 86 | local req="$3" |
| 87 | 87 | local out="${OUT_DIR}/rerank_bs${bs}_c${c}_r${req}.json" |
| 88 | - .venv/bin/python scripts/perf_api_benchmark.py \ | |
| 88 | + .venv/bin/python benchmarks/perf_api_benchmark.py \ | |
| 89 | 89 | --scenario rerank \ |
| 90 | 90 | --tenant-id "${TENANT_ID}" \ |
| 91 | 91 | --reranker-base "${RERANK_BASE}" \ | ... | ... |
scripts/benchmark_reranker_gguf_local.py renamed to benchmarks/reranker/benchmark_reranker_gguf_local.py
| ... | ... | @@ -8,8 +8,8 @@ Runs the backend directly in a fresh process per config to measure: |
| 8 | 8 | - single-request rerank latency |
| 9 | 9 | |
| 10 | 10 | Example: |
| 11 | - ./.venv-reranker-gguf/bin/python scripts/benchmark_reranker_gguf_local.py | |
| 12 | - ./.venv-reranker-gguf-06b/bin/python scripts/benchmark_reranker_gguf_local.py --backend-name qwen3_gguf_06b --docs 400 | |
| 11 | + ./.venv-reranker-gguf/bin/python benchmarks/reranker/benchmark_reranker_gguf_local.py | |
| 12 | + ./.venv-reranker-gguf-06b/bin/python benchmarks/reranker/benchmark_reranker_gguf_local.py --backend-name qwen3_gguf_06b --docs 400 | |
| 13 | 13 | """ |
| 14 | 14 | |
| 15 | 15 | from __future__ import annotations | ... | ... |
scripts/benchmark_reranker_random_titles.py renamed to benchmarks/reranker/benchmark_reranker_random_titles.py
| ... | ... | @@ -10,10 +10,10 @@ Each invocation runs 3 warmup requests with n=400 first; those are not timed for |
| 10 | 10 | |
| 11 | 11 | Example: |
| 12 | 12 | source activate.sh |
| 13 | - python scripts/benchmark_reranker_random_titles.py 386 | |
| 14 | - python scripts/benchmark_reranker_random_titles.py 40,80,100 | |
| 15 | - python scripts/benchmark_reranker_random_titles.py 40,80,100 --repeat 3 --seed 42 | |
| 16 | - RERANK_BASE=http://127.0.0.1:6007 python scripts/benchmark_reranker_random_titles.py 200 | |
| 13 | + python benchmarks/reranker/benchmark_reranker_random_titles.py 386 | |
| 14 | + python benchmarks/reranker/benchmark_reranker_random_titles.py 40,80,100 | |
| 15 | + python benchmarks/reranker/benchmark_reranker_random_titles.py 40,80,100 --repeat 3 --seed 42 | |
| 16 | + RERANK_BASE=http://127.0.0.1:6007 python benchmarks/reranker/benchmark_reranker_random_titles.py 200 | |
| 17 | 17 | """ |
| 18 | 18 | |
| 19 | 19 | from __future__ import annotations | ... | ... |
tests/reranker_performance/curl1.sh renamed to benchmarks/reranker/manual/curl1.sh
tests/reranker_performance/curl1_simple.sh renamed to benchmarks/reranker/manual/curl1_simple.sh
tests/reranker_performance/curl2.sh renamed to benchmarks/reranker/manual/curl2.sh
tests/reranker_performance/rerank_performance_compare.sh renamed to benchmarks/reranker/manual/rerank_performance_compare.sh
scripts/patch_rerank_vllm_benchmark_config.py renamed to benchmarks/reranker/patch_rerank_vllm_benchmark_config.py
| ... | ... | @@ -73,7 +73,7 @@ def main() -> int: |
| 73 | 73 | p.add_argument( |
| 74 | 74 | "--config", |
| 75 | 75 | type=Path, |
| 76 | - default=Path(__file__).resolve().parent.parent / "config" / "config.yaml", | |
| 76 | + default=Path(__file__).resolve().parents[2] / "config" / "config.yaml", | |
| 77 | 77 | ) |
| 78 | 78 | p.add_argument("--backend", choices=("qwen3_vllm", "qwen3_vllm_score"), required=True) |
| 79 | 79 | p.add_argument( | ... | ... |
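The `parent.parent` → `parents[2]` changes throughout this diff follow from the scripts moving one directory deeper; a quick illustration with placeholder paths:

```python
from pathlib import Path

# Before the move: scripts/<file> sits one level below the repo root.
old_file = Path("/repo/scripts/patch_rerank_vllm_benchmark_config.py")
# After the move: benchmarks/reranker/<file> sits two levels below it.
new_file = Path("/repo/benchmarks/reranker/patch_rerank_vllm_benchmark_config.py")

# parents[0] is the containing directory, parents[1] its parent, and so on,
# so parents[2] from the new location points at the same repo root that
# parent.parent reached from the old location.
assert old_file.parent.parent == Path("/repo")
assert new_file.parents[2] == Path("/repo")

config_path = new_file.parents[2] / "config" / "config.yaml"
```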
scripts/run_reranker_vllm_instruction_benchmark.sh renamed to benchmarks/reranker/run_reranker_vllm_instruction_benchmark.sh
| ... | ... | @@ -55,13 +55,13 @@ run_one() { |
| 55 | 55 | local jf="${OUT_DIR}/${backend}_${fmt}.json" |
| 56 | 56 | |
| 57 | 57 | echo "========== ${tag} ==========" |
| 58 | - "$PYTHON" "${ROOT}/scripts/patch_rerank_vllm_benchmark_config.py" \ | |
| 58 | + "$PYTHON" "${ROOT}/benchmarks/reranker/patch_rerank_vllm_benchmark_config.py" \ | |
| 59 | 59 | --backend "$backend" --instruction-format "$fmt" |
| 60 | 60 | |
| 61 | 61 | "${ROOT}/restart.sh" reranker |
| 62 | 62 | wait_health "$backend" "$fmt" |
| 63 | 63 | |
| 64 | - if ! "$PYTHON" "${ROOT}/scripts/benchmark_reranker_random_titles.py" \ | |
| 64 | + if ! "$PYTHON" "${ROOT}/benchmarks/reranker/benchmark_reranker_random_titles.py" \ | |
| 65 | 65 | 100,200,400,600,800,1000 \ |
| 66 | 66 | --repeat 5 \ |
| 67 | 67 | --seed 42 \ |
| ... | ... | @@ -82,7 +82,7 @@ run_one qwen3_vllm_score compact |
| 82 | 82 | run_one qwen3_vllm_score standard |
| 83 | 83 | |
| 84 | 84 | # Restore repo-default-style rerank settings (score + compact). |
| 85 | -"$PYTHON" "${ROOT}/scripts/patch_rerank_vllm_benchmark_config.py" \ | |
| 85 | +"$PYTHON" "${ROOT}/benchmarks/reranker/patch_rerank_vllm_benchmark_config.py" \ | |
| 86 | 86 | --backend qwen3_vllm_score --instruction-format compact |
| 87 | 87 | "${ROOT}/restart.sh" reranker |
| 88 | 88 | wait_health qwen3_vllm_score compact | ... | ... |
scripts/smoke_qwen3_vllm_score_backend.py renamed to benchmarks/reranker/smoke_qwen3_vllm_score_backend.py
| ... | ... | @@ -3,7 +3,7 @@ |
| 3 | 3 | Smoke test: load Qwen3VLLMScoreRerankerBackend (must run as a file, not stdin — vLLM spawn). |
| 4 | 4 | |
| 5 | 5 | Usage (from repo root, score venv): |
| 6 | - PYTHONPATH=. ./.venv-reranker-score/bin/python scripts/smoke_qwen3_vllm_score_backend.py | |
| 6 | + PYTHONPATH=. ./.venv-reranker-score/bin/python benchmarks/reranker/smoke_qwen3_vllm_score_backend.py | |
| 7 | 7 | |
| 8 | 8 | Same as production: vLLM child processes need the venv's ``bin`` on PATH (for pip's ``ninja`` when |
| 9 | 9 | vLLM auto-selects FLASHINFER on T4/Turing). ``start_reranker.sh`` exports that; this script prepends |
| ... | ... | @@ -20,8 +20,8 @@ import sys |
| 20 | 20 | import sysconfig |
| 21 | 21 | from pathlib import Path |
| 22 | 22 | |
| 23 | -# Repo root on sys.path when run as scripts/smoke_*.py | |
| 24 | -_ROOT = Path(__file__).resolve().parents[1] | |
| 23 | +# Repo root on sys.path when run from benchmarks/reranker/. | |
| 24 | +_ROOT = Path(__file__).resolve().parents[2] | |
| 25 | 25 | if str(_ROOT) not in sys.path: |
| 26 | 26 | sys.path.insert(0, str(_ROOT)) |
| 27 | 27 | ... | ... |
scripts/benchmark_nllb_t4_tuning.py renamed to benchmarks/translation/benchmark_nllb_t4_tuning.py
| ... | ... | @@ -11,12 +11,12 @@ from datetime import datetime |
| 11 | 11 | from pathlib import Path |
| 12 | 12 | from typing import Any, Dict, List, Tuple |
| 13 | 13 | |
| 14 | -PROJECT_ROOT = Path(__file__).resolve().parent.parent | |
| 14 | +PROJECT_ROOT = Path(__file__).resolve().parents[2] | |
| 15 | 15 | if str(PROJECT_ROOT) not in sys.path: |
| 16 | 16 | sys.path.insert(0, str(PROJECT_ROOT)) |
| 17 | 17 | |
| 18 | 18 | from config.services_config import get_translation_config |
| 19 | -from scripts.benchmark_translation_local_models import ( | |
| 19 | +from benchmarks.translation.benchmark_translation_local_models import ( | |
| 20 | 20 | benchmark_concurrency_case, |
| 21 | 21 | benchmark_serial_case, |
| 22 | 22 | build_environment_info, | ... | ... |
scripts/benchmark_translation_local_models.py renamed to benchmarks/translation/benchmark_translation_local_models.py
| ... | ... | @@ -22,7 +22,7 @@ from typing import Any, Dict, Iterable, List, Sequence |
| 22 | 22 | import torch |
| 23 | 23 | import transformers |
| 24 | 24 | |
| 25 | -PROJECT_ROOT = Path(__file__).resolve().parent.parent | |
| 25 | +PROJECT_ROOT = Path(__file__).resolve().parents[2] | |
| 26 | 26 | if str(PROJECT_ROOT) not in sys.path: |
| 27 | 27 | sys.path.insert(0, str(PROJECT_ROOT)) |
| 28 | 28 | ... | ... |
scripts/benchmark_translation_local_models_focus.py renamed to benchmarks/translation/benchmark_translation_local_models_focus.py
| ... | ... | @@ -11,12 +11,12 @@ from datetime import datetime |
| 11 | 11 | from pathlib import Path |
| 12 | 12 | from typing import Any, Dict, List |
| 13 | 13 | |
| 14 | -PROJECT_ROOT = Path(__file__).resolve().parent.parent | |
| 14 | +PROJECT_ROOT = Path(__file__).resolve().parents[2] | |
| 15 | 15 | if str(PROJECT_ROOT) not in sys.path: |
| 16 | 16 | sys.path.insert(0, str(PROJECT_ROOT)) |
| 17 | 17 | |
| 18 | 18 | from config.services_config import get_translation_config |
| 19 | -from scripts.benchmark_translation_local_models import ( | |
| 19 | +from benchmarks.translation.benchmark_translation_local_models import ( | |
| 20 | 20 | SCENARIOS, |
| 21 | 21 | benchmark_concurrency_case, |
| 22 | 22 | benchmark_serial_case, | ... | ... |
scripts/benchmark_translation_longtext_single.py renamed to benchmarks/translation/benchmark_translation_longtext_single.py
config/config.yaml
| 1 | -# Unified Configuration for Multi-Tenant Search Engine | |
| 2 | -# Unified configuration file; all tenants share a single configuration | |
| 3 | -# Note: the index structure is defined by mappings/search_products.json; this file only configures search behavior | |
| 4 | -# | |
| 5 | -# Convention: the keys below are required; process env vars may override same-named items under infrastructure / runtime | |
| 6 | -# (such as ES_HOST, API_PORT); values in this file are used when the env vars are unset. | |
| 7 | - | |
| 8 | -# Process / bind addresses (env vars APP_ENV, RUNTIME_ENV, ES_INDEX_NAMESPACE may override the semantics of the first two) | |
| 9 | 1 | runtime: |
| 10 | 2 | environment: prod |
| 11 | 3 | index_namespace: '' |
| ... | ... | @@ -21,8 +13,6 @@ runtime: |
| 21 | 13 | translator_port: 6006 |
| 22 | 14 | reranker_host: 0.0.0.0 |
| 23 | 15 | reranker_port: 6007 |
| 24 | - | |
| 25 | -# Infrastructure connections (sensitive items prefer env vars: ES_*, REDIS_*, DB_*, DASHSCOPE_API_KEY, DEEPL_AUTH_KEY) | |
| 26 | 16 | infrastructure: |
| 27 | 17 | elasticsearch: |
| 28 | 18 | host: http://localhost:9200 |
| ... | ... | @@ -49,23 +39,12 @@ infrastructure: |
| 49 | 39 | secrets: |
| 50 | 40 | dashscope_api_key: null |
| 51 | 41 | deepl_auth_key: null |
| 52 | - | |
| 53 | -# Elasticsearch Index | |
| 54 | 42 | es_index_name: search_products |
| 55 | - | |
| 56 | -# Retrieval domains / index list (may be an empty list; every field of each entry must be given explicitly) | |
| 57 | 43 | indexes: [] |
| 58 | - | |
| 59 | -# Config assets | |
| 60 | 44 | assets: |
| 61 | 45 | query_rewrite_dictionary_path: config/dictionaries/query_rewrite.dict |
| 62 | - | |
| 63 | -# Product content understanding (LLM enrich-content) configuration | |
| 64 | 46 | product_enrich: |
| 65 | 47 | max_workers: 40 |
| 66 | - | |
| 67 | -# Offline / web relevance evaluation (scripts/evaluation, eval-web) | |
| 68 | -# Defaults here are used when the CLI passes no explicit args; search_base_url defaults to http://127.0.0.1:{runtime.api_port} when unset | |
| 69 | 48 | search_evaluation: |
| 70 | 49 | artifact_root: artifacts/search_evaluation |
| 71 | 50 | queries_file: scripts/evaluation/queries/queries.txt |
| ... | ... | @@ -74,10 +53,10 @@ search_evaluation: |
| 74 | 53 | search_base_url: '' |
| 75 | 54 | web_host: 0.0.0.0 |
| 76 | 55 | web_port: 6010 |
| 77 | - judge_model: qwen3.5-plus | |
| 56 | + judge_model: qwen3.6-plus | |
| 78 | 57 | judge_enable_thinking: false |
| 79 | 58 | judge_dashscope_batch: false |
| 80 | - intent_model: qwen3-max | |
| 59 | + intent_model: qwen3.6-plus | |
| 81 | 60 | intent_enable_thinking: true |
| 82 | 61 | judge_batch_completion_window: 24h |
| 83 | 62 | judge_batch_poll_interval_sec: 10.0 |
| ... | ... | @@ -98,20 +77,17 @@ search_evaluation: |
| 98 | 77 | rebuild_irrelevant_stop_ratio: 0.799 |
| 99 | 78 | rebuild_irrel_low_combined_stop_ratio: 0.959 |
| 100 | 79 | rebuild_irrelevant_stop_streak: 3 |
| 101 | - | |
| 102 | -# ES Index Settings (basic settings) | |
| 103 | 80 | es_settings: |
| 104 | 81 | number_of_shards: 1 |
| 105 | 82 | number_of_replicas: 0 |
| 106 | 83 | refresh_interval: 30s |
| 107 | 84 | |
| 108 | -# Field weight configuration (field boosts at search time) | |
| 109 | -# Configured by field base name; .{lang} is appended dynamically per retrieval language at query time. | |
| 110 | -# To tune one language separately, an explicit key can also be added (for example title.de: 3.2). | |
| 85 | +# Configured by field base name; .{lang} is appended dynamically per retrieval language at query time | |
| 111 | 86 | field_boosts: |
| 112 | 87 | title: 3.0 |
| 113 | - qanchors: 1.8 | |
| 114 | - enriched_tags: 1.8 | |
| 88 | + # qanchors and enriched_tags also appear in enriched_attributes.value, so their effective weight is their own weight plus that of enriched_attributes.value | |
| 89 | + qanchors: 1.0 | |
| 90 | + enriched_tags: 1.0 | |
| 115 | 91 | enriched_attributes.value: 1.5 |
| 116 | 92 | category_name_text: 2.0 |
| 117 | 93 | category_path: 2.0 |
| ... | ... | @@ -124,38 +100,25 @@ field_boosts: |
| 124 | 100 | description: 1.0 |
| 125 | 101 | vendor: 1.0 |
| 126 | 102 | |
| 127 | -# Query Configuration | |
| 128 | 103 | query_config: |
| 129 | - # Supported languages | |
| 130 | 104 | supported_languages: |
| 131 | 105 | - zh |
| 132 | 106 | - en |
| 133 | 107 | default_language: en |
| 134 | - | |
| 135 | - # Feature switches (the translation switch is controlled by tenant_config) | |
| 136 | 108 | enable_text_embedding: true |
| 137 | 109 | enable_query_rewrite: true |
| 138 | 110 | |
| 139 | - # Query translation models (must match an entry under services.translation.capabilities) | |
| 140 | - # Source language inside the tenant's index_languages: primary recall can hit source-language fields; use the three items below. | |
| 141 | - zh_to_en_model: nllb-200-distilled-600m # "opus-mt-zh-en" | |
| 142 | - en_to_zh_model: nllb-200-distilled-600m # "opus-mt-en-zh" | |
| 143 | - default_translation_model: nllb-200-distilled-600m | |
| 144 | - # zh_to_en_model: deepl | |
| 145 | - # en_to_zh_model: deepl | |
| 146 | - # default_translation_model: deepl | |
| 147 | - # Source language not in index_languages: translation matters more for searchable text and can be set separately (defaults to the group above) | |
| 148 | - zh_to_en_model__source_not_in_index: nllb-200-distilled-600m | |
| 149 | - en_to_zh_model__source_not_in_index: nllb-200-distilled-600m | |
| 150 | - default_translation_model__source_not_in_index: nllb-200-distilled-600m | |
| 151 | - # zh_to_en_model__source_not_in_index: deepl | |
| 152 | - # en_to_zh_model__source_not_in_index: deepl | |
| 153 | - # default_translation_model__source_not_in_index: deepl | |
| 111 | + zh_to_en_model: deepl # nllb-200-distilled-600m | |
| 112 | + en_to_zh_model: deepl | |
| 113 | + default_translation_model: deepl | |
| 114 | + # Translation quality matters more when the source language is not in index_languages, so it is configured separately | |
| 115 | + zh_to_en_model__source_not_in_index: deepl | |
| 116 | + en_to_zh_model__source_not_in_index: deepl | |
| 117 | + default_translation_model__source_not_in_index: deepl | |
| 154 | 118 | |
| 155 | - # Query-parsing stage: translation and the query embedding run concurrently and share one wait budget (ms). | |
| 156 | - # Shorter when the detected language is already in the tenant's index_languages; longer when it is not (translation matters more for recall). | |
| 157 | - translation_embedding_wait_budget_ms_source_in_index: 300 # 80 | |
| 158 | - translation_embedding_wait_budget_ms_source_not_in_index: 400 # 200 | |
| 119 | + # Query-parsing stage: translation and the query embedding run concurrently and share one wait budget (ms) | |
| 120 | + translation_embedding_wait_budget_ms_source_in_index: 300 | |
| 121 | + translation_embedding_wait_budget_ms_source_not_in_index: 400 | |
| 159 | 122 | style_intent: |
| 160 | 123 | enabled: true |
| 161 | 124 | selected_sku_boost: 1.2 |
| ... | ... | @@ -182,17 +145,15 @@ query_config: |
| 182 | 145 | product_title_exclusion: |
| 183 | 146 | enabled: true |
| 184 | 147 | dictionary_path: config/dictionaries/product_title_exclusion.tsv |
| 185 | - | |
| 186 | - # Dynamic multilingual search-field configuration | |
| 187 | - # multilingual_fields expand to the title.{lang}/brief.{lang}/... form; | |
| 188 | - # shared_fields are fields without a language suffix. | |
| 189 | 148 | search_fields: |
| 149 | + # Configured by field base name; .{lang} is appended dynamically per retrieval language at query time | |
| 190 | 150 | multilingual_fields: |
| 191 | 151 | - title |
| 192 | 152 | - keywords |
| 193 | 153 | - qanchors |
| 194 | 154 | - enriched_tags |
| 195 | 155 | - enriched_attributes.value |
| 156 | + # - enriched_taxonomy_attributes.value | |
| 196 | 157 | - option1_values |
| 197 | 158 | - option2_values |
| 198 | 159 | - option3_values |
| ... | ... | @@ -202,13 +163,14 @@ query_config: |
| 202 | 163 | # - description |
| 203 | 164 | # - vendor |
| 204 | 165 | # shared_fields: fields without a language suffix; examples: tags, option1_values, option2_values, option3_values |
| 166 | + | |
| 205 | 167 | shared_fields: null |
| 206 | 168 | core_multilingual_fields: |
| 207 | 169 | - title |
| 208 | 170 | - qanchors |
| 209 | 171 | - category_name_text |
| 210 | 172 | |
| 211 | - # Unified text recall strategy (primary query + translated query) | |
| 173 | + # Text recall (primary query + translated query) | |
| 212 | 174 | text_query_strategy: |
| 213 | 175 | base_minimum_should_match: 60% |
| 214 | 176 | translation_minimum_should_match: 60% |
| ... | ... | @@ -223,14 +185,10 @@ query_config: |
| 223 | 185 | title: 5.0 |
| 224 | 186 | qanchors: 4.0 |
| 225 | 187 | phrase_match_boost: 3.0 |
| 226 | - | |
| 227 | - # Embedding field names | |
| 228 | 188 | text_embedding_field: title_embedding |
| 229 | 189 | image_embedding_field: image_embedding.vector |
| 230 | 190 | |
| 231 | - # Return-field configuration (_source includes) | |
| 232 | - # null returns all fields, [] returns no fields, a list returns only the listed fields | |
| 233 | - # The fields below match api/result_formatter.py (SpuResult population) and search/searcher.py (SKU sorting / main-image replacement) | |
| 191 | + # null returns all fields, [] returns no fields | |
| 234 | 192 | source_fields: |
| 235 | 193 | - spu_id |
| 236 | 194 | - handle |
| ... | ... | @@ -251,6 +209,8 @@ query_config: |
| 251 | 209 | # - qanchors |
| 252 | 210 | # - enriched_tags |
| 253 | 211 | # - enriched_attributes |
| 212 | + # - enriched_taxonomy_attributes.value | |
| 213 | + | |
| 254 | 214 | - min_price |
| 255 | 215 | - compare_at_price |
| 256 | 216 | - image_url |
| ... | ... | @@ -270,26 +230,21 @@ query_config: |
| 270 | 230 | # KNN: separate boost and recall (k / num_candidates) for the text vector and the multimodal (image) vector |
| 271 | 231 | knn_text_boost: 4 |
| 272 | 232 | knn_image_boost: 4 |
| 273 | - | |
| 274 | - # knn_text_num_candidates = k * 3.4 | |
| 275 | 233 | knn_text_k: 160 |
| 276 | - knn_text_num_candidates: 560 | |
| 234 | + knn_text_num_candidates: 560 # k * 3.4 | |
| 277 | 235 | knn_text_k_long: 400 |
| 278 | 236 | knn_text_num_candidates_long: 1200 |
| 279 | 237 | knn_image_k: 400 |
| 280 | 238 | knn_image_num_candidates: 1200 |
| 281 | 239 | |
| 282 | -# Function Score configuration (ES-layer scoring rules) | |
| 283 | 240 | function_score: |
| 284 | 241 | score_mode: sum |
| 285 | 242 | boost_mode: multiply |
| 286 | 243 | functions: [] |
| 287 | - | |
| 288 | -# Coarse ranking configuration (fuses only ES text/vector signals; no model calls) | |
| 289 | 244 | coarse_rank: |
| 290 | 245 | enabled: true |
| 291 | - input_window: 700 | |
| 292 | - output_window: 240 | |
| 246 | + input_window: 480 | |
| 247 | + output_window: 160 | |
| 293 | 248 | fusion: |
| 294 | 249 | es_bias: 10.0 |
| 295 | 250 | es_exponent: 0.05 |
| ... | ... | @@ -301,30 +256,29 @@ coarse_rank: |
| 301 | 256 | knn_text_weight: 1.0 |
| 302 | 257 | knn_image_weight: 2.0 |
| 303 | 258 | knn_tie_breaker: 0.3 |
| 304 | - knn_bias: 0.6 | |
| 305 | - knn_exponent: 0.4 | |
| 306 | - | |
| 307 | -# Fine ranking configuration (lightweight reranker) | |
| 259 | + knn_bias: 0.0 | |
| 260 | + knn_exponent: 5.6 | |
| 261 | + knn_text_exponent: 0.0 | |
| 262 | + knn_image_exponent: 0.0 | |
| 308 | 263 | fine_rank: |
| 309 | - enabled: false | |
| 264 | + enabled: false # order-preserving passthrough when false | |
| 310 | 265 | input_window: 160 |
| 311 | 266 | output_window: 80 |
| 312 | 267 | timeout_sec: 10.0 |
| 313 | 268 | rerank_query_template: '{query}' |
| 314 | 269 | rerank_doc_template: '{title}' |
| 315 | 270 | service_profile: fine |
| 316 | - | |
| 317 | -# Rerank configuration (provider/URL under services.rerank) | |
| 318 | 271 | rerank: |
| 319 | - enabled: true | |
| 272 | + enabled: false # order-preserving passthrough when false | |
| 320 | 273 | rerank_window: 160 |
| 274 | + exact_knn_rescore_enabled: true | |
| 275 | + exact_knn_rescore_window: 160 | |
| 321 | 276 | timeout_sec: 15.0 |
| 322 | 277 | weight_es: 0.4 |
| 323 | 278 | weight_ai: 0.6 |
| 324 | 279 | rerank_query_template: '{query}' |
| 325 | 280 | rerank_doc_template: '{title}' |
| 326 | 281 | service_profile: default |
| 327 | - | |
| 328 | 282 | # Multiplicative fusion: fused = Π (max(score,0) + bias) ** exponent (es / rerank / fine / text / knn) |
| 329 | 283 | # where knn_score first goes through a dis_max layer: |
| 330 | 284 | # max(knn_text_weight * text_knn, knn_image_weight * image_knn) |
| ... | ... | @@ -337,30 +291,28 @@ rerank: |
| 337 | 291 | fine_bias: 0.1 |
| 338 | 292 | fine_exponent: 1.0 |
| 339 | 293 | text_bias: 0.1 |
| 340 | - text_exponent: 0.25 | |
| 341 | 294 | # Weight of base_query_trans_* relative to base_query (see the text dismax fusion in search/rerank_client) |
| 295 | + text_exponent: 0.25 | |
| 342 | 296 | text_translation_weight: 0.8 |
| 343 | 297 | knn_text_weight: 1.0 |
| 344 | 298 | knn_image_weight: 2.0 |
| 345 | 299 | knn_tie_breaker: 0.3 |
| 346 | - knn_bias: 0.6 | |
| 347 | - knn_exponent: 0.4 | |
| 300 | + knn_bias: 0.0 | |
| 301 | + knn_exponent: 5.6 | |
| 348 | 302 | |
| 349 | -# Extensible service/provider registry (single source of configuration) | |
| 350 | 303 | services: |
| 351 | 304 | translation: |
| 352 | 305 | service_url: http://127.0.0.1:6006 |
| 353 | - # default_model: nllb-200-distilled-600m | |
| 354 | 306 | default_model: nllb-200-distilled-600m |
| 355 | 307 | default_scene: general |
| 356 | 308 | timeout_sec: 10.0 |
| 357 | 309 | cache: |
| 358 | 310 | ttl_seconds: 62208000 |
| 359 | 311 | sliding_expiration: true |
| 360 | - # When false, cache keys are exact-match per request model only (ignores model_quality_tiers for lookups). | |
| 361 | - enable_model_quality_tier_cache: true | |
| 312 | + # When false, cache keys are exact-match per request model only (ignores model_quality_tiers for lookups) | |
| 362 | 313 | # Higher tier = better quality. Multiple models may share one tier (the same tier). |
| 363 | 314 | # A request may reuse Redis keys from models with tier > A or tier == A (not from lower tiers). |
| 315 | + enable_model_quality_tier_cache: true | |
| 364 | 316 | model_quality_tiers: |
| 365 | 317 | deepl: 30 |
| 366 | 318 | qwen-mt: 30 |
| ... | ... | @@ -454,13 +406,12 @@ services: |
| 454 | 406 | num_beams: 1 |
| 455 | 407 | use_cache: true |
| 456 | 408 | embedding: |
| 457 | - provider: http # http | |
| 409 | + provider: http | |
| 458 | 410 | providers: |
| 459 | 411 | http: |
| 460 | 412 | text_base_url: http://127.0.0.1:6005 |
| 461 | 413 | image_base_url: http://127.0.0.1:6008 |
| 462 | - # In-service text backend (read at embedding-process startup) | |
| 463 | - backend: tei # tei | local_st | |
| 414 | + backend: tei | |
| 464 | 415 | backends: |
| 465 | 416 | tei: |
| 466 | 417 | base_url: http://127.0.0.1:8080 |
| ... | ... | @@ -500,13 +451,13 @@ services: |
| 500 | 451 | request: |
| 501 | 452 | max_docs: 1000 |
| 502 | 453 | normalize: true |
| 503 | - default_instance: default | |
| 504 | 454 | # Named instances: the same reranker code reads a different port / backend / runtime dir per instance name. |
| 455 | + default_instance: default | |
| 505 | 456 | instances: |
| 506 | 457 | default: |
| 507 | 458 | host: 0.0.0.0 |
| 508 | 459 | port: 6007 |
| 509 | - backend: qwen3_vllm_score | |
| 460 | + backend: bge | |
| 510 | 461 | runtime_dir: ./.runtime/reranker/default |
| 511 | 462 | fine: |
| 512 | 463 | host: 0.0.0.0 |
| ... | ... | @@ -543,6 +494,7 @@ services: |
| 543 | 494 | enforce_eager: false |
| 544 | 495 | infer_batch_size: 100 |
| 545 | 496 | sort_by_doc_length: true |
| 497 | + | |
| 546 | 498 | # standard=_format_instruction__standard (fixed yes/no system); compact=_format_instruction (instruction as system, with Instruct repeated in user) |
| 547 | 499 | instruction_format: standard # compact standard |
| 548 | 500 | # instruction: "Given a query, score the product for relevance" |
| ... | ... | @@ -556,6 +508,7 @@ services: |
| 556 | 508 | # instruction: "Rank products by query with category & style match prioritized" |
| 557 | 509 | # instruction: "Given a fashion shopping query, retrieve relevant products that answer the query" |
| 558 | 510 | instruction: rank products by given query |
| 511 | + | |
| 559 | 512 | # vLLM LLM.score() (cross-encoder scoring). Dedicated high-performance env .venv-reranker-score (pinned vllm 0.18): ./scripts/setup_reranker_venv.sh qwen3_vllm_score |
| 560 | 513 | # Can share the same model_name / HF cache with qwen3_vllm; venvs are separated so vLLM can be upgraded without affecting the generate backend. |
| 561 | 514 | qwen3_vllm_score: |
| ... | ... | @@ -583,15 +536,10 @@ services: |
| 583 | 536 | qwen3_transformers: |
| 584 | 537 | model_name: Qwen/Qwen3-Reranker-0.6B |
| 585 | 538 | instruction: rank products by given query |
| 586 | - # instruction: "Score the product’s relevance to the given query" | |
| 587 | 539 | max_length: 8192 |
| 588 | 540 | batch_size: 64 |
| 589 | 541 | use_fp16: true |
| 590 | - # sdpa: no flash-attn needed by default; switch to flash_attention_2 if flash_attn is installed | |
| 591 | 542 | attn_implementation: sdpa |
| 592 | - # Packed Transformers backend: shared query prefix + custom position_ids/attention_mask. | |
| 593 | - # For 1 query + many short docs (for example 400 product titles), this usually reduces | |
| 594 | - # repeated prefix work and padding waste compared with pairwise batching. | |
| 595 | 543 | qwen3_transformers_packed: |
| 596 | 544 | model_name: Qwen/Qwen3-Reranker-0.6B |
| 597 | 545 | instruction: Rank products by query with category & style match prioritized |
| ... | ... | @@ -600,8 +548,6 @@ services: |
| 600 | 548 | max_docs_per_pack: 0 |
| 601 | 549 | use_fp16: true |
| 602 | 550 | sort_by_doc_length: true |
| 603 | - # Packed mode relies on a custom 4D attention mask. "eager" is the safest default. | |
| 604 | - # If your torch/transformers stack validates it, you can benchmark "sdpa". | |
| 605 | 551 | attn_implementation: eager |
| 606 | 552 | qwen3_gguf: |
| 607 | 553 | repo_id: DevQuasar/Qwen.Qwen3-Reranker-4B-GGUF |
| ... | ... | @@ -609,7 +555,6 @@ services: |
| 609 | 555 | cache_dir: ./model_cache |
| 610 | 556 | local_dir: ./models/reranker/qwen3-reranker-4b-gguf |
| 611 | 557 | instruction: Rank products by query with category & style match prioritized |
| 612 | - # T4 16GB / performance-first config: offload all layers; measured clearly faster than the conservative config | |
| 613 | 558 | n_ctx: 512 |
| 614 | 559 | n_batch: 512 |
| 615 | 560 | n_ubatch: 512 |
| ... | ... | @@ -632,8 +577,6 @@ services: |
| 632 | 577 | cache_dir: ./model_cache |
| 633 | 578 | local_dir: ./models/reranker/qwen3-reranker-0.6b-q8_0-gguf |
| 634 | 579 | instruction: Rank products by query with category & style match prioritized |
| 635 | - # 0.6B GGUF / online rerank baseline: | |
| 636 | - # Measured ~265s per single request for 400 titles, so it suits a low-VRAM functional fallback, not a low-latency online primary route. | |
| 637 | 580 | n_ctx: 256 |
| 638 | 581 | n_batch: 256 |
| 639 | 582 | n_ubatch: 256 |
| ... | ... | @@ -653,20 +596,15 @@ services: |
| 653 | 596 | verbose: false |
| 654 | 597 | dashscope_rerank: |
| 655 | 598 | model_name: qwen3-rerank |
| 656 | - # Choose the endpoint by region: | |
| 657 | - # China: https://dashscope.aliyuncs.com/compatible-api/v1/reranks | |
| 658 | - # Singapore: https://dashscope-intl.aliyuncs.com/compatible-api/v1/reranks | |
| 659 | - # US: https://dashscope-us.aliyuncs.com/compatible-api/v1/reranks | |
| 660 | 599 | endpoint: https://dashscope.aliyuncs.com/compatible-api/v1/reranks |
| 661 | 600 | api_key_env: RERANK_DASHSCOPE_API_KEY_CN |
| 662 | 601 | timeout_sec: 10.0 |
| 663 | - top_n_cap: 0 # 0 means top_n = number of docs in the current request; >0 caps top_n | |
| 664 | - batchsize: 64 # 0 disables; >0 enables concurrent small-batch dispatch (top_n/top_n_cap still apply; global truncation after batching) | |
| 602 | + top_n_cap: 0 # 0 means top_n = number of docs in the current request | |
| 603 | + batchsize: 64 # 0 disables; >0 enables concurrent small-batch dispatch (top_n/top_n_cap still apply; global truncation after batching) | |
| 665 | 604 | instruct: Given a shopping query, rank product titles by relevance |
| 666 | 605 | max_retries: 2 |
| 667 | 606 | retry_backoff_sec: 0.2 |
| 668 | 607 | |
| 669 | -# SPU configuration (enabled; uses nested skus) | |
| 670 | 608 | spu_config: |
| 671 | 609 | enabled: true |
| 672 | 610 | spu_field: spu_id |
| ... | ... | @@ -678,7 +616,6 @@ spu_config: |
| 678 | 616 | - option2 |
| 679 | 617 | - option3 |
| 680 | 618 | |
| 681 | -# Tenant Configuration | |
| 682 | 619 | # Each tenant can configure a primary_language and index_languages (main market languages, merchant-selectable) |
| 683 | 620 | # Default index_languages: [en, zh]; can be any subset of SOURCE_LANG_CODE_MAP.keys() |
| 684 | 621 | tenant_config: | ... | ... |
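The multiplicative-fusion comment kept in the config above (`fused = Π (max(score,0) + bias) ** exponent`, with a dis_max over the weighted text/image KNN scores first) can be sketched as follows. The function names are illustrative, not the production code; the bias/exponent values are the ones from this diff:

```python
def dis_max(scores, tie_breaker):
    # ES-style dis_max: the best score plus tie_breaker times the rest.
    best = max(scores)
    return best + tie_breaker * (sum(scores) - best)

def fuse(components, params):
    # fused = prod((max(score, 0) + bias) ** exponent) over all signals.
    fused = 1.0
    for name, score in components.items():
        bias, exponent = params[name]
        fused *= (max(score, 0.0) + bias) ** exponent
    return fused

# knn_score first reduces the weighted text/image KNN scores with dis_max:
# max(knn_text_weight * text_knn, knn_image_weight * image_knn) + tie_breaker term.
knn_score = dis_max([1.0 * 0.8, 2.0 * 0.3], tie_breaker=0.3)
fused = fuse(
    {"es": 12.5, "knn": knn_score},
    # (es_bias, es_exponent) and (knn_bias, knn_exponent) from this diff.
    {"es": (10.0, 0.05), "knn": (0.0, 5.6)},
)
```

With `knn_exponent` raised from 0.4 to 5.6 and `knn_bias` dropped to 0.0, small differences in the KNN signal now move the fused score much more sharply.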
config/loader.py
| ... | ... | @@ -587,6 +587,14 @@ class AppConfigLoader: |
| 587 | 587 | knn_tie_breaker=float(coarse_fusion_raw.get("knn_tie_breaker", 0.0)), |
| 588 | 588 | knn_bias=float(coarse_fusion_raw.get("knn_bias", 0.6)), |
| 589 | 589 | knn_exponent=float(coarse_fusion_raw.get("knn_exponent", 0.2)), |
| 590 | + knn_text_bias=float( | |
| 591 | + coarse_fusion_raw.get("knn_text_bias", coarse_fusion_raw.get("knn_bias", 0.6)) | |
| 592 | + ), | |
| 593 | + knn_text_exponent=float(coarse_fusion_raw.get("knn_text_exponent", 0.0)), | |
| 594 | + knn_image_bias=float( | |
| 595 | + coarse_fusion_raw.get("knn_image_bias", coarse_fusion_raw.get("knn_bias", 0.6)) | |
| 596 | + ), | |
| 597 | + knn_image_exponent=float(coarse_fusion_raw.get("knn_image_exponent", 0.0)), | |
| 590 | 598 | text_translation_weight=float( |
| 591 | 599 | coarse_fusion_raw.get("text_translation_weight", 0.8) |
| 592 | 600 | ), |
| ... | ... | @@ -608,6 +616,12 @@ class AppConfigLoader: |
| 608 | 616 | rerank=RerankConfig( |
| 609 | 617 | enabled=bool(rerank_cfg.get("enabled", True)), |
| 610 | 618 | rerank_window=int(rerank_cfg.get("rerank_window", 384)), |
| 619 | + exact_knn_rescore_enabled=bool( | |
| 620 | + rerank_cfg.get("exact_knn_rescore_enabled", False) | |
| 621 | + ), | |
| 622 | + exact_knn_rescore_window=int( | |
| 623 | + rerank_cfg.get("exact_knn_rescore_window", 0) | |
| 624 | + ), | |
| 611 | 625 | timeout_sec=float(rerank_cfg.get("timeout_sec", 15.0)), |
| 612 | 626 | weight_es=float(rerank_cfg.get("weight_es", 0.4)), |
| 613 | 627 | weight_ai=float(rerank_cfg.get("weight_ai", 0.6)), |
| ... | ... | @@ -630,6 +644,14 @@ class AppConfigLoader: |
| 630 | 644 | knn_tie_breaker=float(fusion_raw.get("knn_tie_breaker", 0.0)), |
| 631 | 645 | knn_bias=float(fusion_raw.get("knn_bias", 0.6)), |
| 632 | 646 | knn_exponent=float(fusion_raw.get("knn_exponent", 0.2)), |
| 647 | + knn_text_bias=float( | |
| 648 | + fusion_raw.get("knn_text_bias", fusion_raw.get("knn_bias", 0.6)) | |
| 649 | + ), | |
| 650 | + knn_text_exponent=float(fusion_raw.get("knn_text_exponent", 0.0)), | |
| 651 | + knn_image_bias=float( | |
| 652 | + fusion_raw.get("knn_image_bias", fusion_raw.get("knn_bias", 0.6)) | |
| 653 | + ), | |
| 654 | + knn_image_exponent=float(fusion_raw.get("knn_image_exponent", 0.0)), | |
| 633 | 655 | fine_bias=float(fusion_raw.get("fine_bias", 0.00001)), |
| 634 | 656 | fine_exponent=float(fusion_raw.get("fine_exponent", 1.0)), |
| 635 | 657 | text_translation_weight=float( |
| ... | ... | @@ -655,6 +677,14 @@ class AppConfigLoader: |
| 655 | 677 | |
| 656 | 678 | translation_raw = raw.get("translation") if isinstance(raw.get("translation"), dict) else {} |
| 657 | 679 | normalized_translation = build_translation_config(translation_raw) |
| 680 | + local_translation_backends = {"local_nllb", "local_marian"} | |
| 681 | + for capability_name, capability_cfg in normalized_translation["capabilities"].items(): | |
| 682 | + backend_name = str(capability_cfg.get("backend") or "").strip().lower() | |
| 683 | + if backend_name not in local_translation_backends: | |
| 684 | + continue | |
| 685 | + for path_key in ("model_dir", "ct2_model_dir"): | |
| 686 | + if capability_cfg.get(path_key) not in (None, ""): | |
| 687 | + capability_cfg[path_key] = str(self._resolve_project_path_value(capability_cfg[path_key]).resolve()) | |
| 658 | 688 | translation_config = TranslationServiceConfig( |
| 659 | 689 | endpoint=str(normalized_translation["service_url"]).rstrip("/"), |
| 660 | 690 | timeout_sec=float(normalized_translation["timeout_sec"]), |
| ... | ... | @@ -749,7 +779,7 @@ class AppConfigLoader: |
| 749 | 779 | port=port, |
| 750 | 780 | backend=backend_name, |
| 751 | 781 | runtime_dir=( |
| 752 | - str(v) | |
| 782 | + str(self._resolve_project_path_value(v).resolve()) | |
| 753 | 783 | if (v := instance_raw.get("runtime_dir")) not in (None, "") |
| 754 | 784 | else None |
| 755 | 785 | ), |
| ... | ... | @@ -787,6 +817,12 @@ class AppConfigLoader: |
| 787 | 817 | rerank=rerank_config, |
| 788 | 818 | ) |
| 789 | 819 | |
| 820 | + def _resolve_project_path_value(self, value: Any) -> Path: | |
| 821 | + candidate = Path(str(value)).expanduser() | |
| 822 | + if candidate.is_absolute(): | |
| 823 | + return candidate | |
| 824 | + return self.project_root / candidate | |
| 825 | + | |
| 790 | 826 | def _build_tenants_config(self, raw: Dict[str, Any]) -> TenantCatalogConfig: |
| 791 | 827 | if not isinstance(raw, dict): |
| 792 | 828 | raise ConfigurationError("tenant_config must be a mapping") | ... | ... |
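The fallback chain the loader wires for the new per-channel KNN biases (channel-specific key, else the shared `knn_bias`, else the schema default 0.6) can be isolated like this; `resolve_knn_biases` is a hypothetical stand-in for the inline expressions above:

```python
def resolve_knn_biases(fusion_raw):
    """Mirror the loader's fallback chain for per-channel KNN biases.

    A channel-specific bias (knn_text_bias / knn_image_bias) falls back to the
    shared knn_bias, which falls back to the default 0.6. Sketch only; the real
    wiring builds RerankFusionConfig / CoarseRankFusionConfig in config/loader.py.
    """
    shared = float(fusion_raw.get("knn_bias", 0.6))
    return {
        "knn_text_bias": float(fusion_raw.get("knn_text_bias", shared)),
        "knn_image_bias": float(fusion_raw.get("knn_image_bias", shared)),
    }
```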
config/schema.py
| ... | ... | @@ -119,6 +119,18 @@ class RerankFusionConfig: |
| 119 | 119 | knn_tie_breaker: float = 0.0 |
| 120 | 120 | knn_bias: float = 0.6 |
| 121 | 121 | knn_exponent: float = 0.2 |
| 122 | + #: Optional additive floor for the weighted text KNN term. | |
| 123 | + #: Falls back to knn_bias when omitted in config loading. | |
| 124 | + knn_text_bias: float = 0.6 | |
| 125 | + #: Optional extra multiplicative term on weighted text KNN. | |
| 126 | + #: Uses knn_text_bias as the additive floor. | |
| 127 | + knn_text_exponent: float = 0.0 | |
| 128 | + #: Optional additive floor for the weighted image KNN term. | |
| 129 | + #: Falls back to knn_bias when omitted in config loading. | |
| 130 | + knn_image_bias: float = 0.6 | |
| 131 | + #: Optional extra multiplicative term on weighted image KNN. | |
| 132 | + #: Uses knn_image_bias as the additive floor. | |
| 133 | + knn_image_exponent: float = 0.0 | |
| 122 | 134 | fine_bias: float = 0.00001 |
| 123 | 135 | fine_exponent: float = 1.0 |
| 124 | 136 | #: 翻译子句 named query 分数相对原文 base_query 的权重(加权后再与原文做 dismax 融合) |
| ... | ... | @@ -143,6 +155,18 @@ class CoarseRankFusionConfig: |
| 143 | 155 | knn_tie_breaker: float = 0.0 |
| 144 | 156 | knn_bias: float = 0.6 |
| 145 | 157 | knn_exponent: float = 0.2 |
| 158 | + #: Optional additive floor for the weighted text KNN term. | |
| 159 | + #: Falls back to knn_bias when omitted in config loading. | |
| 160 | + knn_text_bias: float = 0.6 | |
| 161 | + #: Optional extra multiplicative term on weighted text KNN. | |
| 162 | + #: Uses knn_text_bias as the additive floor. | |
| 163 | + knn_text_exponent: float = 0.0 | |
| 164 | + #: Optional additive floor for the weighted image KNN term. | |
| 165 | + #: Falls back to knn_bias when omitted in config loading. | |
| 166 | + knn_image_bias: float = 0.6 | |
| 167 | + #: Optional extra multiplicative term on weighted image KNN. | |
| 168 | + #: Uses knn_image_bias as the additive floor. | |
| 169 | + knn_image_exponent: float = 0.0 | |
| 146 | 170 | #: 翻译子句 named query 分数相对原文 base_query 的权重(加权后再与原文做 dismax 融合) |
| 147 | 171 | text_translation_weight: float = 0.8 |
| 148 | 172 | |
| ... | ... | @@ -176,6 +200,9 @@ class RerankConfig: |
| 176 | 200 | |
| 177 | 201 | enabled: bool = True |
| 178 | 202 | rerank_window: int = 384 |
| 203 | + exact_knn_rescore_enabled: bool = False | |
| 204 | + #: topN exact vector scoring window; <=0 means "follow rerank_window" | |
| 205 | + exact_knn_rescore_window: int = 0 | |
| 179 | 206 | timeout_sec: float = 15.0 |
| 180 | 207 | weight_es: float = 0.4 |
| 181 | 208 | weight_ai: float = 0.6 | ... | ... |
docs/DEVELOPER_GUIDE.md
| ... | ... | @@ -389,7 +389,7 @@ services: |
| 389 | 389 | - **位置**:`tests/`,可按 `unit/`、`integration/` 或按模块划分子目录;公共 fixture 在 `conftest.py`。 |
| 390 | 390 | - **标记**:使用 `@pytest.mark.unit`、`@pytest.mark.integration`、`@pytest.mark.api` 等区分用例类型,便于按需运行。 |
| 391 | 391 | - **依赖**:单元测试通过 mock(如 `mock_es_client`、`sample_search_config`)不依赖真实 ES/DB;集成测试需在说明中注明依赖服务。 |
| 392 | -- **运行**:`python -m pytest tests/`;仅单元:`python -m pytest tests/unit/` 或 `-m unit`。 | |
| 392 | +- **运行**:`python -m pytest tests/`;推荐最小回归:`python -m pytest tests/ci -q`;按模块聚焦可直接指定具体测试文件。 | |
| 393 | 393 | - **原则**:新增逻辑应有对应测试;修改协议或配置契约时更新相关测试与 fixture。 |
| 394 | 394 | |
| 395 | 395 | ### 8.3 配置与环境 | ... | ... |
docs/QUICKSTART.md
| ... | ... | @@ -69,7 +69,7 @@ source activate.sh |
| 69 | 69 | ./run.sh all |
| 70 | 70 | # 仅为薄封装:等价于 ./scripts/service_ctl.sh up all |
| 71 | 71 | # 说明: |
| 72 | -# - all = tei cnclip embedding embedding-image translator reranker reranker-fine backend indexer frontend eval-web | |
| 72 | +# - all = tei cnclip embedding embedding-image translator reranker backend indexer frontend eval-web | |
| 73 | 73 | # - up 会同时启动 monitor daemon(运行期连续失败自动重启) |
| 74 | 74 | # - reranker 为 GPU 强制模式(资源不足会直接启动失败) |
| 75 | 75 | # - TEI 默认使用 GPU;当 TEI_DEVICE=cuda 且 GPU 不可用时会直接失败(不会自动降级到 CPU) |
| ... | ... | @@ -166,7 +166,7 @@ curl -X POST http://localhost:6008/embed/image \ |
| 166 | 166 | |
| 167 | 167 | ```bash |
| 168 | 168 | ./scripts/setup_translator_venv.sh |
| 169 | -./.venv-translator/bin/python scripts/download_translation_models.py --all-local # 如需本地模型 | |
| 169 | +./.venv-translator/bin/python scripts/translation/download_translation_models.py --all-local # 如需本地模型 | |
| 170 | 170 | ./scripts/start_translator.sh |
| 171 | 171 | |
| 172 | 172 | curl -X POST http://localhost:6006/translate \ | ... | ... |
docs/Usage-Guide.md
| ... | ... | @@ -126,7 +126,7 @@ cd /data/saas-search |
| 126 | 126 | |
| 127 | 127 | 这个脚本会自动: |
| 128 | 128 | 1. 创建日志目录 |
| 129 | -2. 按目标启动服务(`all`:`tei cnclip embedding embedding-image translator reranker reranker-fine backend indexer frontend eval-web`) | |
| 129 | +2. 按目标启动服务(`all`:`tei cnclip embedding embedding-image translator reranker backend indexer frontend eval-web`) | |
| 130 | 130 | 3. 写入 PID 到 `logs/*.pid` |
| 131 | 131 | 4. 执行健康检查 |
| 132 | 132 | 5. 启动 monitor daemon(运行期连续失败自动重启) |
| ... | ... | @@ -202,7 +202,7 @@ python -m pytest -q tests/test_rerank_client.py tests/test_es_query_builder.py t |
| 202 | 202 | ./scripts/service_ctl.sh restart backend |
| 203 | 203 | sleep 3 |
| 204 | 204 | ./scripts/service_ctl.sh status backend |
| 205 | -./scripts/evaluation/start_eval.sh.sh batch | |
| 205 | +./scripts/evaluation/start_eval.sh batch | |
| 206 | 206 | ``` |
| 207 | 207 | |
| 208 | 208 | 离线批量评估会把标注与报表写到 `artifacts/search_evaluation/`(SQLite、`batch_reports/` 下的 JSON/Markdown 等)。说明与命令见 [scripts/evaluation/README.md](../scripts/evaluation/README.md)。 | ... | ... |
| ... | ... | @@ -0,0 +1,133 @@ |
| 1 | +# 本项目缓存一览 | |
| 2 | + | |
| 3 | +本文档梳理仓库内**与业务相关的各类缓存**:说明用途、键与过期策略,并汇总运维脚本。按「分布式(Redis)→ 进程内 → 磁盘/模型 → 第三方」组织。 | |
| 4 | + | |
| 5 | +--- | |
| 6 | + | |
| 7 | +## 一、Redis 集中式缓存(生产主路径) | |
| 8 | + | |
| 9 | +所有下列缓存默认连接 **`infrastructure.redis`**(`config/config.yaml` 与 `REDIS_*` 环境变量),**数据库编号一般为 `db=0`**(脚本可通过参数覆盖)。`snapshot_db` 仅在配置中存在,供快照/运维场景选用,应用代码未按该字段切换业务缓存的 DB。 | |
| 10 | + | |
| 11 | +### 1. 文本 / 图像向量缓存(Embedding) | |
| 12 | + | |
| 13 | +- **作用**:缓存 BGE/TEI 文本向量与 CN-CLIP 图像向量、CLIP 文本塔向量,避免重复推理。 | |
| 14 | +- **实现**:`embeddings/redis_embedding_cache.py` 的 `RedisEmbeddingCache`;键构造见 `embeddings/cache_keys.py`。 | |
| 15 | +- **Key 形态**(最终 Redis 键 = `前缀` + `可选 namespace` + `逻辑键`): | |
| 16 | + - **前缀**:`infrastructure.redis.embedding_cache_prefix`(默认 `embedding`,可用 `REDIS_EMBEDDING_CACHE_PREFIX` 覆盖)。 | |
| 17 | + - **命名空间**:`embeddings/server.py` 与客户端中分为: | |
| 18 | + - 文本:`namespace=""` → `{prefix}:{embed:norm0|1:...}` | |
| 19 | + - 图像:`namespace="image"` → `{prefix}:image:{embed:模型名:txt:norm0|1:...}` | |
| 20 | + - CLIP 文本:`namespace="clip_text"` → `{prefix}:clip_text:{embed:模型名:img:norm0|1:...}` | |
| 21 | + - 逻辑键段含 `embed:`、`norm0/1`、模型名(多模态)、过长文本/URL 时用 `h:sha256:...` 摘要(见 `cache_keys.py` 注释)。 | |
| 22 | +- **值格式**:BF16 压缩后的字节(`embeddings/bf16.py`),非 JSON。 | |
| 23 | +- **TTL**:`infrastructure.redis.cache_expire_days`(默认 **720 天**,`REDIS_CACHE_EXPIRE_DAYS`)。写入用 `SETEX`;**命中时滑动续期**(`EXPIRE` 刷新为同一时长)。 | |
| 24 | +- **Redis 客户端**:`decode_responses=False`(二进制)。 | |
| 25 | + | |
| 26 | +**主要代码**:`embeddings/server.py`、`embeddings/text_encoder.py`、`embeddings/image_encoder.py`。 | |
| 27 | + | |
| 28 | +--- | |
| 29 | + | |
| 30 | +### 2. 翻译结果缓存(Translation) | |
| 31 | + | |
| 32 | +- **作用**:按「翻译模型 + 目标语言 + 原文」缓存译文;支持**模型质量分层探测**(高 tier 模型写入的缓存可被同 tier 或更高 tier 的请求命中,见 `translation/settings.py` 中 `translation_cache_probe_models`)。 | |
| 33 | +- **Key 形态**:`trans:{model}:{target_lang}:{text前4字符}{sha256全文}`(`translation/cache.py` 的 `build_key`)。 | |
| 34 | +- **值格式**:UTF-8 译文字符串。 | |
| 35 | +- **TTL**:`services.translation.cache.ttl_seconds`(默认 **62208000 秒 = 720 天**)。若 `sliding_expiration: true`,命中时刷新 TTL。 | |
| 36 | +- **能力级开关**:各 `capabilities.*.use_cache` 为 `false` 时该后端不落 Redis。 | |
| 37 | +- **Redis 客户端**:`decode_responses=True`。 | |
| 38 | + | |
| 39 | +**主要代码**:`translation/cache.py`、`translation/service.py`;翻译 HTTP 服务:`api/translator_app.py`(`get_translation_service()` 使用 `lru_cache` 单例,见下文进程内缓存)。 | |
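The documented key scheme can be sketched as a pure function. This is a minimal illustration assuming UTF-8 encoding before hashing; the authoritative implementation is `build_key` in `translation/cache.py`:

```python
import hashlib

def build_key(model: str, target_lang: str, text: str) -> str:
    """Sketch of the documented scheme:
    trans:{model}:{target_lang}:{first 4 chars of text}{sha256 of full text}."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"trans:{model}:{target_lang}:{text[:4]}{digest}"

key = build_key("nllb", "en", "遥控喷雾翻滚多功能车玩具车")
```

Because the key embeds the model name, caches written by different translation backends never collide; the tiered probe described above works by checking several model names for the same text.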
| 40 | + | |
| 41 | +--- | |
| 42 | + | |
| 43 | +### 3. 商品内容理解 / Anchors 与语义分析缓存(Indexer) | |
| 44 | + | |
| 45 | +- **作用**:缓存 LLM 对商品标题等拼出的 **prompt 输入** 所做的分析结果(anchors、语义属性等),避免重复调用大模型。键与 `analysis_kind`、`prompt` 契约版本、`target_lang` 及输入摘要相关。 | |
| 46 | +- **Key 形态**:`{anchor_cache_prefix}:{analysis_kind}:{prompt_contract_hash[:12]}:{target_lang}:{prompt_input[:4]}{md5}`(`indexer/product_enrich.py` 中 `_make_analysis_cache_key`)。 | |
| 47 | +- **前缀**:`infrastructure.redis.anchor_cache_prefix`(默认 `product_anchors`,`REDIS_ANCHOR_CACHE_PREFIX`)。 | |
| 48 | +- **值格式**:JSON 字符串(规范化后的分析结果)。 | |
| 49 | +- **TTL**:`anchor_cache_expire_days`(默认 **30 天**),以秒写入 `SETEX`(**非滑动**,与向量/翻译不同)。 | |
| 50 | +- **读逻辑**:无 TTL 刷新;仅校验内容是否「有意义」再返回。 | |
| 51 | + | |
| 52 | +**主要代码**:`indexer/product_enrich.py`;与 HTTP 侧对齐说明见 `api/routes/indexer.py` 注释。 | |
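The key pattern above can be sketched as follows. The md5 over the prompt input matches the documented pattern; using sha256 for the 12-char contract hash is an assumption here, so check `_make_analysis_cache_key` in `indexer/product_enrich.py` for the real hash choices:

```python
import hashlib

def make_analysis_cache_key(prefix: str, analysis_kind: str,
                            prompt_contract: str, target_lang: str,
                            prompt_input: str) -> str:
    # 12-char contract hash; sha256 is an assumption for this sketch
    contract_hash = hashlib.sha256(prompt_contract.encode("utf-8")).hexdigest()[:12]
    # Documented pattern ends with the first 4 chars of the input plus its md5
    input_digest = hashlib.md5(prompt_input.encode("utf-8")).hexdigest()
    return (f"{prefix}:{analysis_kind}:{contract_hash}:"
            f"{target_lang}:{prompt_input[:4]}{input_digest}")

key = make_analysis_cache_key("product_anchors", "anchors", "v3", "zh", "遥控车玩具")
```

Embedding the contract hash means any prompt-contract change automatically isolates the key space, which is why stale analyses are not reused after a prompt upgrade.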
| 53 | + | |
| 54 | +--- | |
| 55 | + | |
| 56 | +## 二、进程内缓存(非共享、随进程重启失效) | |
| 57 | + | |
| 58 | +| 名称 | 用途 | 范围/生命周期 | | |
| 59 | +|------|------|----------------| | |
| 60 | +| **`get_app_config()`** | 解析并缓存全局 `AppConfig` | `config/loader.py`:`@lru_cache(maxsize=1)`;`reload_app_config()` 可 `cache_clear()` | | |
| 61 | +| **`TranslationService` 单例** | 翻译服务进程内复用后端与 Redis 客户端 | `api/translator_app.py`:`get_translation_service()` | | |
| 62 | +| **`_nllb_tokenizer_code_by_normalized_key`** | NLLB tokenizer 语言码映射 | `translation/languages.py`:`@lru_cache(maxsize=1)` | | |
| 63 | +| **`QueryTextAnalysisCache`** | 单次查询解析内复用分词、tokenizer 结果 | `query/tokenization.py`,随 `QueryParser` 一次 parse | | |
| 64 | +| **`_SelectionContext`(SKU 意图)** | 归一化文本、分词、匹配布尔等小字典 | `search/sku_intent_selector.py`,单次选择流程 | | |
| 65 | +| **`incremental_service` transformer 缓存** | 按 `tenant_id` 缓存文档转换器 | `indexer/incremental_service.py`,**无界**、多租户进程长期存活时需注意内存 | | |
| 66 | +| **NLLB batch 内 `token_count_cache`** | 同一 batch 内避免重复计 token | `translation/backends/local_ctranslate2.py` | | |
| 67 | +| **CLIP 分词器 `@lru_cache`**(第三方) | 简单 tokenizer 缓存 | `third-party/clip-as-service/.../simple_tokenizer.py` | | |
| 68 | + | |
| 69 | +**说明**:`utils/cache.py` 中的 **`DictCache`**(文件 JSON:默认 `.cache/dict_cache.json`)已导出,但仓库内**无直接 `DictCache(` 调用**,视为可复用工具/预留,非当前主路径。 | |
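The `lru_cache(maxsize=1)` singleton-plus-reload pattern used by `get_app_config()` / `reload_app_config()` can be sketched like this (the function body is a stand-in, not the real loader):

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_app_config() -> dict:
    # Stand-in body: the real loader parses config.yaml once per process
    return {"loaded": True}

def reload_app_config() -> dict:
    # Drop the cached singleton so the next call re-parses the config
    get_app_config.cache_clear()
    return get_app_config()
```

Every caller in the same process shares one object until `cache_clear()` runs, which is exactly why these caches are listed as "per-process, lost on restart".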
| 70 | + | |
| 71 | +--- | |
| 72 | + | |
| 73 | +## 三、磁盘与模型相关「缓存」(非 Redis) | |
| 74 | + | |
| 75 | +| 名称 | 用途 | 配置/位置 | | |
| 76 | +|------|------|-----------| | |
| 77 | +| **Hugging Face / 本地模型目录** | 重排器、翻译本地模型等权重下载与缓存 | `services.rerank.backends.*.cache_dir` 等,常见默认 **`./model_cache`**(`config/config.yaml`) | | |
| 78 | +| **vLLM `enable_prefix_caching`** | 重排服务内 **Prefix KV 缓存**(加速同前缀批推理) | `services.rerank.backends.qwen3_vllm*`、`reranker/backends/qwen3_vllm*.py` | | |
| 79 | +| **运行时目录** | 重排服务状态/引擎文件 | `services.rerank.instances.*.runtime_dir`(如 `./.runtime/reranker/...`) | | |
| 80 | + | |
| 81 | +翻译能力里的 **`use_cache: true`**(如 NLLB、Marian)在多数后端指 **推理时的 KV cache(Transformer)**,与 Redis 译文缓存是不同层次;Redis 译文缓存仍由 `TranslationCache` 控制。 | |
| 82 | + | |
| 83 | +--- | |
| 84 | + | |
| 85 | +## 四、Elasticsearch 内部缓存 | |
| 86 | + | |
| 87 | +索引设置中的 `refresh_interval` 等影响近实时可见性,但**不属于应用层键值缓存**。若需调优 ES 查询缓存、节点堆等,见运维文档与集群配置,此处不展开。 | |
| 88 | + | |
| 89 | +--- | |
| 90 | + | |
| 91 | +## 五、运维与巡检脚本(Redis) | |
| 92 | + | |
| 93 | +| 脚本 | 作用 | | |
| 94 | +|------|------| | |
| 95 | +| `scripts/redis/redis_cache_health_check.py` | 按 **embedding / translation / anchors** 三类前缀巡检:key 数量估算、TTL 采样、`IDLETIME` 等 | | |
| 96 | +| `scripts/redis/redis_cache_prefix_stats.py` | 按前缀统计 key 数量与 **MEMORY USAGE**(可多 DB) | | |
| 97 | +| `scripts/redis/redis_memory_heavy_keys.py` | 扫描占用内存最大的 key,辅助排查「统计与总内存不一致」 | | |
| 98 | +| `scripts/redis/monitor_eviction.py` | 实时监控 **eviction** 相关事件,用于容量与驱逐策略排查 | | |
| 99 | + | |
| 100 | +使用前需加载项目配置(如 `source activate.sh`)以保证 `REDIS_CONFIG` 与生产一致。脚本注释中给出了 **`redis-cli` 手工统计**示例(按前缀 `wc -l`、`MEMORY STATS` 等)。 | |
| 101 | + | |
| 102 | +--- | |
| 103 | + | |
| 104 | +## 六、总表(Redis 与各层缓存) | |
| 105 | + | |
| 106 | +| 缓存名称 | 业务模块 | 存储 | Key 前缀 / 命名模式 | 过期时间 | 过期策略 | 值摘要 | 配置键 / 环境变量 | | |
| 107 | +|----------|----------|------|---------------------|----------|----------|--------|-------------------| | |
| 108 | +| 文本向量 | 检索 / 索引 / Embedding 服务 | Redis db≈0 | `{embedding_cache_prefix}:*`(逻辑键以 `embed:norm…` 开头) | `cache_expire_days`(默认 720 天) | 写入 TTL + 命中滑动续期 | BF16 字节向量 | `infrastructure.redis.*`;`REDIS_EMBEDDING_CACHE_PREFIX`、`REDIS_CACHE_EXPIRE_DAYS` | | |
| 109 | +| 图像向量(CLIP 图) | 图搜 / 多模态 | 同上 | `{prefix}:image:*` | 同上 | 同上 | BF16 字节 | 同上 | | |
| 110 | +| CLIP 文本塔向量 | 图搜文本侧 | 同上 | `{prefix}:clip_text:*` | 同上 | 同上 | BF16 字节 | 同上 | | |
| 111 | +| 翻译译文 | 查询翻译、翻译服务 | 同上 | `trans:{model}:{lang}:*` | `services.translation.cache.ttl_seconds`(默认 720 天) | 可配置滑动(`sliding_expiration`) | UTF-8 字符串 | `services.translation.cache.*`;各能力 `use_cache` | | |
| 112 | +| 商品分析 / Anchors | 索引富化、LLM 内容理解 | 同上 | `{anchor_cache_prefix}:{kind}:{hash}:{lang}:*` | `anchor_cache_expire_days`(默认 30 天) | 固定 TTL,不滑动 | JSON 字符串 | `anchor_cache_prefix`、`anchor_cache_expire_days`;`REDIS_ANCHOR_*` | | |
| 113 | +| 应用配置 | 全栈 | 进程内存 | N/A(单例) | 进程生命周期 | `reload_app_config` 清除 | `AppConfig` 对象 | `config/loader.py` | | |
| 114 | +| 翻译服务实例 | 翻译 API | 进程内存 | N/A | 进程生命周期 | 单例 | `TranslationService` | `api/translator_app.py` | | |
| 115 | +| 查询分词缓存 | 查询解析 | 单次请求内 | N/A | 单次 parse | — | 分词与中间结果 | `query/tokenization.py` | | |
| 116 | +| SKU 意图辅助字典 | 搜索排序辅助 | 单次请求内 | N/A | 单次选择 | — | 小 dict | `search/sku_intent_selector.py` | | |
| 117 | +| 增量索引 Transformer | 索引管道 | 进程内存 | `tenant_id` 字符串键 | 长期(无界) | 无自动淘汰 | Transformer 元组 | `indexer/incremental_service.py` | | |
| 118 | +| 重排 / 翻译模型权重 | 推理服务 | 本地磁盘 | 目录路径 | 无自动删除(人工清理) | — | 模型文件 | `cache_dir: ./model_cache` 等 | | |
| 119 | +| vLLM Prefix 缓存 | 重排(Qwen3 等) | GPU/引擎内 | 引擎内部 | 引擎管理 | — | KV Cache | `enable_prefix_caching` | | |
| 120 | +| 文件 Dict 缓存(可选) | 通用 | `.cache/dict_cache.json` | 分类 + 自定义 key | 持久直至删除 | — | JSON 可序列化值 | `utils/cache.py`(当前无调用方) | | |
| 121 | + | |
| 122 | +--- | |
| 123 | + | |
| 124 | +## 七、维护建议(简要) | |
| 125 | + | |
| 126 | +1. **容量**:三类 Redis 缓存(embedding / trans / anchors)可共用同一实例;大租户或图搜多时 **embedding** 与 **trans** 往往占主要内存,可用 `redis_cache_prefix_stats.py` 分前缀观察。 | |
| 127 | +2. **键迁移**:变更 `embedding_cache_prefix`、CLIP `model_name` 或 prompt 契约会自然**隔离新键空间**;旧键依赖 TTL 或人工批量删除。 | |
| 128 | +3. **一致性**:向量缓存对异常向量会 **delete key**(`RedisEmbeddingCache.get`);anchors 依赖 `cache_version` 与契约 hash 防止错误复用。 | |
| 129 | +4. **监控**:除脚本外,Embedding HTTP 服务健康检查会报告各 lane 的 **`cache_enabled`**(`embeddings/server.py`)。 | |
| 130 | + | |
| 131 | +--- | |
| 132 | + | |
| 133 | +*文档随代码扫描生成;若新增 Redis 用途,请同步更新本文件与 `scripts/redis/redis_cache_health_check.py` 中的 `_load_known_cache_types()`。* | ... | ... |
docs/issues/issue-2026-04-08-eval框架主指标ERR的问题以及bm25调参-done-0408.md
0 → 100644
| ... | ... | @@ -0,0 +1,120 @@ |
| 1 | +1. The retrieval system's evaluation currently uses these primary metrics: | |
| 2 | + "NDCG@20, NDCG@50, ERR@10, Strong_Precision@10, Strong_Precision@20, " | |
| 3 | +See `_err_at_k` for reference; its computation logic appears correct. | |
| 4 | +The problem now is that the ERR metric often trends in the opposite direction from the other metrics. Please re-analyze whether it is suitable as one of the primary metrics and what its current issues are. | |
| 5 | + | |
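For concreteness, here is the standard cascade-model ERR@k (Chapelle et al.) as a reference sketch; whether the repo's `_err_at_k` uses the same grade normalization is an assumption to verify. It also shows why ERR can move against NDCG: a single strong hit at rank 1 nearly saturates ERR, while many moderate hits barely help it.

```python
def err_at_k(gains, k, max_grade=2):
    # ERR: expected reciprocal rank under the cascade model.
    # Stop probability at rank i: R_i = (2^g_i - 1) / 2^max_grade
    p_continue = 1.0
    err = 0.0
    for i, g in enumerate(gains[:k], start=1):
        r = (2 ** g - 1) / (2 ** max_grade)
        err += p_continue * r / i
        p_continue *= 1.0 - r

    return err

top_heavy = err_at_k([2, 0, 0, 0, 0, 0, 0, 0, 0, 0], 10)  # one perfect hit
balanced = err_at_k([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], 10)   # ten moderate hits
```

`top_heavy` comes out larger than `balanced` even though the second list has far more relevant results, which is exactly the "opposite trend" versus coverage-oriented metrics like NDCG or Strong_Precision.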
| 6 | +2. The current BM25 parameters are: | |
| 7 | +"b": 0.1, | |
| 8 | +"k1": 0.3 | |
| 9 | +The corresponding baseline is /data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260408T055948Z_00b6a8aa3d.md (Primary_Metric_Score: 0.604555). | |
| 10 | + | |
| 11 | +(This is much better than the earlier setting of both b and k1 to 0: /data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260407T150946Z_00b6a8aa3d.md, Primary_Metric_Score: 0.602598.) | |
| 12 | + | |
| 13 | +The background for changing these two parameters from 0 to 0.1/0.3: | |
| 19 | +This change adjusts the BM25 parameters used by the combined query. | |
| 20 | + | |
| 21 | +Previously, both `b` and `k1` were set to `0.0`. The original intention was to avoid two common issues in e-commerce search relevance: | |
| 22 | + | |
| 23 | +1. Over-penalizing longer product titles | |
| 24 | + In product search, a shorter title should not automatically rank higher just because BM25 favors shorter fields. For example, for a query like “遥控车”, a product whose title is simply “遥控车” is not necessarily a better candidate than a product with a slightly longer but more descriptive title. In practice, extremely short titles may even indicate lower-quality catalog data. | |
| 25 | + | |
| 26 | +2. Over-rewarding repeated occurrences of the same term | |
| 27 | + For longer queries such as “遥控喷雾翻滚多功能车玩具车”, the default BM25 behavior may give too much weight to a term that appears multiple times (for example “遥控”), even when other important query terms such as “喷雾” or “翻滚” are missing. This can cause products with repeated partial matches to outrank products that actually cover more of the user intent. | |
| 28 | + | |
| 29 | +Setting both parameters to zero was an intentional way to suppress length normalization and term-frequency amplification. However, after introducing a `combined_fields` query, this configuration becomes too aggressive. Since `combined_fields` scores multiple fields as a unified relevance signal, completely disabling both effects may also remove useful ranking information, especially when we still want documents matching more query terms across fields to be distinguishable from weaker matches. | |
| 30 | + | |
| 31 | +This update therefore relaxes the previous setting and reintroduces a controlled amount of BM25 normalization/scoring behavior. The goal is to keep the original intent — avoiding short-title bias and excessive repeated-term gain — while allowing the combined query to better preserve meaningful relevance differences across candidates. | |
| 32 | + | |
| 33 | +Expected effect: | |
| 34 | +- reduce the bias toward unnaturally short product titles | |
| 35 | +- limit score inflation caused by repeated occurrences of the same term | |
| 36 | +- improve ranking stability for `combined_fields` queries | |
| 37 | +- better reward candidates that cover more of the overall query intent, instead of those that only repeat a subset of terms | |
| 38 | + | |
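The roles of `b` and `k1` described above follow directly from the per-term BM25 formula. A small sketch of standard Lucene-style BM25 (not code from the repo) showing that `k1 = 0` makes term frequency irrelevant and `b = 0` disables length normalization:

```python
def bm25_term_score(tf: float, doc_len: int, avg_len: float,
                    idf: float, k1: float, b: float) -> float:
    # Standard BM25: idf * tf*(k1+1) / (tf + k1*(1 - b + b*doc_len/avg_len))
    norm = 1.0 - b + b * (doc_len / avg_len)
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * norm)

idf = 2.0
# k1=0, b=0: the score collapses to plain IDF, regardless of tf or doc length
assert bm25_term_score(5, 40, 12, idf, 0.0, 0.0) == idf
assert bm25_term_score(1, 8, 12, idf, 0.0, 0.0) == idf
# k1=0.3, b=0.1: repeated terms gain a little, far less than Lucene's defaults
mild = bm25_term_score(5, 40, 12, idf, 0.3, 0.1)
default = bm25_term_score(5, 40, 12, idf, 1.2, 0.75)
```

With `k1=0.3, b=0.1`, a term repeated five times scores only modestly above a single occurrence, which matches the "controlled amount of BM25 scoring behavior" this change aims for.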
| 39 | + | |
| 40 | +Since the experiment worked, please continue experimenting for me. | |
| 41 | + | |
| 42 | +Please run these four more rounds, compare the results, and optimize the BM25 parameters: | |
| 43 | +{ "b": 0.10, "k1": 0.30 } | |
| 44 | +{ "b": 0.20, "k1": 0.60 } | |
| 45 | +{ "b": 0.50, "k1": 1.0 } | |
| 46 | +{ "b": 0.10, "k1": 0.75 } | |
| 47 | + | |
| 48 | +Method for changing index-level settings (BM25 `similarity.default`): | |
| 49 | + | |
| 50 | +The `settings.similarity` in `mappings/search_products.json` only takes effect when **creating a new index**; for an **existing index**, close the index first, then `PUT _settings`, and finally reopen it. | |
| 51 | + | |
| 52 | +**Use case**: tuning the default BM25 `b` and `k1` (e.g. aligning with the repo mapping: `b: 0.1`, `k1: 0.3`). | |
| 53 | + | |
| 54 | +```bash | |
| 55 | +# Replace as needed: index name, credentials, ES address | |
| 56 | +INDEX="search_products_tenant_163" | |
| 57 | +AUTH='saas:4hOaLaf41y2VuI8y' | |
| 58 | +ES="http://localhost:9200" | |
| 59 | + | |
| 60 | +# 1) Close the index (write requests will fail; mind the maintenance window) | |
| 61 | +curl -s -u "$AUTH" -X POST "$ES/${INDEX}/_close" | |
| 62 | + | |
| 63 | +# 2) Update settings (example only; copy verbatim when matching the default in mappings) | |
| 64 | +curl -s -u "$AUTH" -X PUT "$ES/${INDEX}/_settings" \ | |
| 65 | + -H 'Content-Type: application/json' \ | |
| 66 | + -d '{ | |
| 67 | + "index": { | |
| 68 | + "similarity": { | |
| 69 | + "default": { | |
| 70 | + "type": "BM25", | |
| 71 | + "b": 0.1, | |
| 72 | + "k1": 0.3 | |
| 73 | + } | |
| 74 | + } | |
| 75 | + } | |
| 76 | +}' | |
| 77 | + | |
| 78 | +# 3) Reopen the index | |
| 79 | +curl -s -u "$AUTH" -X POST "$ES/${INDEX}/_open" | |
| 80 | +``` | |
| 81 | + | |
| 82 | +**Verify it took effect**: | |
| 83 | + | |
| 84 | +```bash | |
| 85 | +curl -s -u "$AUTH" -X GET "$ES/${INDEX}/_settings?filter_path=**.similarity&pretty" | |
| 86 | +``` | |
| 87 | + | |
| 88 | +Expect to see `type`, `b`, and `k1` under `similarity.default` in the response (the API may return the numbers as strings; this is normal). | |
| 89 | + | |
| 90 | +**Multi-tenant batch**: list the indices first, then repeat the close → settings → open steps for each `search_products_tenant_*` index. | |
| 91 | + | |
| 92 | +```bash | |
| 93 | +curl -s -u "$AUTH" -X GET "$ES/_cat/indices/search_products_tenant_*?h=index&v" | |
| 94 | +``` | |
| 95 | + | |
| 96 | + | |
| 97 | + | |
| 98 | +Experiment results: | |
| 99 | +All four rounds are done, and the index settings of `search_products_tenant_163` have been left at the best combination: `b=0.5, k1=1.0`. Just verified that it is in effect. | |
| 100 | + | |
| 101 | +The comparison is as follows: | |
| 102 | + | |
| 103 | +- Baseline, current production setting `b=0.10, k1=0.30`: `0.604555` | |
| 104 | + Report: [batch_20260408T055948Z_00b6a8aa3d.md](/data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260408T055948Z_00b6a8aa3d.md) | |
| 105 | +- Rerun of `b=0.10, k1=0.30`: `0.604298` | |
| 106 | + Report: [batch_20260408T064011Z_00b6a8aa3d.json](/data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260408T064011Z_00b6a8aa3d.json) | |
| 107 | +- Experiment `b=0.20, k1=0.60`: `0.604274` | |
| 108 | + Report: [batch_20260408T064123Z_00b6a8aa3d.json](/data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260408T064123Z_00b6a8aa3d.json) | |
| 109 | +- Experiment `b=0.50, k1=1.0`: `0.606041` | |
| 110 | + Report: [batch_20260408T064216Z_00b6a8aa3d.json](/data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260408T064216Z_00b6a8aa3d.json) | |
| 111 | +- Experiment `b=0.10, k1=0.75`: `0.603438` | |
| 112 | + Report: [batch_20260408T064304Z_00b6a8aa3d.json](/data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260408T064304Z_00b6a8aa3d.json) | |
| 113 | +- Old baseline `b=0.0, k1=0.0`: `0.602598` | |
| 114 | + Report: [batch_20260407T150946Z_00b6a8aa3d.md](/data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260407T150946Z_00b6a8aa3d.md) | |
| 115 | + | |
| 116 | +Conclusion: of these four combinations, `b=0.50, k1=1.0` is the best, up `+0.001486` versus the given current baseline `0.604555` and `+0.003443` versus `0/0`. `0.20/0.60` is essentially on par with `0.10/0.30`, while `0.10/0.75` is actually worse, suggesting that in this round raising `k1` alone is inferior to relaxing `b` and `k1` together. | |
| 117 | + | |
| 118 | +One more note: only the index-level setting of the existing index `search_products_tenant_163` was changed; the repo's default mapping was not touched. [search_products.json](/data/saas-search/mappings/search_products.json) is still at `0.1/0.3`. If you want, the next step can update the mapping default to `0.5/1.0` as well and batch-sync the other `search_products_tenant_*` indices. | |
| 119 | + | |
| 120 | + | ... | ... |
| ... | ... | @@ -0,0 +1,43 @@ |
| 1 | +120.76.41.98, port 22, username and password: | |
| 2 | +tw twtw@123 (has sudo privileges) | |
| 3 | +This project is already deployed in /home/tw/saas-search on that machine. | |
| 4 | +Please get the project running: | |
| 5 | +1. Check out a test-environment branch for me; on this branch, disable the reranking and translation models, because this machine has limited GPU memory (the embedding models can stay). | |
| 6 | +2. On this branch, bring up all the services. | |
| 7 | +3. Install an ES with Docker, following this project's ES9*.md docs. The machine already runs a system Elasticsearch, so to avoid interference install the ES9 this project depends on in Docker, and adapt the ES address in the test-environment config accordingly. | |
| 8 | + | |
| 9 | + | |
| 10 | +1. Port 6005 is not to be disabled; a text service is already running on 6005, so just use it directly. | |
| 11 | +2. The service on 6005 was actually started from an early historical version of this project, in another directory: /home/tw/SearchEngine. Look at its startup configuration: | |
| 12 | +nohup bash scripts/start_embedding_service.sh > log.start_embedding_service.0412 2>&1 & | |
| 13 | +That is how it was started. | |
| 14 | +Check which text-embedding scheme and which model it is configured with, and align with it (I mean the current test branch). | |
| 15 | + | |
| 16 | + | |
| 17 | + | |
| 18 | + | |
| 19 | + | |
| 20 | + | |
| 21 | + | |
| 22 | +I have deployed a test environment on this machine: | |
| 23 | +120.76.41.98, port 22, username and password: | |
| 24 | +tw twtw@123 (has sudo privileges) | |
| 25 | +cd /home/tw/saas-search | |
| 26 | +$ git branch | |
| 27 | +  master | |
| 28 | +* test/small-gpu-es9 | |
| 29 | + | |
| 30 | +I want the only differences to be: | |
| 31 | +1. ES config (the test environment connects to a Docker ES on that machine at port 19200) and Redis config. | |
| 32 | +2. The reranker is disabled; do not start the reranker service. | |
| 33 | + | |
| 34 | +Nothing else should differ. | |
| 35 | + | |
| 36 | +However, startup has problems: the translation service is now reporting errors. | |
| 37 | +This shows the project's portability is currently poor. Please investigate the failure cause, first fix and improve portability locally (on this machine, i.e. the master branch in the current directory), then update the remote side, keeping the test branch different from master only in a small set of config-level changes. After that, get the translation service running on the test machine, and finally bring up the whole stack. | |
| 38 | + | |
| 39 | + | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | ... | ... |
| ... | ... | @@ -0,0 +1,25 @@ |
| 1 | +Requirements: | |
| 2 | +Currently 160 results (rerank_window: 160) enter reranking. In reranking, text-vector and image-vector relevance each serve as a factor in the fusion formula (both in coarse ranking and in the reranker): | |
| 3 | +knn_score | |
| 4 | +text_knn | |
| 5 | +image_knn | |
| 6 | +text_factor | |
| 7 | +knn_factor | |
| 8 | +However, text-vector and image-vector recall use KNN index retrieval, so not every result carries these two scores; both can be 0. | |
| 9 | +To fix this, one approach is: for the 160 candidates that actually enter reranking, find which ones are missing the text- and image-vector recall scores, then either have ES compute them somehow, or pull the vectors back from ES and compute them ourselves, or configure the recall request so the top N results are guaranteed to carry both scores. I am not sure what methods exist, and none of these feels great; please think it through. | |
| 10 | + | |
| 11 | +One option under consideration: | |
| 12 | +Within the "first ES search", do exact vector scoring only for the top N, via rescore or retriever.rescorer (the official docs explicitly support multi-stage rescore and score_mode: multiply, and their examples even show function_score/script_score inside rescore). | |
| 13 | +This means you can perfectly well: | |
| 14 | +- keep the initial retrieval as today's lexical + text knn + image knn candidate recall | |
| 15 | +- rescore with window_size=160 | |
| 16 | +- use an exact script_score to backfill text/image vector scores for the top 160 | |
| 17 | +- and, while at it, move the current local coarse fusion back into ES | |
| 18 | + | |
| 19 | +export ES_AUTH="saas:4hOaLaf41y2VuI8y" | |
| 20 | +export ES="http://127.0.0.1:9200" | |
| 21 | +"index":"search_products_tenant_163" | |
| 22 | + | |
| 23 | +A detail surfaced: vector functions such as dotProduct() work in the script_score scoring context but are not recognized in the script_fields field-fetch context. So if we want to pass the exact scores back to rerank via script_fields, we would have to write the array loop ourselves rather than calling the built-in vector functions. | |
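As a concrete shape for the rescore option above, here is a hypothetical request fragment. The field names `text_vector` / `image_vector`, the weights, and the query vectors are illustrative assumptions, not taken from the repo's mappings; `dotProduct` is the Painless built-in available in the `script_score` context:

```python
# Hypothetical rescore fragment for the top-160 exact-scoring idea.
rescore_body = {
    "rescore": {
        "window_size": 160,  # matches rerank_window
        "query": {
            "rescore_query": {
                "script_score": {
                    "query": {"match_all": {}},
                    "script": {
                        # Exact dot products against both vector fields
                        "source": (
                            "double t = dotProduct(params.qt, 'text_vector');"
                            "double i = dotProduct(params.qi, 'image_vector');"
                            "return params.wt * t + params.wi * i;"
                        ),
                        "params": {
                            "qt": [0.1, 0.2],  # query text vector (placeholder)
                            "qi": [0.3, 0.4],  # query image vector (placeholder)
                            "wt": 0.6,
                            "wi": 0.4,
                        },
                    },
                }
            },
            "query_weight": 1.0,
            "rescore_query_weight": 1.0,
        },
    }
}
```

Missing-vector documents would still need handling (e.g. a null check in the script), and per the script_fields caveat above, this fragment only changes the top-160 `_score`; it does not by itself return the individual text/image components to the reranker.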
| 24 | + | |
| 25 | +Can the scores the rerank formula needs (base_query, base_query_trans_zh, knn_query, image_knn_query) still be obtained? Please consider how to get these scores; if they really cannot be obtained, think of alternatives, such as simplifying the scoring formula. | |
docs/工作总结-微服务性能优化与架构.md
| ... | ... | @@ -98,7 +98,7 @@ instruction: "Given a shopping query, rank product titles by relevance" |
| 98 | 98 | **能力**:支持根据商品标题批量生成 **qanchors**(锚文本)、**enriched_attributes**、**tags**,供索引与 suggest 使用。 |
| 99 | 99 | |
| 100 | 100 | **具体内容**: |
| 101 | -- **接口**:`POST /indexer/enrich-content`(Indexer 服务端口 **6004**)。请求体为 `items` 数组,每项含 `spu_id`、`title`(必填)及可选多语言标题等;单次请求最多 **50 条**,建议批量调用。响应 `results` 与 `items` 一一对应,每项含 `spu_id`、`qanchors`(按语言键,如 `qanchors.zh`、`qanchors.en`,逗号分隔短语)、`enriched_attributes`、`tags`。 | |
| 101 | +- **接口**:`POST /indexer/enrich-content`(FacetAwareMatching 服务端口 **6001**)。请求体为 `items` 数组,每项含 `spu_id`、`title`(必填)及可选多语言标题等;单次请求最多 **50 条**,建议批量调用。响应 `results` 与 `items` 一一对应,每项含 `spu_id`、`qanchors`(按语言键,如 `qanchors.zh`、`qanchors.en`,逗号分隔短语)、`enriched_attributes`、`tags`。 | |
| 102 | 102 | - **索引侧**:微服务组合方式下,调用方先拿不含 qanchors/tags 的 doc,再调用本接口补齐后写入 ES 的 `qanchors.{lang}` 等字段;索引 transformer(`indexer/document_transformer.py`、`indexer/product_enrich.py`)内也可在构建 doc 时调用内容理解逻辑,写入 `qanchors.{lang}`。 |
| 103 | 103 | - **Suggest 侧**:`suggestion/builder.py` 从 ES 商品索引读取 `_source: ["id", "spu_id", "title", "qanchors"]`,对 `qanchors.{lang}` 用 `_split_qanchors` 拆成词条,以 `source="qanchor"` 加入候选,排序时 `qanchor` 权重大于纯 title(`add_product("qanchor", ...)`);suggest 配置中 `sources: ["query_log", "qanchor"]` 表示候选来源包含 qanchor。 |
| 104 | 104 | - **实现与依赖**:内容理解内部使用大模型(需 `DASHSCOPE_API_KEY`),支持多语言与 Redis 缓存(如 `product_anchors`);逻辑与 `indexer/product_enrich` 一致。 |
| ... | ... | @@ -129,12 +129,12 @@ instruction: "Given a shopping query, rank product titles by relevance" |
| 129 | 129 | - 可选:embedding(text) **6005**、embedding-image **6008**、translator **6006**、reranker **6007**、tei **8080**、cnclip **51000**。 |
| 130 | 130 | - 端口可由环境变量覆盖:`API_PORT`、`INDEXER_PORT`、`FRONTEND_PORT`、`EVAL_WEB_PORT`、`EMBEDDING_TEXT_PORT`、`EMBEDDING_IMAGE_PORT`、`TRANSLATION_PORT`、`RERANKER_PORT`、`TEI_PORT`、`CNCLIP_PORT`。 |
| 131 | 131 | - **命令**: |
| 132 | - - `./scripts/service_ctl.sh start [service...]` 或 `up all` / `start all`(all 含 tei、cnclip、embedding、embedding-image、translator、reranker、reranker-fine、backend、indexer、frontend、eval-web,按依赖顺序);`stop`、`restart`、`down` 同参数;`status` 默认列出所有服务。 | |
| 132 | + - `./scripts/service_ctl.sh start [service...]` 或 `up all` / `start all`(all 含 tei、cnclip、embedding、embedding-image、translator、reranker、backend、indexer、frontend、eval-web,按依赖顺序);`stop`、`restart`、`down` 同参数;`status` 默认列出所有服务。 | |
| 133 | 133 | - 启动时:backend/indexer/frontend/embedding/translator/reranker 会写 pid 到 `logs/<service>.pid`,并执行 `wait_for_health`(GET `http://127.0.0.1:<port>/health`);reranker 健康重试 90 次,其余 30 次;TEI 校验 Docker 容器存在且 `/health` 成功;cnclip 无 HTTP 健康则仅校验进程/端口。 |
| 134 | 134 | - **监控常驻**: |
| 135 | 135 | - `./scripts/service_ctl.sh monitor-start <targets>` 启动后台监控进程,将 targets 写入 `logs/service-monitor.targets`,pid 写入 `logs/service-monitor.pid`,日志追加到 `logs/service-monitor.log`。 |
| 136 | - - 轮询间隔 `MONITOR_INTERVAL_SEC` 默认 **10** 秒;连续 **3** 次(`MONITOR_FAIL_THRESHOLD`)健康失败则触发重启;重启冷却 `MONITOR_RESTART_COOLDOWN_SEC` 默认 **30** 秒;每小时最多重启 `MONITOR_MAX_RESTARTS_PER_HOUR` 默认 **6** 次;超限时调用 `scripts/wechat_alert.py` 告警(若存在)。 | |
| 137 | -- **日志**:各服务按日滚动到 `logs/<service>-<date>.log`,通过 `scripts/daily_log_router.sh` 与 `LOG_RETENTION_DAYS`(默认 30)控制保留。 | |
| 136 | + - 轮询间隔 `MONITOR_INTERVAL_SEC` 默认 **10** 秒;连续 **3** 次(`MONITOR_FAIL_THRESHOLD`)健康失败则触发重启;重启冷却 `MONITOR_RESTART_COOLDOWN_SEC` 默认 **30** 秒;每小时最多重启 `MONITOR_MAX_RESTARTS_PER_HOUR` 默认 **6** 次;超限时调用 `scripts/ops/wechat_alert.py` 告警(若存在)。 | |
| 137 | +- **日志**:各服务按日滚动到 `logs/<service>-<date>.log`,通过 `scripts/ops/daily_log_router.sh` 与 `LOG_RETENTION_DAYS`(默认 30)控制保留。 | |
| 138 | 138 | |
| 139 | 139 | 详见:`scripts/service_ctl.sh` 内注释及 `docs/Usage-Guide.md`。 |
| 140 | 140 | |
| ... | ... | @@ -153,12 +153,12 @@ instruction: "Given a shopping query, rank product titles by relevance" |
| 153 | 153 | |
| 154 | 154 | ## 三、性能测试报告摘要 |
| 155 | 155 | |
| 156 | -以下数据来自 `docs/性能测试报告.md`,测试时间 **2026-03-12**,环境:**8 vCPU**(Intel Xeon Platinum 8255C @ 2.50GHz)、**约 15Gi 可用内存**;租户 **162** 文档数约 **53**(search/search/suggestions/rerank 与文档规模相关)。压测工具:`scripts/perf_api_benchmark.py`,场景×并发矩阵,每档 **20s**。 | |
| 156 | +以下数据来自 `docs/性能测试报告.md`,测试时间 **2026-03-12**,环境:**8 vCPU**(Intel Xeon Platinum 8255C @ 2.50GHz)、**约 15Gi 可用内存**;租户 **162** 文档数约 **53**(search/search/suggestions/rerank 与文档规模相关)。压测工具:`benchmarks/perf_api_benchmark.py`,场景×并发矩阵,每档 **20s**。 | |
| 157 | 157 | |
| 158 | 158 | **复现命令(四场景×四并发)**: |
| 159 | 159 | ```bash |
| 160 | 160 | cd /data/saas-search |
| 161 | -.venv/bin/python scripts/perf_api_benchmark.py \ | |
| 161 | +.venv/bin/python benchmarks/perf_api_benchmark.py \ | |
| 162 | 162 | --scenario backend_search,backend_suggest,embed_text,rerank \ |
| 163 | 163 | --concurrency-list 1,5,10,20 \ |
| 164 | 164 | --duration 20 \ |
| ... | ... | @@ -188,7 +188,7 @@ cd /data/saas-search |
| 188 | 188 | |
| 189 | 189 | 口径:query 固定 `wireless mouse`,每次请求 **386 docs**,句长 15–40 词随机(从 1000 词池采样);配置 `rerank_window=384`。复现命令: |
| 190 | 190 | ```bash |
| 191 | -.venv/bin/python scripts/perf_api_benchmark.py \ | |
| 191 | +.venv/bin/python benchmarks/perf_api_benchmark.py \ | |
| 192 | 192 | --scenario rerank --duration 20 --concurrency-list 1,5,10,20 --timeout 60 \ |
| 193 | 193 | --rerank-dynamic-docs --rerank-doc-count 386 --rerank-vocab-size 1000 \ |
| 194 | 194 | --rerank-sentence-min-words 15 --rerank-sentence-max-words 40 \ |
| ... | ... | @@ -217,7 +217,7 @@ cd /data/saas-search |
| 217 | 217 | | 10 | 181 | 100% | 8.78 | 1129.23| 1295.88| 1330.96| |
| 218 | 218 | | 20 | 161 | 100% | 7.63 | 2594.00| 4706.44| 4783.05| |
| 219 | 219 | |
| 220 | -**结论**:吞吐约 **8 rps** 平台化,延迟随并发上升明显,符合“检索 + 向量 + 重排”重链路特征。多租户补测(文档数 500–10000,见报告 §12)表明:文档数越大,RPS 下降、延迟升高;tenant 0(10000 doc)在并发 20 出现部分 ReadTimeout(成功率 59.02%),需注意 timeout 与容量规划;补测命令示例:`for t in 0 1 2 3 4; do .venv/bin/python scripts/perf_api_benchmark.py --scenario backend_search --concurrency-list 1,5,10,20 --duration 20 --tenant-id $t --output perf_reports/2026-03-12/search_tenant_matrix/tenant_${t}.json; done`。 | |
| 220 | +**结论**:吞吐约 **8 rps** 平台化,延迟随并发上升明显,符合“检索 + 向量 + 重排”重链路特征。多租户补测(文档数 500–10000,见报告 §12)表明:文档数越大,RPS 下降、延迟升高;tenant 0(10000 doc)在并发 20 出现部分 ReadTimeout(成功率 59.02%),需注意 timeout 与容量规划;补测命令示例:`for t in 0 1 2 3 4; do .venv/bin/python benchmarks/perf_api_benchmark.py --scenario backend_search --concurrency-list 1,5,10,20 --duration 20 --tenant-id $t --output perf_reports/2026-03-12/search_tenant_matrix/tenant_${t}.json; done`。 | |
| 221 | 221 | |
| 222 | 222 | --- |
| 223 | 223 | |
| ... | ... | @@ -247,5 +247,5 @@ cd /data/saas-search |
| 247 | 247 | |
| 248 | 248 | **关键文件与复现**: |
| 249 | 249 | - 配置:`config/config.yaml`(services、rerank、query_config)、`.env`(端口与 API Key)。 |
| 250 | -- 脚本:`scripts/service_ctl.sh`(启停与监控)、`scripts/perf_api_benchmark.py`(压测)、`scripts/build_suggestions.sh`(suggest 构建)。 | |
| 250 | +- 脚本:`scripts/service_ctl.sh`(启停与监控)、`benchmarks/perf_api_benchmark.py`(压测)、`scripts/build_suggestions.sh`(suggest 构建)。 | |
| 251 | 251 | - 完整步骤与多租户/rerank 对比见:`docs/性能测试报告.md`。 | ... | ... |
docs/常用查询 - ES.md
| 1 | - | |
| 2 | - | |
| 3 | 1 | ## Elasticsearch 排查流程 |
| 4 | 2 | |
| 3 | +使用前加载环境变量: | |
| 4 | +```bash | |
| 5 | +set -a; source .env; set +a | |
| 6 | +# 或直接 export | |
| 7 | +export ES_AUTH="saas:4hOaLaf41y2VuI8y" | |
| 8 | +export ES="http://127.0.0.1:9200" | |
| 9 | +``` | |
| 10 | + | |
| 5 | 11 | ### 1. 集群健康状态 |
| 6 | 12 | |
| 7 | 13 | ```bash |
| 8 | 14 | # 集群整体健康(green / yellow / red) |
| 9 | -curl -s -u 'saas:4hOaLaf41y2VuI8y' 'http://127.0.0.1:9200/_cluster/health?pretty' | |
| 15 | +curl -s -u "$ES_AUTH" 'http://127.0.0.1:9200/_cluster/health?pretty' | |
| 10 | 16 | ``` |
| 11 | 17 | |
| 12 | 18 | ### 2. 索引概览 |
| 13 | 19 | |
| 14 | 20 | ```bash |
| 15 | 21 | # 查看所有租户索引状态与体积 |
| 16 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/_cat/indices/search_products_tenant_*?v' | |
| 22 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/_cat/indices/search_products_tenant_*?v' | |
| 17 | 23 | |
| 18 | 24 | # 或查看全部索引 |
| 19 | -curl -s -u 'saas:4hOaLaf41y2VuI8y' 'http://127.0.0.1:9200/_cat/indices?v' | |
| 25 | +curl -s -u "$ES_AUTH" 'http://127.0.0.1:9200/_cat/indices?v' | |
| 20 | 26 | ``` |
| 21 | 27 | |
| 22 | 28 | ### 3. 分片分布 |
| 23 | 29 | |
| 24 | 30 | ```bash |
| 25 | 31 | # 查看分片在各节点的分布情况 |
| 26 | -curl -s -u 'saas:4hOaLaf41y2VuI8y' 'http://127.0.0.1:9200/_cat/shards?v' | |
| 32 | +curl -s -u "$ES_AUTH" 'http://127.0.0.1:9200/_cat/shards?v' | |
| 27 | 33 | ``` |
| 28 | 34 | |
| 29 | 35 | ### 4. 分配诊断(如有异常) |
| 30 | 36 | |
| 31 | 37 | ```bash |
| 32 | 38 | # 当 health 非 green 或 shards 状态异常时,定位具体原因 |
| 33 | -curl -s -u 'saas:4hOaLaf41y2VuI8y' -X POST 'http://127.0.0.1:9200/_cluster/allocation/explain?pretty' \ | |
| 39 | +curl -s -u "$ES_AUTH" -X POST 'http://127.0.0.1:9200/_cluster/allocation/explain?pretty' \ | |
| 34 | 40 | -H 'Content-Type: application/json' \ |
| 35 | 41 | -d '{"index":"search_products_tenant_163","shard":0,"primary":true}' |
| 36 | 42 | ``` |
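allocation explain 的返回 JSON 较长,可用 jq 只提取关键结论字段(以下以未分配分片场景为例,`can_allocate`、`allocate_explanation` 是该 API 的标准返回字段;需本机安装 jq):

```bash
# 仅提取分配诊断的关键结论,避免翻阅整段 JSON
curl -s -u "$ES_AUTH" -X POST 'http://127.0.0.1:9200/_cluster/allocation/explain' \
  -H 'Content-Type: application/json' \
  -d '{"index":"search_products_tenant_163","shard":0,"primary":true}' \
  | jq '{index, shard, primary, can_allocate, allocate_explanation}'
```

若分片已分配,返回字段会变为 `can_remain_on_current_node` 等,按需调整 jq 过滤器即可。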
| ... | ... | @@ -60,6 +66,54 @@ cat /etc/elasticsearch/elasticsearch.yml |
| 60 | 66 | journalctl -u elasticsearch -f |
| 61 | 67 | ``` |
| 62 | 68 | |
| 69 | +### 7. 修改索引级设置(如 BM25 `similarity.default`) | |
| 70 | + | |
| 71 | +`mappings/search_products.json` 里的 `settings.similarity` 只在**创建新索引**时生效;**已有索引**需先关闭索引,再 `PUT _settings`,最后重新打开。 | |
| 72 | + | |
| 73 | +**适用场景**:调整默认 BM25 的 `b`、`k1`(例如与仓库映射对齐:`b: 0.1`、`k1: 0.3`)。 | |
| 74 | + | |
| 75 | +```bash | |
| 76 | +# 按需替换:索引名、账号密码、ES 地址 | |
| 77 | +INDEX="search_products_tenant_163" | |
| 78 | +AUTH="$ES_AUTH" | |
| 79 | +ES="http://localhost:9200" | |
| 80 | + | |
| 81 | +# 1) 关闭索引(关闭期间读写请求均会失败,注意安排维护窗口) | |
| 82 | +curl -s -u "$AUTH" -X POST "$ES/${INDEX}/_close" | |
| 83 | + | |
| 84 | +# 2) 更新设置(示例值与 mappings 中的 default 保持一致,可直接照抄或按需调整) | |
| 85 | +curl -s -u "$AUTH" -X PUT "$ES/${INDEX}/_settings" \ | |
| 86 | + -H 'Content-Type: application/json' \ | |
| 87 | + -d '{ | |
| 88 | + "index": { | |
| 89 | + "similarity": { | |
| 90 | + "default": { | |
| 91 | + "type": "BM25", | |
| 92 | + "b": 0.1, | |
| 93 | + "k1": 0.3 | |
| 94 | + } | |
| 95 | + } | |
| 96 | + } | |
| 97 | +}' | |
| 98 | + | |
| 99 | +# 3) 重新打开索引 | |
| 100 | +curl -s -u "$AUTH" -X POST "$ES/${INDEX}/_open" | |
| 101 | +``` | |
| 102 | + | |
| 103 | +**检查是否生效**: | |
| 104 | + | |
| 105 | +```bash | |
| 106 | +curl -s -u "$AUTH" -X GET "$ES/${INDEX}/_settings?filter_path=**.similarity&pretty" | |
| 107 | +``` | |
| 108 | + | |
| 109 | +期望在响应中看到 `similarity.default` 的 `type`、`b`、`k1`(API 可能将数值以字符串形式返回,属正常)。 | |
| 110 | + | |
| 111 | +**多租户批量**:先列出索引,再对每个 `search_products_tenant_*` 重复上述 close → settings → open。 | |
| 112 | + | |
| 113 | +```bash | |
| 114 | +curl -s -u "$AUTH" -X GET "$ES/_cat/indices/search_products_tenant_*?h=index&v" | |
| 115 | +``` | |
| 116 | + | |
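上面的 close → settings → open 可以串成一个循环脚本批量执行(示意脚本:假设已按本文开头 export `ES_AUTH`,`b`/`k1` 沿用上文示例值;关闭索引期间读写均不可用,请安排维护窗口,并先在单个索引试跑):

```bash
#!/usr/bin/env bash
# 批量调整所有租户索引的 BM25 similarity(示意脚本:先在单个索引验证,再全量执行)
set -euo pipefail

ES="${ES:-http://localhost:9200}"
AUTH="${ES_AUTH:?请先 export ES_AUTH}"
SETTINGS='{"index":{"similarity":{"default":{"type":"BM25","b":0.1,"k1":0.3}}}}'

for INDEX in $(curl -s -u "$AUTH" "$ES/_cat/indices/search_products_tenant_*?h=index"); do
  echo ">>> ${INDEX}"
  curl -s -u "$AUTH" -X POST "$ES/${INDEX}/_close" >/dev/null
  curl -s -u "$AUTH" -X PUT "$ES/${INDEX}/_settings" \
    -H 'Content-Type: application/json' -d "$SETTINGS" >/dev/null
  curl -s -u "$AUTH" -X POST "$ES/${INDEX}/_open" >/dev/null
  # 逐个确认 similarity 已生效
  curl -s -u "$AUTH" "$ES/${INDEX}/_settings?filter_path=**.similarity"
  echo
done
```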
| 63 | 117 | --- |
| 64 | 118 | |
| 65 | 119 | ### 快速排查路径 |
| ... | ... | @@ -93,7 +147,7 @@ systemctl / df / 日志 → 系统层验证 |
| 93 | 147 | |
| 94 | 148 | #### 查询指定 spu_id 的商品(返回 title) |
| 95 | 149 | ```bash |
| 96 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 150 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 97 | 151 | "size": 11, |
| 98 | 152 | "_source": ["title"], |
| 99 | 153 | "query": { |
| ... | ... | @@ -108,7 +162,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te |
| 108 | 162 | |
| 109 | 163 | #### 查询所有商品(返回 title) |
| 110 | 164 | ```bash |
| 111 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 165 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 112 | 166 | "size": 100, |
| 113 | 167 | "_source": ["title"], |
| 114 | 168 | "query": { |
| ... | ... | @@ -119,7 +173,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te |
| 119 | 173 | |
| 120 | 174 | #### 查询指定 spu_id 的商品(返回 title、keywords、tags) |
| 121 | 175 | ```bash |
| 122 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 176 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 123 | 177 | "size": 5, |
| 124 | 178 | "_source": ["title", "keywords", "tags"], |
| 125 | 179 | "query": { |
| ... | ... | @@ -134,7 +188,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te |
| 134 | 188 | |
| 135 | 189 | #### 组合查询:匹配标题 + 过滤标签 |
| 136 | 190 | ```bash |
| 137 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 191 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 138 | 192 | "size": 1, |
| 139 | 193 | "_source": ["title", "keywords", "tags"], |
| 140 | 194 | "query": { |
| ... | ... | @@ -158,7 +212,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te |
| 158 | 212 | |
| 159 | 213 | #### 组合查询:匹配标题 + 过滤租户(冗余示例) |
| 160 | 214 | ```bash |
| 161 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 215 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 162 | 216 | "size": 1, |
| 163 | 217 | "_source": ["title"], |
| 164 | 218 | "query": { |
| ... | ... | @@ -186,7 +240,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te |
| 186 | 240 | |
| 187 | 241 | #### 测试 index_ik 分析器 |
| 188 | 242 | ```bash |
| 189 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_analyze' -H 'Content-Type: application/json' -d '{ | |
| 243 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_analyze' -H 'Content-Type: application/json' -d '{ | |
| 190 | 244 | "analyzer": "index_ik", |
| 191 | 245 | "text": "14寸第4代-眼珠实身冰雪公仔带手动大推车,搪胶雪宝宝" |
| 192 | 246 | }' |
| ... | ... | @@ -194,7 +248,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te |
| 194 | 248 | |
| 195 | 249 | #### 测试 query_ik 分析器 |
| 196 | 250 | ```bash |
| 197 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_analyze' -H 'Content-Type: application/json' -d '{ | |
| 251 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_analyze' -H 'Content-Type: application/json' -d '{ | |
| 198 | 252 | "analyzer": "query_ik", |
| 199 | 253 | "text": "14寸第4代-眼珠实身冰雪公仔带手动大推车,搪胶雪宝宝" |
| 200 | 254 | }' |
| ... | ... | @@ -206,7 +260,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te |
| 206 | 260 | |
| 207 | 261 | #### 多字段匹配 + 聚合(category1、color、size、material) |
| 208 | 262 | ```bash |
| 209 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 263 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 210 | 264 | "size": 1, |
| 211 | 265 | "from": 0, |
| 212 | 266 | "query": { |
| ... | ... | @@ -316,7 +370,7 @@ GET /search_products_tenant_2/_search |
| 316 | 370 | |
| 317 | 371 | #### 按 spu_id 查询(通用索引) |
| 318 | 372 | ```bash |
| 319 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 373 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 320 | 374 | "size": 5, |
| 321 | 375 | "query": { |
| 322 | 376 | "bool": { |
| ... | ... | @@ -333,7 +387,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products/_s |
| 333 | 387 | ### 5. 统计租户总文档数 |
| 334 | 388 | |
| 335 | 389 | ```bash |
| 336 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_count?pretty' -H 'Content-Type: application/json' -d '{ | |
| 390 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_170/_count?pretty' -H 'Content-Type: application/json' -d '{ | |
| 337 | 391 | "query": { |
| 338 | 392 | "match_all": {} |
| 339 | 393 | } |
| ... | ... | @@ -348,7 +402,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te |
| 348 | 402 | |
| 349 | 403 | #### 1.1 查询特定租户的商品,显示分面相关字段 |
| 350 | 404 | ```bash |
| 351 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 405 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 352 | 406 | "query": { |
| 353 | 407 | "term": { "tenant_id": "162" } |
| 354 | 408 | }, |
| ... | ... | @@ -363,7 +417,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te |
| 363 | 417 | |
| 364 | 418 | #### 1.2 验证 category1_name 字段是否有数据 |
| 365 | 419 | ```bash |
| 366 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 420 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 367 | 421 | "query": { |
| 368 | 422 | "bool": { |
| 369 | 423 | "filter": [ |
| ... | ... | @@ -378,7 +432,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te |
| 378 | 432 | |
| 379 | 433 | #### 1.3 验证 specifications 字段是否有数据 |
| 380 | 434 | ```bash |
| 381 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 435 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 382 | 436 | "query": { |
| 383 | 437 | "bool": { |
| 384 | 438 | "filter": [ |
| ... | ... | @@ -397,7 +451,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te |
| 397 | 451 | |
| 398 | 452 | #### 2.1 category1_name 分面聚合 |
| 399 | 453 | ```bash |
| 400 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 454 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 401 | 455 | "query": { "match_all": {} }, |
| 402 | 456 | "size": 0, |
| 403 | 457 | "aggs": { |
| ... | ... | @@ -410,7 +464,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te |
| 410 | 464 | |
| 411 | 465 | #### 2.2 specifications.color 分面聚合 |
| 412 | 466 | ```bash |
| 413 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 467 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 414 | 468 | "query": { "match_all": {} }, |
| 415 | 469 | "size": 0, |
| 416 | 470 | "aggs": { |
| ... | ... | @@ -431,7 +485,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te |
| 431 | 485 | |
| 432 | 486 | #### 2.3 specifications.size 分面聚合 |
| 433 | 487 | ```bash |
| 434 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 488 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 435 | 489 | "query": { "match_all": {} }, |
| 436 | 490 | "size": 0, |
| 437 | 491 | "aggs": { |
| ... | ... | @@ -452,7 +506,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te |
| 452 | 506 | |
| 453 | 507 | #### 2.4 specifications.material 分面聚合 |
| 454 | 508 | ```bash |
| 455 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 509 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 456 | 510 | "query": { "match_all": {} }, |
| 457 | 511 | "size": 0, |
| 458 | 512 | "aggs": { |
| ... | ... | @@ -473,7 +527,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te |
| 473 | 527 | |
| 474 | 528 | #### 2.5 综合分面聚合(category + color + size + material) |
| 475 | 529 | ```bash |
| 476 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 530 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 477 | 531 | "query": { "match_all": {} }, |
| 478 | 532 | "size": 0, |
| 479 | 533 | "aggs": { |
| ... | ... | @@ -515,7 +569,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te |
| 515 | 569 | |
| 516 | 570 | #### 3.1 查看 specifications 的 name 字段有哪些值 |
| 517 | 571 | ```bash |
| 518 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 572 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 519 | 573 | "query": { "term": { "tenant_id": "162" } }, |
| 520 | 574 | "size": 0, |
| 521 | 575 | "aggs": { |
| ... | ... | @@ -531,7 +585,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products/_s |
| 531 | 585 | |
| 532 | 586 | #### 3.2 查看某个商品的完整 specifications 数据 |
| 533 | 587 | ```bash |
| 534 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 588 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 535 | 589 | "query": { |
| 536 | 590 | "bool": { |
| 537 | 591 | "filter": [ |
| ... | ... | @@ -552,7 +606,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products/_s |
| 552 | 606 | **keyword 精确匹配**(示例词:中文 `法式风格`,英文 `long skirt`) |
| 553 | 607 | |
| 554 | 608 | ```bash |
| 555 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 609 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 556 | 610 | "size": 1, |
| 557 | 611 | "_source": ["spu_id", "title", "enriched_attributes"], |
| 558 | 612 | "query": { |
| ... | ... | @@ -575,7 +629,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te |
| 575 | 629 | **text 全文匹配**(经 `index_ik` / `english` 分词;可与上式对照) |
| 576 | 630 | |
| 577 | 631 | ```bash |
| 578 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 632 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 579 | 633 | "size": 1, |
| 580 | 634 | "_source": ["spu_id", "title", "enriched_attributes"], |
| 581 | 635 | "query": { |
| ... | ... | @@ -602,7 +656,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te |
| 602 | 656 | **keyword 精确匹配** |
| 603 | 657 | |
| 604 | 658 | ```bash |
| 605 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 659 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 606 | 660 | "size": 1, |
| 607 | 661 | "_source": ["spu_id", "title", "option1_values"], |
| 608 | 662 | "query": { |
| ... | ... | @@ -620,7 +674,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te |
| 620 | 674 | **text 全文匹配** |
| 621 | 675 | |
| 622 | 676 | ```bash |
| 623 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 677 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 624 | 678 | "size": 1, |
| 625 | 679 | "_source": ["spu_id", "title", "option1_values"], |
| 626 | 680 | "query": { |
| ... | ... | @@ -640,7 +694,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te |
| 640 | 694 | **keyword 精确匹配** |
| 641 | 695 | |
| 642 | 696 | ```bash |
| 643 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 697 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 644 | 698 | "size": 1, |
| 645 | 699 | "_source": ["spu_id", "title", "enriched_tags"], |
| 646 | 700 | "query": { |
| ... | ... | @@ -658,7 +712,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te |
| 658 | 712 | **text 全文匹配** |
| 659 | 713 | |
| 660 | 714 | ```bash |
| 661 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 715 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 662 | 716 | "size": 1, |
| 663 | 717 | "_source": ["spu_id", "title", "enriched_tags"], |
| 664 | 718 | "query": { |
| ... | ... | @@ -678,7 +732,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te |
| 678 | 732 | > `specifications` 为 **nested**,`value_keyword` 为整词匹配;`value_text.*` 可同时 `term` 子字段或 `match` 主 text。 |
| 679 | 733 | |
| 680 | 734 | ```bash |
| 681 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 735 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 682 | 736 | "size": 1, |
| 683 | 737 | "_source": ["spu_id", "title", "specifications"], |
| 684 | 738 | "query": { |
| ... | ... | @@ -710,7 +764,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te |
| 710 | 764 | |
| 711 | 765 | #### 4.1 统计有 category1_name 的文档数量 |
| 712 | 766 | ```bash |
| 713 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_count?pretty' -H 'Content-Type: application/json' -d '{ | |
| 767 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_count?pretty' -H 'Content-Type: application/json' -d '{ | |
| 714 | 768 | "query": { |
| 715 | 769 | "bool": { |
| 716 | 770 | "filter": [ |
| ... | ... | @@ -723,7 +777,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te |
| 723 | 777 | |
| 724 | 778 | #### 4.2 统计有 specifications 的文档数量 |
| 725 | 779 | ```bash |
| 726 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_count?pretty' -H 'Content-Type: application/json' -d '{ | |
| 780 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_count?pretty' -H 'Content-Type: application/json' -d '{ | |
| 727 | 781 | "query": { |
| 728 | 782 | "bool": { |
| 729 | 783 | "filter": [ |
| ... | ... | @@ -740,7 +794,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te |
| 740 | 794 | |
| 741 | 795 | #### 5.1 查找没有 category1_name 但有 category 的文档(MySQL 有数据但 ES 没有) |
| 742 | 796 | ```bash |
| 743 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 797 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 744 | 798 | "query": { |
| 745 | 799 | "bool": { |
| 746 | 800 | "filter": [ |
| ... | ... | @@ -758,7 +812,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_te |
| 758 | 812 | |
| 759 | 813 | #### 5.2 查找有 option 但没有 specifications 的文档(数据转换问题) |
| 760 | 814 | ```bash |
| 761 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 815 | +curl -u "$ES_AUTH" -X GET 'http://localhost:9200/search_products_tenant_163/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 762 | 816 | "query": { |
| 763 | 817 | "bool": { |
| 764 | 818 | "filter": [ |
| ... | ... | @@ -814,7 +868,7 @@ GET search_products_tenant_163/_mapping |
| 814 | 868 | GET search_products_tenant_163/_field_caps?fields=* |
| 815 | 869 | |
| 816 | 870 | ```bash |
| 817 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X POST \ | |
| 871 | +curl -u "$ES_AUTH" -X POST \ | |
| 818 | 872 | 'http://localhost:9200/search_products_tenant_163/_count' \ |
| 819 | 873 | -H 'Content-Type: application/json' \ |
| 820 | 874 | -d '{ |
| ... | ... | @@ -827,7 +881,7 @@ curl -u 'saas:4hOaLaf41y2VuI8y' -X POST \ |
| 827 | 881 | } |
| 828 | 882 | }' |
| 829 | 883 | |
| 830 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X POST \ | |
| 884 | +curl -u "$ES_AUTH" -X POST \ | |
| 831 | 885 | 'http://localhost:9200/search_products_tenant_163/_count' \ |
| 832 | 886 | -H 'Content-Type: application/json' \ |
| 833 | 887 | -d '{ | ... | ... |
docs/性能测试报告.md
| ... | ... | @@ -18,13 +18,13 @@ |
| 18 | 18 | |
| 19 | 19 | 执行方式: |
| 20 | 20 | - 每组压测持续 `20s` |
| 21 | -- 使用统一脚本 `scripts/perf_api_benchmark.py` | |
| 21 | +- 使用统一脚本 `benchmarks/perf_api_benchmark.py` | |
| 22 | 22 | - 通过 `--scenario` 多值 + `--concurrency-list` 一次性跑完 `场景 x 并发` |
| 23 | 23 | |
| 24 | 24 | ## 3. 压测工具优化说明(复用现有脚本) |
| 25 | 25 | |
| 26 | 26 | 为了解决原脚本“一次只能跑一个场景+一个并发”的可用性问题,本次直接扩展现有脚本: |
| 27 | -- `scripts/perf_api_benchmark.py` | |
| 27 | +- `benchmarks/perf_api_benchmark.py` | |
| 28 | 28 | |
| 29 | 29 | 能力: |
| 30 | 30 | - 一条命令执行 `场景列表 x 并发列表` 全矩阵 |
| ... | ... | @@ -33,7 +33,7 @@ |
| 33 | 33 | 示例: |
| 34 | 34 | |
| 35 | 35 | ```bash |
| 36 | -.venv/bin/python scripts/perf_api_benchmark.py \ | |
| 36 | +.venv/bin/python benchmarks/perf_api_benchmark.py \ | |
| 37 | 37 | --scenario backend_search,backend_suggest,embed_text,rerank \ |
| 38 | 38 | --concurrency-list 1,5,10,20 \ |
| 39 | 39 | --duration 20 \ |
| ... | ... | @@ -106,7 +106,7 @@ curl -sS http://127.0.0.1:6007/health |
| 106 | 106 | |
| 107 | 107 | ```bash |
| 108 | 108 | cd /data/saas-search |
| 109 | -.venv/bin/python scripts/perf_api_benchmark.py \ | |
| 109 | +.venv/bin/python benchmarks/perf_api_benchmark.py \ | |
| 110 | 110 | --scenario backend_search,backend_suggest,embed_text,rerank \ |
| 111 | 111 | --concurrency-list 1,5,10,20 \ |
| 112 | 112 | --duration 20 \ |
| ... | ... | @@ -164,7 +164,7 @@ cd /data/saas-search |
| 164 | 164 | 复现命令: |
| 165 | 165 | |
| 166 | 166 | ```bash |
| 167 | -.venv/bin/python scripts/perf_api_benchmark.py \ | |
| 167 | +.venv/bin/python benchmarks/perf_api_benchmark.py \ | |
| 168 | 168 | --scenario rerank \ |
| 169 | 169 | --duration 20 \ |
| 170 | 170 | --concurrency-list 1,5,10,20 \ |
| ... | ... | @@ -237,7 +237,7 @@ cd /data/saas-search |
| 237 | 237 | - 使用项目虚拟环境执行: |
| 238 | 238 | |
| 239 | 239 | ```bash |
| 240 | -.venv/bin/python scripts/perf_api_benchmark.py -h | |
| 240 | +.venv/bin/python benchmarks/perf_api_benchmark.py -h | |
| 241 | 241 | ``` |
| 242 | 242 | |
| 243 | 243 | ### 10.3 某场景成功率下降 |
| ... | ... | @@ -249,7 +249,7 @@ cd /data/saas-search |
| 249 | 249 | |
| 250 | 250 | ## 11. 关联文件 |
| 251 | 251 | |
| 252 | -- 压测脚本:`scripts/perf_api_benchmark.py` | |
| 252 | +- 压测脚本:`benchmarks/perf_api_benchmark.py` | |
| 253 | 253 | - 本次结果:`perf_reports/2026-03-12/perf_matrix_report.json` |
| 254 | 254 | - Search 多租户补测:`perf_reports/2026-03-12/search_tenant_matrix/` |
| 255 | 255 | - Reranker 386 docs 口径补测:`perf_reports/2026-03-12/rerank_realistic/rerank_386docs.json` |
| ... | ... | @@ -280,7 +280,7 @@ cd /data/saas-search |
| 280 | 280 | cd /data/saas-search |
| 281 | 281 | mkdir -p perf_reports/2026-03-12/search_tenant_matrix |
| 282 | 282 | for t in 0 1 2 3 4; do |
| 283 | - .venv/bin/python scripts/perf_api_benchmark.py \ | |
| 283 | + .venv/bin/python benchmarks/perf_api_benchmark.py \ | |
| 284 | 284 | --scenario backend_search \ |
| 285 | 285 | --concurrency-list 1,5,10,20 \ |
| 286 | 286 | --duration 20 \ | ... | ... |
docs/搜索API对接指南-00-总览与快速开始.md
| ... | ... | @@ -90,7 +90,7 @@ curl -X POST "http://43.166.252.75:6002/search/" \ |
| 90 | 90 | | 查询文档 | POST | `/indexer/documents` | 查询SPU文档数据(不写入ES) | |
| 91 | 91 | | 构建ES文档(正式对接) | POST | `/indexer/build-docs` | 基于上游提供的 MySQL 行数据构建 ES doc,不写入 ES,供 Java 等调用后自行写入 | |
| 92 | 92 | | 构建ES文档(测试用) | POST | `/indexer/build-docs-from-db` | 仅在测试/调试时使用,根据 `tenant_id + spu_ids` 内部查库并构建 ES doc | |
| 93 | -| 内容理解字段生成 | POST | `/indexer/enrich-content` | 根据商品标题批量生成 qanchors、enriched_attributes、tags,供微服务组合方式使用 | | |
| 93 | +| 内容理解字段生成 | POST | `/indexer/enrich-content` | 根据商品标题批量生成 qanchors、enriched_attributes、tags,供微服务组合方式使用(独立服务端口 6001) | | |
| 94 | 94 | | 索引健康检查 | GET | `/indexer/health` | 检查索引服务状态 | |
| 95 | 95 | | 健康检查 | GET | `/admin/health` | 服务健康检查 | |
| 96 | 96 | | 获取配置 | GET | `/admin/config` | 获取租户配置 | |
| ... | ... | @@ -104,7 +104,6 @@ curl -X POST "http://43.166.252.75:6002/search/" \ |
| 104 | 104 | | 向量服务(图片) | 6008 | `POST /embed/image` | 图片向量化 | |
| 105 | 105 | | 翻译服务 | 6006 | `POST /translate` | 文本翻译(支持 qwen-mt / llm / deepl / 本地模型) | |
| 106 | 106 | | 重排服务 | 6007 | `POST /rerank` | 检索结果重排 | |
| 107 | -| 内容理解(Indexer 内) | 6004 | `POST /indexer/enrich-content` | 根据商品标题生成 qanchors、tags 等,供 indexer 微服务组合方式使用 | | |
| 107 | +| 内容理解(独立服务) | 6001 | `POST /indexer/enrich-content` | 根据商品标题生成 qanchors、tags 等,供 indexer 微服务组合方式使用 | | |
| 108 | 108 | |
| 109 | 109 | --- |
| 110 | - | ... | ... |
docs/搜索API对接指南-05-索引接口(Indexer).md
| ... | ... | @@ -13,7 +13,7 @@ |
| 13 | 13 | | 查询文档 | POST | `/indexer/documents` | 按 SPU ID 列表查询 ES 文档,不写入 ES | |
| 14 | 14 | | 构建 ES 文档(正式) | POST | `/indexer/build-docs` | 由上游提供 MySQL 行数据,返回 ES-ready 文档,不写 ES | |
| 15 | 15 | | 构建 ES 文档(测试) | POST | `/indexer/build-docs-from-db` | 由本服务查库并构建文档,仅测试/调试用 | |
| 16 | -| 内容理解字段生成 | POST | `/indexer/enrich-content` | 根据商品标题批量生成 qanchors、enriched_attributes、tags(供微服务组合方式使用) | | |
| 16 | +| 内容理解字段生成 | POST | `/indexer/enrich-content` | 根据商品标题批量生成 qanchors、enriched_attributes、tags(供微服务组合方式使用;独立服务端口 6001) | | |
| 17 | 17 | | 索引健康检查 | GET | `/indexer/health` | 检查索引服务与数据库连接状态 | |
| 18 | 18 | |
| 19 | 19 | #### 5.0 支撑外部 indexer 的三种方式 |
| ... | ... | @@ -23,7 +23,7 @@ |
| 23 | 23 | | 方式 | 说明 | 适用场景 | |
| 24 | 24 | |------|------|----------| |
| 25 | 25 | | **1)doc 填充接口** | 调用 `POST /indexer/build-docs` 或 `POST /indexer/build-docs-from-db`,由本服务基于 MySQL 行数据构建完整 ES 文档(含多语言、向量、规格等),**不写入 ES**,由调用方自行写入。 | 希望一站式拿到 ES-ready doc,由己方控制写 ES 的时机与索引名。 | |
| 26 | -| **2)微服务组合** | 单独调用**翻译**、**向量化**、**内容理解字段生成**等接口,由 indexer 程序自己组装 doc 并写入 ES。翻译与向量化为独立微服务(见第 7 节);内容理解为 Indexer 服务内接口 `POST /indexer/enrich-content`。 | 需要灵活编排、或希望将 LLM/向量等耗时步骤与主链路解耦(如异步补齐 qanchors/tags)。 | | |
| 26 | +| **2)微服务组合** | 单独调用**翻译**、**向量化**、**内容理解字段生成**等接口,由 indexer 程序自己组装 doc 并写入 ES。翻译与向量化为独立微服务(见第 7 节);内容理解为 FacetAwareMatching 独立服务接口 `POST /indexer/enrich-content`(端口 6001)。 | 需要灵活编排、或希望将 LLM/向量等耗时步骤与主链路解耦(如异步补齐 qanchors/tags)。 | | |
| 27 | 27 | | **3)本服务直接写 ES** | 调用全量索引 `POST /indexer/reindex`、增量索引 `POST /indexer/index`(指定 SPU ID 列表),由本服务从 MySQL 拉数并直接写入 ES。 | 自建运维、联调或不需要由 Java 写 ES 的场景。 | |
| 28 | 28 | |
| 29 | 29 | - **方式 1** 与 **方式 2** 下,ES 的写入方均为外部 indexer(或 Java),职责清晰。 |
| ... | ... | @@ -498,7 +498,7 @@ curl -X GET "http://localhost:6004/indexer/health" |
| 498 | 498 | |
| 499 | 499 | #### 请求示例(完整 curl) |
| 500 | 500 | |
| 501 | -> 完整请求体参考 `scripts/test_build_docs_api.py` 中的 `build_sample_request()`。 | |
| 501 | +> 完整请求体参考 `tests/manual/test_build_docs_api.py` 中的 `build_sample_request()`。 | |
| 502 | 502 | |
| 503 | 503 | ```bash |
| 504 | 504 | # 单条 SPU 示例(含 spu、skus、options) |
| ... | ... | @@ -648,13 +648,38 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \ |
| 648 | 648 | ### 5.8 内容理解字段生成接口 |
| 649 | 649 | |
| 650 | 650 | - **端点**: `POST /indexer/enrich-content` |
| 651 | -- **描述**: 根据商品内容信息批量生成 **qanchors**(锚文本)、**enriched_attributes**(语义属性)、**enriched_tags**(细分标签),供外部 indexer 在「微服务组合」方式下自行拼装 doc 时使用。请求以 `items[]` 传入商品内容字段(必填/可选见下表)。接口只暴露商品内容输入,语言选择、分析维度与最终字段结构统一由 `indexer.product_enrich` 内部决定;当前返回结果与 `search_products` mapping 保持一致。单次请求在线程池中执行,避免阻塞其他接口。 | |
| 651 | +- **服务**: FacetAwareMatching 独立服务(默认端口 **6001**;由 `/data/FacetAwareMatching/scripts/service_ctl.sh` 管理) | |
| 652 | +- **描述**: 根据商品内容信息批量生成 **qanchors**(锚文本)、**enriched_attributes**(通用语义属性)、**enriched_tags**(细分标签)、**enriched_taxonomy_attributes**(taxonomy 结构化属性),供外部 indexer 在「微服务组合」方式下自行拼装 doc 时使用。请求以 `items[]` 传入商品内容字段(必填/可选见下表)。接口只暴露商品内容输入,语言选择、分析维度与最终字段结构统一由 FacetAwareMatching 的 `product_enrich` 内部决定;当前返回结果与 `search_products` mapping 保持一致。单次请求在线程池中执行,避免阻塞其他接口。 | |
| 653 | + | |
| 654 | +当前支持的 `category_taxonomy_profile`: | |
| 655 | +- `apparel` | |
| 656 | +- `3c` | |
| 657 | +- `bags` | |
| 658 | +- `pet_supplies` | |
| 659 | +- `electronics` | |
| 660 | +- `outdoor` | |
| 661 | +- `home_appliances` | |
| 662 | +- `home_living` | |
| 663 | +- `wigs` | |
| 664 | +- `beauty` | |
| 665 | +- `accessories` | |
| 666 | +- `toys` | |
| 667 | +- `shoes` | |
| 668 | +- `sports` | |
| 669 | +- `others` | |
| 670 | + | |
| 671 | +说明: | |
| 672 | +- 所有 profile 的 `enriched_taxonomy_attributes.value` 都统一返回 `zh` + `en`。 | |
| 673 | +- 外部调用 `/indexer/enrich-content` 时,以请求中的 `category_taxonomy_profile` 为准。 | |
| 674 | +- 若 indexer 内部仍接入内容理解能力,taxonomy profile 请在调用侧显式传入(建议仍以租户行业配置为准)。 | |
| 652 | 675 | |
| 653 | 676 | #### 请求参数 |
| 654 | 677 | |
| 655 | 678 | ```json |
| 656 | 679 | { |
| 657 | 680 | "tenant_id": "170", |
| 681 | + "enrichment_scopes": ["generic", "category_taxonomy"], | |
| 682 | + "category_taxonomy_profile": "apparel", | |
| 658 | 683 | "items": [ |
| 659 | 684 | { |
| 660 | 685 | "spu_id": "223167", |
| ... | ... | @@ -675,6 +700,8 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \ |
| 675 | 700 | | 参数 | 类型 | 必填 | 默认值 | 说明 | |
| 676 | 701 | |------|------|------|--------|------| |
| 677 | 702 | | `tenant_id` | string | Y | - | 租户 ID。目前仅用于记录日志,不产生实际作用| |
| 703 | +| `enrichment_scopes` | array[string] | N | `["generic", "category_taxonomy"]` | 选择要执行的增强范围。`generic` 生成 `qanchors`/`enriched_tags`/`enriched_attributes`,`category_taxonomy` 生成 `enriched_taxonomy_attributes` | | |
| 704 | +| `category_taxonomy_profile` | string | N | `apparel` | 品类 taxonomy profile。支持:`apparel`、`3c`、`bags`、`pet_supplies`、`electronics`、`outdoor`、`home_appliances`、`home_living`、`wigs`、`beauty`、`accessories`、`toys`、`shoes`、`sports`、`others` | | |
| 678 | 705 | | `items` | array | Y | - | 待分析列表;**单次最多 50 条** | |
| 679 | 706 | |
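一个端到端的调用示例如下(端口 6001 与各参数含义见上文;商品内容字段为示意数据,profile 以 `toys` 为例,`enrichment_scopes` 省略时使用默认值):

```bash
# 调用 FacetAwareMatching 独立服务生成内容理解字段(单条 item 示例)
curl -s -X POST "http://127.0.0.1:6001/indexer/enrich-content" \
  -H 'Content-Type: application/json' \
  -d '{
    "tenant_id": "170",
    "enrichment_scopes": ["generic", "category_taxonomy"],
    "category_taxonomy_profile": "toys",
    "items": [
      {
        "spu_id": "223167",
        "title": "遥控喷雾翻滚多功能车玩具车",
        "brief": "遥控喷雾特技车"
      }
    ]
  }'
```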
| 680 | 707 | `items[]` 字段说明: |
| ... | ... | @@ -683,21 +710,24 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \ |
| 683 | 710 | |------|------|------|------| |
| 684 | 711 | | `spu_id` | string | Y | SPU ID,用于回填结果;目前仅用于记录日志,不产生实际作用 | |
| 685 | 712 | | `title` | string | Y | 商品标题 | |
| 686 | -| `image_url` | string | N | 商品主图 URL;当前会参与内容缓存键,后续可用于图像/多模态内容理解 | | |
| 687 | -| `brief` | string | N | 商品简介/短描述;当前会参与内容缓存键 | | |
| 688 | -| `description` | string | N | 商品详情/长描述;当前会参与内容缓存键 | | |
| 713 | +| `image_url` | string | N | 商品主图 URL;当前仅透传,暂未参与 prompt 与缓存键,后续可用于图像/多模态内容理解 | | |
| 714 | +| `brief` | string | N | 商品简介/短描述;当前会参与 prompt 与缓存键 | | |
| 715 | +| `description` | string | N | 商品详情/长描述;当前会参与 prompt 与缓存键 | | |
| 689 | 716 | |
| 690 | 717 | 缓存说明: |
| 691 | 718 | |
| 692 | -- 内容缓存键仅由 `target_lang + items[]` 中会影响内容理解结果的输入文本构成,目前包括:`title`、`brief`、`description`、`image_url` 的规范化内容 hash。 | |
| 719 | +- 内容缓存按 **增强范围 + taxonomy profile** 拆分;`generic` 与 `category_taxonomy:apparel` 等使用不同缓存命名空间,互不污染、可独立演进。 | |
| 720 | +- 缓存键由 `analysis_kind + target_lang + prompt/schema 版本指纹 + prompt 输入文本 hash` 构成;对 category taxonomy 来说,profile 会进入 schema 标识与版本指纹。 | |
| 721 | +- 当前真正参与 prompt 输入的字段是:`title`、`brief`、`description`;这些字段任一变化,都会落到新的缓存 key。 | |
| 722 | +- `prompt/schema 版本指纹` 会综合 system prompt、shared instruction、localized table headers、result fields、user instruction template 等信息生成;因此只要提示词或输出契约变化,旧缓存会自然失效。 | |
| 693 | 723 | - `tenant_id`、`spu_id` 只用于请求归属与结果回填,不参与缓存键。 |
| 694 | -- 因此,输入内容不变时可跨请求直接命中缓存;任一输入字段变化时,会自然落到新的缓存 key。 | |
| 724 | +- 因此,输入内容与 prompt 契约都不变时可跨请求直接命中缓存;任意一侧变化,都会自然落到新的缓存 key。 | |
| 695 | 725 | |
| 696 | 726 | 语言说明: |
| 697 | 727 | |
| 698 | 728 | - 接口不接受语言控制参数。 |
| 699 | 729 | - 返回哪些语言、返回哪些语义维度,统一由 `indexer.product_enrich` 内部逻辑决定。 |
| 700 | -- 当前为了与 `search_products` mapping 对齐,返回结果只包含核心索引语言 `zh`、`en`。 | |
| 730 | +- 当前为了与 `search_products` mapping 对齐,通用增强字段与 taxonomy 字段都统一只返回核心索引语言 `zh`、`en`。 | |
| 701 | 731 | |
| 702 | 732 | 批量请求建议: |
| 703 | 733 | - **全量**:强烈建议尽可能按 **20 个 SPU/doc** 攒成一个批次后再请求一次。 |
| ... | ... | @@ -709,6 +739,8 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \ |
| 709 | 739 | ```json |
| 710 | 740 | { |
| 711 | 741 | "tenant_id": "170", |
| 742 | + "enrichment_scopes": ["generic", "category_taxonomy"], | |
| 743 | + "category_taxonomy_profile": "apparel", | |
| 712 | 744 | "total": 2, |
| 713 | 745 | "results": [ |
| 714 | 746 | { |
| ... | ... | @@ -725,6 +757,11 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \ |
| 725 | 757 | { "name": "enriched_tags", "value": { "zh": "纯棉" } }, |
| 726 | 758 | { "name": "usage_scene", "value": { "zh": "日常" } }, |
| 727 | 759 | { "name": "enriched_tags", "value": { "en": "cotton" } } |
| 760 | + ], | |
| 761 | + "enriched_taxonomy_attributes": [ | |
| 762 | + { "name": "Product Type", "value": { "zh": ["T恤"], "en": ["t-shirt"] } }, | |
| 763 | + { "name": "Target Gender", "value": { "zh": ["男"], "en": ["men"] } }, | |
| 764 | + { "name": "Season", "value": { "zh": ["夏季"], "en": ["summer"] } } | |
| 728 | 765 | ] |
| 729 | 766 | }, |
| 730 | 767 | { |
| ... | ... | @@ -735,7 +772,8 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \ |
| 735 | 772 | "enriched_tags": { |
| 736 | 773 | "en": ["dolls", "toys"] |
| 737 | 774 | }, |
| 738 | - "enriched_attributes": [] | |
| 775 | + "enriched_attributes": [], | |
| 776 | + "enriched_taxonomy_attributes": [] | |
| 739 | 777 | } |
| 740 | 778 | ] |
| 741 | 779 | } |
| ... | ... | @@ -743,10 +781,13 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \ |
| 743 | 781 | |
| 744 | 782 | | 字段 | 类型 | 说明 | |
| 745 | 783 | |------|------|------| |
| 746 | -| `results` | array | 与请求 `items` 一一对应,每项含 `spu_id`、`qanchors`、`enriched_attributes`、`enriched_tags` | | |
| 784 | +| `enrichment_scopes` | array | 实际执行的增强范围列表 | | |
| 785 | +| `category_taxonomy_profile` | string | 实际使用的品类 taxonomy profile | | |
| 786 | +| `results` | array | 与请求 `items` 一一对应,每项含 `spu_id`、`qanchors`、`enriched_attributes`、`enriched_tags`、`enriched_taxonomy_attributes` | | |
| 747 | 787 | | `results[].qanchors` | object | 与 ES `qanchors` 字段同结构,按语言键返回短语数组 | |
| 748 | 788 | | `results[].enriched_tags` | object | 与 ES `enriched_tags` 字段同结构,按语言键返回标签数组 | |
| 749 | 789 | | `results[].enriched_attributes` | array | 与 ES `enriched_attributes` nested 字段同结构,每项为 `{ "name", "value": { "zh"?: "...", "en"?: "..." } }` | |
| 790 | +| `results[].enriched_taxonomy_attributes` | array | 与 ES `enriched_taxonomy_attributes` nested 字段同结构。每项通常为 `{ "name", "value": { "zh"?: [...], "en"?: [...] } }` | | |
| 750 | 791 | | `results[].error` | string | 若该条处理失败(如 LLM 异常),会在此字段返回错误信息 | |
| 751 | 792 | |
| 752 | 793 | **错误响应**: |
| ... | ... | @@ -756,10 +797,12 @@ curl -X POST "http://127.0.0.1:6004/indexer/build-docs-from-db" \ |
| 756 | 797 | #### 请求示例 |
| 757 | 798 | |
| 758 | 799 | ```bash |
| 759 | -curl -X POST "http://localhost:6004/indexer/enrich-content" \ | |
| 800 | +curl -X POST "http://localhost:6001/indexer/enrich-content" \ | |
| 760 | 801 | -H "Content-Type: application/json" \ |
| 761 | 802 | -d '{ |
| 762 | - "tenant_id": "170", | |
| 803 | + "tenant_id": "163", | |
| 804 | + "enrichment_scopes": ["generic", "category_taxonomy"], | |
| 805 | + "category_taxonomy_profile": "apparel", | |
| 763 | 806 | "items": [ |
| 764 | 807 | { |
| 765 | 808 | "spu_id": "223167", |
| ... | ... | @@ -773,4 +816,3 @@ curl -X POST "http://localhost:6004/indexer/enrich-content" \ |
| 773 | 816 | ``` |
| 774 | 817 | |
| 775 | 818 | --- |
| 776 | - | ... | ... |
docs/搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md
| ... | ... | @@ -444,7 +444,7 @@ curl "http://localhost:6006/health" |
| 444 | 444 | |
| 445 | 445 | - **Base URL**: Indexer 服务地址,如 `http://localhost:6004` |
| 446 | 446 | - **路径**: `POST /indexer/enrich-content` |
| 447 | -- **说明**: 根据商品标题批量生成 `qanchors`、`enriched_attributes`、`tags`,用于拼装 ES 文档。内部使用大模型(需配置 `DASHSCOPE_API_KEY`),支持多语言与 Redis 缓存;单次最多 50 条,建议批量调用以提升效率。 | |
| 447 | +- **说明**: 根据商品标题批量生成 `qanchors`、`enriched_attributes`、`enriched_tags`、`enriched_taxonomy_attributes`,用于拼装 ES 文档。支持通过 `enrichment_scopes` 选择执行 `generic` / `category_taxonomy`,并通过 `category_taxonomy_profile` 选择对应大类的 taxonomy prompt/profile;默认执行 `generic + category_taxonomy(apparel)`。当前支持的 taxonomy profile 包括 `apparel`、`3c`、`bags`、`pet_supplies`、`electronics`、`outdoor`、`home_appliances`、`home_living`、`wigs`、`beauty`、`accessories`、`toys`、`shoes`、`sports`、`others`。所有 profile 的 taxonomy 输出都统一返回 `zh` + `en`,`category_taxonomy_profile` 只决定字段集合。内部使用大模型(需配置 `DASHSCOPE_API_KEY`),支持多语言与 Redis 缓存;单次最多 50 条,建议批量调用以提升效率。 | |
| 448 | 448 | |
| 449 | 449 | 请求/响应格式、示例及错误码见 [-05-索引接口(Indexer)](./搜索API对接指南-05-索引接口(Indexer).md#58-内容理解字段生成接口)。 |
| 450 | 450 | ... | ... |
docs/搜索API对接指南-10-接口级压测脚本.md
| ... | ... | @@ -4,7 +4,7 @@ |
| 4 | 4 | |
| 5 | 5 | ## 10. 接口级压测脚本 |
| 6 | 6 | |
| 7 | -仓库提供统一压测脚本:`scripts/perf_api_benchmark.py`,用于对以下接口做并发压测: | |
| 7 | +仓库提供统一压测脚本:`benchmarks/perf_api_benchmark.py`,用于对以下接口做并发压测: | |
| 8 | 8 | |
| 9 | 9 | - 后端搜索:`POST /search/` |
| 10 | 10 | - 搜索建议:`GET /search/suggestions` |
| ... | ... | @@ -18,21 +18,21 @@ |
| 18 | 18 | |
| 19 | 19 | ```bash |
| 20 | 20 | # suggest 压测(tenant 162) |
| 21 | -python scripts/perf_api_benchmark.py \ | |
| 21 | +python benchmarks/perf_api_benchmark.py \ | |
| 22 | 22 | --scenario backend_suggest \ |
| 23 | 23 | --tenant-id 162 \ |
| 24 | 24 | --duration 30 \ |
| 25 | 25 | --concurrency 50 |
| 26 | 26 | |
| 27 | 27 | # search 压测 |
| 28 | -python scripts/perf_api_benchmark.py \ | |
| 28 | +python benchmarks/perf_api_benchmark.py \ | |
| 29 | 29 | --scenario backend_search \ |
| 30 | 30 | --tenant-id 162 \ |
| 31 | 31 | --duration 30 \ |
| 32 | 32 | --concurrency 20 |
| 33 | 33 | |
| 34 | 34 | # 全链路压测(search + suggest + embedding + translate + rerank) |
| 35 | -python scripts/perf_api_benchmark.py \ | |
| 35 | +python benchmarks/perf_api_benchmark.py \ | |
| 36 | 36 | --scenario all \ |
| 37 | 37 | --tenant-id 162 \ |
| 38 | 38 | --duration 60 \ |
| ... | ... | @@ -45,17 +45,16 @@ python scripts/perf_api_benchmark.py \ |
| 45 | 45 | 可通过 `--cases-file` 覆盖默认请求模板。示例文件: |
| 46 | 46 | |
| 47 | 47 | ```bash |
| 48 | -scripts/perf_cases.json.example | |
| 48 | +benchmarks/perf_cases.json.example | |
| 49 | 49 | ``` |
| 50 | 50 | |
| 51 | 51 | 执行示例: |
| 52 | 52 | |
| 53 | 53 | ```bash |
| 54 | -python scripts/perf_api_benchmark.py \ | |
| 54 | +python benchmarks/perf_api_benchmark.py \ | |
| 55 | 55 | --scenario all \ |
| 56 | 56 | --tenant-id 162 \ |
| 57 | - --cases-file scripts/perf_cases.json.example \ | |
| 57 | + --cases-file benchmarks/perf_cases.json.example \ | |
| 58 | 58 | --duration 60 \ |
| 59 | 59 | --concurrency 40 |
| 60 | 60 | ``` |
| 61 | - | ... | ... |
docs/相关性检索优化说明.md
| ... | ... | @@ -330,7 +330,7 @@ python -m pytest -q tests/test_rerank_client.py tests/test_es_query_builder.py t |
| 330 | 330 | ./scripts/service_ctl.sh restart backend |
| 331 | 331 | sleep 3 |
| 332 | 332 | ./scripts/service_ctl.sh status backend |
| 333 | -./scripts/evaluation/start_eval.sh.sh batch | |
| 333 | +./scripts/evaluation/start_eval.sh batch | |
| 334 | 334 | ``` |
| 335 | 335 | |
| 336 | 336 | 评估产物在 `artifacts/search_evaluation/`(如 `search_eval.sqlite3`、`batch_reports/` 下的 JSON/Markdown)。流程与参数说明见 [scripts/evaluation/README.md](../scripts/evaluation/README.md)。 |
| ... | ... | @@ -895,4 +895,3 @@ rerank_score:0.4784 |
| 895 | 895 | rerank_score:0.5849 |
| 896 | 896 | "zh": "新款女士修身仿旧牛仔短裤 – 休闲性感磨边水洗牛仔短裤,时尚舒", |
| 897 | 897 | "en": "New Women's Slim-fit Vintage Washed Denim Shorts – Casual Sexy Frayed Hem, Fashionable & Comfortable" |
| 898 | - | ... | ... |
docs/缓存与Redis使用说明.md
| ... | ... | @@ -196,18 +196,25 @@ services: |
| 196 | 196 | - 配置项: |
| 197 | 197 | - `ANCHOR_CACHE_PREFIX = REDIS_CONFIG.get("anchor_cache_prefix", "product_anchors")` |
| 198 | 198 | - `ANCHOR_CACHE_EXPIRE_DAYS = int(REDIS_CONFIG.get("anchor_cache_expire_days", 30))` |
| 199 | -- Key 构造函数:`_make_anchor_cache_key(title, target_lang, tenant_id)` | |
| 199 | +- Key 构造函数:`_make_analysis_cache_key(product, target_lang, analysis_kind)` | |
| 200 | 200 | - 模板: |
| 201 | 201 | |
| 202 | 202 | ```text |
| 203 | -{ANCHOR_CACHE_PREFIX}:{tenant_or_global}:{target_lang}:{md5(title)} | |
| 203 | +{ANCHOR_CACHE_PREFIX}:{analysis_kind}:{prompt_contract_hash}:{target_lang}:{prompt_input_prefix}{md5(prompt_input)} | |
| 204 | 204 | ``` |
| 205 | 205 | |
| 206 | 206 | - 字段说明: |
| 207 | 207 | - `ANCHOR_CACHE_PREFIX`:默认 `"product_anchors"`,可通过 `.env` 中的 `REDIS_ANCHOR_CACHE_PREFIX`(若存在)间接配置到 `REDIS_CONFIG`; |
| 208 | - - `tenant_or_global`:`tenant_id` 去空白后的字符串,若为空则使用 `"global"`; | |
| 208 | + - `analysis_kind`:分析族,目前至少包括 `content` 与 `taxonomy`,两者缓存隔离; | |
| 209 | + - `prompt_contract_hash`:基于 system prompt、shared instruction、localized headers、result fields、user instruction template、schema cache version 等生成的短 hash;只要提示词或输出契约变化,缓存会自动失效; | |
| 209 | 210 | - `target_lang`:内容理解输出语言,例如 `zh`; |
| 210 | - - `md5(title)`:对原始商品标题(UTF-8)做 MD5。 | |
| 211 | + - `prompt_input_prefix + md5(prompt_input)`:对真正送入 prompt 的商品文本做前缀 + MD5;当前 prompt 输入来自 `title`、`brief`、`description` 的规范化拼接结果。 | |
| 212 | + | |
| 213 | +设计原则: | |
| 214 | + | |
| 215 | +- 只让**实际影响 LLM 输出**的输入参与 key; | |
| 216 | +- 不让 `tenant_id`、`spu_id` 这类“结果归属信息”污染缓存; | |
| 217 | +- prompt 或 schema 变更时,不依赖人工清理 Redis,也能自然切换到新 key。 | |
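按上面的模板,key 的拼装逻辑可以草绘如下(仅为示意,函数名与可读前缀长度均为假设,实际实现以 `_make_analysis_cache_key` 为准):

```python
import hashlib

ANCHOR_CACHE_PREFIX = "product_anchors"

def make_analysis_cache_key(
    analysis_kind: str,
    prompt_contract_hash: str,
    target_lang: str,
    prompt_input: str,
) -> str:
    """按文档模板拼 key:{prefix}:{kind}:{contract}:{lang}:{可读前缀}{md5(input)}"""
    digest = hashlib.md5(prompt_input.encode("utf-8")).hexdigest()
    # 取规范化输入的前几个字符作为可读前缀,便于在 redis-cli 里肉眼排查
    input_prefix = prompt_input.strip()[:8]
    return f"{ANCHOR_CACHE_PREFIX}:{analysis_kind}:{prompt_contract_hash}:{target_lang}:{input_prefix}{digest}"
```

由于 `analysis_kind` 与 `prompt_contract_hash` 都参与 key,`content` 与 `taxonomy` 缓存天然隔离,prompt 契约变化时旧 key 也不会被误命中。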
| 211 | 218 | |
| 212 | 219 | ### 4.2 Value 与类型 |
| 213 | 220 | |
| ... | ... | @@ -229,6 +236,7 @@ services: |
| 229 | 236 | ``` |
| 230 | 237 | |
| 231 | 238 | - 读取时通过 `json.loads(raw)` 还原为 `Dict[str, Any]`。 |
| 239 | +- `content` 与 `taxonomy` 的 value 结构会随各自 schema 不同而不同,但都会先通过统一的 normalize 逻辑再写缓存。 | |
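value 的写入/读取可以草绘为如下工具函数(示意实现,函数名为假设;与文中 `json.loads(raw)` 的还原方式一致,并对 bytes 与坏数据做了容错):

```python
import json

def dump_cache_value(value: dict) -> str:
    """写缓存前统一序列化:ensure_ascii=False 保留中文可读性,sort_keys 让同一结果字节稳定。"""
    return json.dumps(value, ensure_ascii=False, sort_keys=True)

def load_cache_value(raw) -> dict:
    """读取时容错:bytes 先按 UTF-8 解码,解析失败或类型不符时视为缓存未命中。"""
    if raw is None:
        return {}
    if isinstance(raw, (bytes, bytearray)):
        raw = raw.decode("utf-8", errors="replace")
    try:
        value = json.loads(raw)
    except (ValueError, TypeError):
        return {}
    return value if isinstance(value, dict) else {}
```

把「解析失败当作未命中」放在读取侧,可以让历史脏数据随 TTL 自然淘汰,而不需要单独的清理任务。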
| 232 | 240 | |
| 233 | 241 | ### 4.3 过期策略 |
| 234 | 242 | ... | ... |
embeddings/README.md
| ... | ... | @@ -98,10 +98,10 @@ |
| 98 | 98 | |
| 99 | 99 | ### 性能与压测(沿用仓库脚本) |
| 100 | 100 | |
| 101 | -- 接口级压测(与 `perf_reports/2026-03-12/matrix_report/` 等方法一致):`scripts/perf_api_benchmark.py` | |
| 102 | - - 示例:`python scripts/perf_api_benchmark.py --scenario embed_text --duration 30 --concurrency 20` | |
| 101 | +- 接口级压测(与 `perf_reports/2026-03-12/matrix_report/` 等方法一致):`benchmarks/perf_api_benchmark.py` | |
| 102 | + - 示例:`python benchmarks/perf_api_benchmark.py --scenario embed_text --duration 30 --concurrency 20` | |
| 103 | 103 | - 文本/图片向量可带 `priority`(与线上 admission 语义一致):`--embed-text-priority 1`、`--embed-image-priority 1` |
| 104 | - - 自定义请求模板:`--cases-file scripts/perf_cases.json.example` | |
| 104 | + - 自定义请求模板:`--cases-file benchmarks/perf_cases.json.example` | |
| 105 | 105 | - 历史矩阵结果与说明见 `perf_reports/2026-03-12/matrix_report/summary.md`。 |
| 106 | 106 | |
| 107 | 107 | ### 启动服务 | ... | ... |
frontend/static/js/app.js
| ... | ... | @@ -316,7 +316,10 @@ async function performSearch(page = 1) { |
| 316 | 316 | document.getElementById('productGrid').innerHTML = ''; |
| 317 | 317 | |
| 318 | 318 | try { |
| 319 | - const response = await fetch(`${API_BASE_URL}/search/`, { | |
| 319 | + const searchUrl = new URL(`${API_BASE_URL}/search/`, window.location.origin); | |
| 320 | + searchUrl.searchParams.set('tenant_id', tenantId); | |
| 321 | + | |
| 322 | + const response = await fetch(searchUrl.toString(), { | |
| 320 | 323 | method: 'POST', |
| 321 | 324 | headers: { |
| 322 | 325 | 'Content-Type': 'application/json', | ... | ... |
indexer/README.md
indexer/document_transformer.py
| ... | ... | @@ -242,6 +242,7 @@ class SPUDocumentTransformer: |
| 242 | 242 | - qanchors.{lang} |
| 243 | 243 | - enriched_tags.{lang} |
| 244 | 244 | - enriched_attributes[].value.{lang} |
| 245 | + - enriched_taxonomy_attributes[].value.{lang} | |
| 245 | 246 | |
| 246 | 247 | 设计目标: |
| 247 | 248 | - 尽可能攒批调用 LLM; |
| ... | ... | @@ -273,7 +274,12 @@ class SPUDocumentTransformer: |
| 273 | 274 | |
| 274 | 275 | tenant_id = str(docs[0].get("tenant_id") or "").strip() or None |
| 275 | 276 | try: |
| 276 | - results = build_index_content_fields(items=items, tenant_id=tenant_id) | |
| 277 | + # TODO: 从数据库读取该 tenant 的真实行业,并据此替换当前默认的 apparel profile。 | |
| 278 | + results = build_index_content_fields( | |
| 279 | + items=items, | |
| 280 | + tenant_id=tenant_id, | |
| 281 | + category_taxonomy_profile="apparel", | |
| 282 | + ) | |
| 277 | 283 | except Exception as e: |
| 278 | 284 | logger.warning("LLM batch attribute fill failed: %s", e) |
| 279 | 285 | return |
| ... | ... | @@ -296,6 +302,8 @@ class SPUDocumentTransformer: |
| 296 | 302 | doc["enriched_tags"] = enrichment["enriched_tags"] |
| 297 | 303 | if enrichment.get("enriched_attributes"): |
| 298 | 304 | doc["enriched_attributes"] = enrichment["enriched_attributes"] |
| 305 | + if enrichment.get("enriched_taxonomy_attributes"): | |
| 306 | + doc["enriched_taxonomy_attributes"] = enrichment["enriched_taxonomy_attributes"] | |
| 299 | 307 | except Exception as e: |
| 300 | 308 | logger.warning("Failed to apply enrichment to doc (spu_id=%s): %s", doc.get("spu_id"), e) |
| 301 | 309 | |
| ... | ... | @@ -666,6 +674,7 @@ class SPUDocumentTransformer: |
| 666 | 674 | |
| 667 | 675 | tenant_id = doc.get("tenant_id") |
| 668 | 676 | try: |
| 677 | + # TODO: 从数据库读取该 tenant 的真实行业,并据此替换当前默认的 apparel profile。 | |
| 669 | 678 | results = build_index_content_fields( |
| 670 | 679 | items=[ |
| 671 | 680 | { |
| ... | ... | @@ -677,6 +686,7 @@ class SPUDocumentTransformer: |
| 677 | 686 | } |
| 678 | 687 | ], |
| 679 | 688 | tenant_id=str(tenant_id), |
| 689 | + category_taxonomy_profile="apparel", | |
| 680 | 690 | ) |
| 681 | 691 | except Exception as e: |
| 682 | 692 | logger.warning("LLM attribute fill failed for SPU %s: %s", spu_id, e) | ... | ... |
indexer/product_enrich.py
| ... | ... | @@ -14,10 +14,11 @@ import time |
| 14 | 14 | import hashlib |
| 15 | 15 | import uuid |
| 16 | 16 | import threading |
| 17 | +from dataclasses import dataclass, field | |
| 17 | 18 | from collections import OrderedDict |
| 18 | 19 | from datetime import datetime |
| 19 | 20 | from concurrent.futures import ThreadPoolExecutor |
| 20 | -from typing import List, Dict, Tuple, Any, Optional | |
| 21 | +from typing import List, Dict, Tuple, Any, Optional, FrozenSet | |
| 21 | 22 | |
| 22 | 23 | import redis |
| 23 | 24 | import requests |
| ... | ... | @@ -30,6 +31,7 @@ from indexer.product_enrich_prompts import ( |
| 30 | 31 | USER_INSTRUCTION_TEMPLATE, |
| 31 | 32 | LANGUAGE_MARKDOWN_TABLE_HEADERS, |
| 32 | 33 | SHARED_ANALYSIS_INSTRUCTION, |
| 34 | + CATEGORY_TAXONOMY_PROFILES, | |
| 33 | 35 | ) |
| 34 | 36 | |
| 35 | 37 | # 配置 |
| ... | ... | @@ -144,10 +146,26 @@ if _missing_prompt_langs: |
| 144 | 146 | ) |
| 145 | 147 | |
| 146 | 148 | |
| 147 | -# 多值字段分隔:英文逗号、中文逗号、顿号,及历史约定的 ; | / 与空白 | |
| 149 | +# 多值字段分隔 | |
| 148 | 150 | _MULTI_VALUE_FIELD_SPLIT_RE = re.compile(r"[,、,;|/\n\t]+") |
| 151 | +# 表格单元格中视为「无内容」的占位 | |
| 152 | +_MARKDOWN_EMPTY_CELL_LITERALS: Tuple[str, ...] = ("-", "–", "—", "none", "null", "n/a", "无") | |
| 153 | +_MARKDOWN_EMPTY_CELL_TOKENS_CF: FrozenSet[str] = frozenset( | |
| 154 | + lit.casefold() for lit in _MARKDOWN_EMPTY_CELL_LITERALS | |
| 155 | +) | |
| 156 | + | |
| 157 | +def _normalize_markdown_table_cell(raw: Optional[str]) -> str: | |
| 158 | +    """去除首尾空白,并将占位符统一视为空字符串。""" | |
| 159 | + s = str(raw or "").strip() | |
| 160 | + if not s: | |
| 161 | + return "" | |
| 162 | + if s.casefold() in _MARKDOWN_EMPTY_CELL_TOKENS_CF: | |
| 163 | + return "" | |
| 164 | + return s | |
| 149 | 165 | _CORE_INDEX_LANGUAGES = ("zh", "en") |
| 150 | -_ANALYSIS_ATTRIBUTE_FIELD_MAP = ( | |
| 166 | +_DEFAULT_ENRICHMENT_SCOPES = ("generic", "category_taxonomy") | |
| 167 | +_DEFAULT_CATEGORY_TAXONOMY_PROFILE = "apparel" | |
| 168 | +_CONTENT_ANALYSIS_ATTRIBUTE_FIELD_MAP = ( | |
| 151 | 169 | ("tags", "enriched_tags"), |
| 152 | 170 | ("target_audience", "target_audience"), |
| 153 | 171 | ("usage_scene", "usage_scene"), |
| ... | ... | @@ -156,7 +174,7 @@ _ANALYSIS_ATTRIBUTE_FIELD_MAP = ( |
| 156 | 174 | ("material", "material"), |
| 157 | 175 | ("features", "features"), |
| 158 | 176 | ) |
| 159 | -_ANALYSIS_RESULT_FIELDS = ( | |
| 177 | +_CONTENT_ANALYSIS_RESULT_FIELDS = ( | |
| 160 | 178 | "title", |
| 161 | 179 | "category_path", |
| 162 | 180 | "tags", |
| ... | ... | @@ -168,7 +186,7 @@ _ANALYSIS_RESULT_FIELDS = ( |
| 168 | 186 | "features", |
| 169 | 187 | "anchor_text", |
| 170 | 188 | ) |
| 171 | -_ANALYSIS_MEANINGFUL_FIELDS = ( | |
| 189 | +_CONTENT_ANALYSIS_MEANINGFUL_FIELDS = ( | |
| 172 | 190 | "tags", |
| 173 | 191 | "target_audience", |
| 174 | 192 | "usage_scene", |
| ... | ... | @@ -178,9 +196,111 @@ _ANALYSIS_MEANINGFUL_FIELDS = ( |
| 178 | 196 | "features", |
| 179 | 197 | "anchor_text", |
| 180 | 198 | ) |
| 181 | -_ANALYSIS_FIELD_ALIASES = { | |
| 199 | +_CONTENT_ANALYSIS_FIELD_ALIASES = { | |
| 182 | 200 | "tags": ("tags", "enriched_tags"), |
| 183 | 201 | } |
| 202 | +_CONTENT_ANALYSIS_QUALITY_FIELDS = ("title", "category_path", "anchor_text") | |
| 203 | + | |
| 204 | + | |
| 205 | +@dataclass(frozen=True) | |
| 206 | +class AnalysisSchema: | |
| 207 | + name: str | |
| 208 | + shared_instruction: str | |
| 209 | + markdown_table_headers: Dict[str, List[str]] | |
| 210 | + result_fields: Tuple[str, ...] | |
| 211 | + meaningful_fields: Tuple[str, ...] | |
| 212 | + cache_version: str = "v1" | |
| 213 | + field_aliases: Dict[str, Tuple[str, ...]] = field(default_factory=dict) | |
| 214 | + quality_fields: Tuple[str, ...] = () | |
| 215 | + | |
| 216 | + def get_headers(self, target_lang: str) -> Optional[List[str]]: | |
| 217 | + return self.markdown_table_headers.get(target_lang) | |
| 218 | + | |
| 219 | + | |
| 220 | +_ANALYSIS_SCHEMAS: Dict[str, AnalysisSchema] = { | |
| 221 | + "content": AnalysisSchema( | |
| 222 | + name="content", | |
| 223 | + shared_instruction=SHARED_ANALYSIS_INSTRUCTION, | |
| 224 | + markdown_table_headers=LANGUAGE_MARKDOWN_TABLE_HEADERS, | |
| 225 | + result_fields=_CONTENT_ANALYSIS_RESULT_FIELDS, | |
| 226 | + meaningful_fields=_CONTENT_ANALYSIS_MEANINGFUL_FIELDS, | |
| 227 | + cache_version="v2", | |
| 228 | + field_aliases=_CONTENT_ANALYSIS_FIELD_ALIASES, | |
| 229 | + quality_fields=_CONTENT_ANALYSIS_QUALITY_FIELDS, | |
| 230 | + ), | |
| 231 | +} | |
| 232 | + | |
| 233 | +def _build_taxonomy_profile_schema(profile: str, config: Dict[str, Any]) -> AnalysisSchema: | |
| 234 | + return AnalysisSchema( | |
| 235 | + name=f"taxonomy:{profile}", | |
| 236 | + shared_instruction=config["shared_instruction"], | |
| 237 | + markdown_table_headers=config["markdown_table_headers"], | |
| 238 | + result_fields=tuple(field["key"] for field in config["fields"]), | |
| 239 | + meaningful_fields=tuple(field["key"] for field in config["fields"]), | |
| 240 | + cache_version="v1", | |
| 241 | + ) | |
| 242 | + | |
| 243 | + | |
| 244 | +_CATEGORY_TAXONOMY_PROFILE_SCHEMAS: Dict[str, AnalysisSchema] = { | |
| 245 | + profile: _build_taxonomy_profile_schema(profile, config) | |
| 246 | + for profile, config in CATEGORY_TAXONOMY_PROFILES.items() | |
| 247 | +} | |
| 248 | + | |
| 249 | +_CATEGORY_TAXONOMY_PROFILE_ATTRIBUTE_FIELD_MAPS: Dict[str, Tuple[Tuple[str, str], ...]] = { | |
| 250 | + profile: tuple((field["key"], field["label"]) for field in config["fields"]) | |
| 251 | + for profile, config in CATEGORY_TAXONOMY_PROFILES.items() | |
| 252 | +} | |
| 253 | + | |
| 254 | + | |
| 255 | +def get_supported_category_taxonomy_profiles() -> Tuple[str, ...]: | |
| 256 | + return tuple(_CATEGORY_TAXONOMY_PROFILE_SCHEMAS.keys()) | |
| 257 | + | |
| 258 | + | |
| 259 | +def _normalize_category_taxonomy_profile(category_taxonomy_profile: Optional[str] = None) -> str: | |
| 260 | + profile = str(category_taxonomy_profile or _DEFAULT_CATEGORY_TAXONOMY_PROFILE).strip() | |
| 261 | + if profile not in _CATEGORY_TAXONOMY_PROFILE_SCHEMAS: | |
| 262 | + supported = ", ".join(get_supported_category_taxonomy_profiles()) | |
| 263 | + raise ValueError( | |
| 264 | + f"Unsupported category_taxonomy_profile: {profile}. Supported profiles: {supported}" | |
| 265 | + ) | |
| 266 | + return profile | |
| 267 | + | |
| 268 | + | |
| 269 | +def _get_analysis_schema( | |
| 270 | + analysis_kind: str, | |
| 271 | + *, | |
| 272 | + category_taxonomy_profile: Optional[str] = None, | |
| 273 | +) -> AnalysisSchema: | |
| 274 | + if analysis_kind == "content": | |
| 275 | + return _ANALYSIS_SCHEMAS["content"] | |
| 276 | + if analysis_kind == "taxonomy": | |
| 277 | + profile = _normalize_category_taxonomy_profile(category_taxonomy_profile) | |
| 278 | + return _CATEGORY_TAXONOMY_PROFILE_SCHEMAS[profile] | |
| 279 | + raise ValueError(f"Unsupported analysis_kind: {analysis_kind}") | |
| 280 | + | |
| 281 | + | |
| 282 | +def _get_taxonomy_attribute_field_map( | |
| 283 | + category_taxonomy_profile: Optional[str] = None, | |
| 284 | +) -> Tuple[Tuple[str, str], ...]: | |
| 285 | + profile = _normalize_category_taxonomy_profile(category_taxonomy_profile) | |
| 286 | + return _CATEGORY_TAXONOMY_PROFILE_ATTRIBUTE_FIELD_MAPS[profile] | |
| 287 | + | |
| 288 | + | |
| 289 | +def _normalize_enrichment_scopes( | |
| 290 | + enrichment_scopes: Optional[List[str]] = None, | |
| 291 | +) -> Tuple[str, ...]: | |
| 292 | + requested = _DEFAULT_ENRICHMENT_SCOPES if not enrichment_scopes else tuple(enrichment_scopes) | |
| 293 | + normalized: List[str] = [] | |
| 294 | + seen = set() | |
| 295 | + for enrichment_scope in requested: | |
| 296 | + scope = str(enrichment_scope).strip() | |
| 297 | + if scope not in {"generic", "category_taxonomy"}: | |
| 298 | + raise ValueError(f"Unsupported enrichment_scope: {scope}") | |
| 299 | + if scope in seen: | |
| 300 | + continue | |
| 301 | + seen.add(scope) | |
| 302 | + normalized.append(scope) | |
| 303 | + return tuple(normalized) | |
| 184 | 304 | |
| 185 | 305 | |
| 186 | 306 | def split_multi_value_field(text: Optional[str]) -> List[str]: |
| ... | ... | @@ -235,12 +355,12 @@ def _get_product_id(product: Dict[str, Any]) -> str: |
| 235 | 355 | return str(product.get("id") or product.get("spu_id") or "").strip() |
| 236 | 356 | |
| 237 | 357 | |
| 238 | -def _get_analysis_field_aliases(field_name: str) -> Tuple[str, ...]: | |
| 239 | - return _ANALYSIS_FIELD_ALIASES.get(field_name, (field_name,)) | |
| 358 | +def _get_analysis_field_aliases(field_name: str, schema: AnalysisSchema) -> Tuple[str, ...]: | |
| 359 | + return schema.field_aliases.get(field_name, (field_name,)) | |
| 240 | 360 | |
| 241 | 361 | |
| 242 | -def _get_analysis_field_value(row: Dict[str, Any], field_name: str) -> Any: | |
| 243 | - for alias in _get_analysis_field_aliases(field_name): | |
| 362 | +def _get_analysis_field_value(row: Dict[str, Any], field_name: str, schema: AnalysisSchema) -> Any: | |
| 363 | + for alias in _get_analysis_field_aliases(field_name, schema): | |
| 244 | 364 | if alias in row: |
| 245 | 365 | return row.get(alias) |
| 246 | 366 | return None |
| ... | ... | @@ -261,6 +381,7 @@ def _has_meaningful_value(value: Any) -> bool: |
| 261 | 381 | def _make_empty_analysis_result( |
| 262 | 382 | product: Dict[str, Any], |
| 263 | 383 | target_lang: str, |
| 384 | + schema: AnalysisSchema, | |
| 264 | 385 | error: Optional[str] = None, |
| 265 | 386 | ) -> Dict[str, Any]: |
| 266 | 387 | result = { |
| ... | ... | @@ -268,7 +389,7 @@ def _make_empty_analysis_result( |
| 268 | 389 | "lang": target_lang, |
| 269 | 390 | "title_input": str(product.get("title") or "").strip(), |
| 270 | 391 | } |
| 271 | - for field in _ANALYSIS_RESULT_FIELDS: | |
| 392 | + for field in schema.result_fields: | |
| 272 | 393 | result[field] = "" |
| 273 | 394 | if error: |
| 274 | 395 | result["error"] = error |
| ... | ... | @@ -279,42 +400,59 @@ def _normalize_analysis_result( |
| 279 | 400 | result: Dict[str, Any], |
| 280 | 401 | product: Dict[str, Any], |
| 281 | 402 | target_lang: str, |
| 403 | + schema: AnalysisSchema, | |
| 282 | 404 | ) -> Dict[str, Any]: |
| 283 | - normalized = _make_empty_analysis_result(product, target_lang) | |
| 405 | + normalized = _make_empty_analysis_result(product, target_lang, schema) | |
| 284 | 406 | if not isinstance(result, dict): |
| 285 | 407 | return normalized |
| 286 | 408 | |
| 287 | 409 | normalized["lang"] = str(result.get("lang") or target_lang).strip() or target_lang |
| 288 | - normalized["title"] = str(result.get("title") or "").strip() | |
| 289 | - normalized["category_path"] = str(result.get("category_path") or "").strip() | |
| 290 | 410 | normalized["title_input"] = str( |
| 291 | 411 | product.get("title") or result.get("title_input") or "" |
| 292 | 412 | ).strip() |
| 293 | 413 | |
| 294 | - for field in _ANALYSIS_RESULT_FIELDS: | |
| 295 | - if field in {"title", "category_path"}: | |
| 296 | - continue | |
| 297 | - normalized[field] = str(_get_analysis_field_value(result, field) or "").strip() | |
| 414 | + for field in schema.result_fields: | |
| 415 | + normalized[field] = str(_get_analysis_field_value(result, field, schema) or "").strip() | |
| 298 | 416 | |
| 299 | 417 | if result.get("error"): |
| 300 | 418 | normalized["error"] = str(result.get("error")) |
| 301 | 419 | return normalized |
| 302 | 420 | |
| 303 | 421 | |
| 304 | -def _has_meaningful_analysis_content(result: Dict[str, Any]) -> bool: | |
| 305 | - return any(_has_meaningful_value(result.get(field)) for field in _ANALYSIS_MEANINGFUL_FIELDS) | |
| 422 | +def _has_meaningful_analysis_content(result: Dict[str, Any], schema: AnalysisSchema) -> bool: | |
| 423 | + return any(_has_meaningful_value(result.get(field)) for field in schema.meaningful_fields) | |
| 424 | + | |
| 425 | + | |
| 426 | +def _append_analysis_attributes( | |
| 427 | + target: List[Dict[str, Any]], | |
| 428 | + row: Dict[str, Any], | |
| 429 | + lang: str, | |
| 430 | + schema: AnalysisSchema, | |
| 431 | + field_map: Tuple[Tuple[str, str], ...], | |
| 432 | +) -> None: | |
| 433 | + for source_name, output_name in field_map: | |
| 434 | + raw = _get_analysis_field_value(row, source_name, schema) | |
| 435 | + if not raw: | |
| 436 | + continue | |
| 437 | + _append_named_lang_phrase_map( | |
| 438 | + target, | |
| 439 | + name=output_name, | |
| 440 | + lang=lang, | |
| 441 | + raw_value=raw, | |
| 442 | + ) | |
| 306 | 443 | |
| 307 | 444 | |
| 308 | 445 | def _apply_index_content_row(result: Dict[str, Any], row: Dict[str, Any], lang: str) -> None: |
| 309 | 446 | if not row or row.get("error"): |
| 310 | 447 | return |
| 311 | 448 | |
| 312 | - anchor_text = str(_get_analysis_field_value(row, "anchor_text") or "").strip() | |
| 449 | + content_schema = _get_analysis_schema("content") | |
| 450 | + anchor_text = str(_get_analysis_field_value(row, "anchor_text", content_schema) or "").strip() | |
| 313 | 451 | if anchor_text: |
| 314 | 452 | _append_lang_phrase_map(result["qanchors"], lang=lang, raw_value=anchor_text) |
| 315 | 453 | |
| 316 | - for source_name, output_name in _ANALYSIS_ATTRIBUTE_FIELD_MAP: | |
| 317 | - raw = _get_analysis_field_value(row, source_name) | |
| 454 | + for source_name, output_name in _CONTENT_ANALYSIS_ATTRIBUTE_FIELD_MAP: | |
| 455 | + raw = _get_analysis_field_value(row, source_name, content_schema) | |
| 318 | 456 | if not raw: |
| 319 | 457 | continue |
| 320 | 458 | _append_named_lang_phrase_map( |
| ... | ... | @@ -327,6 +465,28 @@ def _apply_index_content_row(result: Dict[str, Any], row: Dict[str, Any], lang: |
| 327 | 465 | _append_lang_phrase_map(result["enriched_tags"], lang=lang, raw_value=raw) |
| 328 | 466 | |
| 329 | 467 | |
| 468 | +def _apply_index_taxonomy_row( | |
| 469 | + result: Dict[str, Any], | |
| 470 | + row: Dict[str, Any], | |
| 471 | + lang: str, | |
| 472 | + *, | |
| 473 | + category_taxonomy_profile: Optional[str] = None, | |
| 474 | +) -> None: | |
| 475 | + if not row or row.get("error"): | |
| 476 | + return | |
| 477 | + | |
| 478 | + _append_analysis_attributes( | |
| 479 | + result["enriched_taxonomy_attributes"], | |
| 480 | + row=row, | |
| 481 | + lang=lang, | |
| 482 | + schema=_get_analysis_schema( | |
| 483 | + "taxonomy", | |
| 484 | + category_taxonomy_profile=category_taxonomy_profile, | |
| 485 | + ), | |
| 486 | + field_map=_get_taxonomy_attribute_field_map(category_taxonomy_profile), | |
| 487 | + ) | |
| 488 | + | |
| 489 | + | |
| 330 | 490 | def _normalize_index_content_item(item: Dict[str, Any]) -> Dict[str, str]: |
| 331 | 491 | item_id = _get_product_id(item) |
| 332 | 492 | return { |
| ... | ... | @@ -341,6 +501,8 @@ def _normalize_index_content_item(item: Dict[str, Any]) -> Dict[str, str]: |
| 341 | 501 | def build_index_content_fields( |
| 342 | 502 | items: List[Dict[str, Any]], |
| 343 | 503 | tenant_id: Optional[str] = None, |
| 504 | + enrichment_scopes: Optional[List[str]] = None, | |
| 505 | + category_taxonomy_profile: Optional[str] = None, | |
| 344 | 506 | ) -> List[Dict[str, Any]]: |
| 345 | 507 | """ |
| 346 | 508 | 高层入口:生成与 ES mapping 对齐的内容理解字段。 |
| ... | ... | @@ -349,18 +511,23 @@ def build_index_content_fields( |
| 349 | 511 | - `id` 或 `spu_id` |
| 350 | 512 | - `title` |
| 351 | 513 | - 可选 `brief` / `description` / `image_url` |
| 514 | + - 可选 `enrichment_scopes`,默认同时执行 `generic` 与 `category_taxonomy` | |
| 515 | + - 可选 `category_taxonomy_profile`,默认 `apparel` | |
| 352 | 516 | |
| 353 | 517 | 返回项结构: |
| 354 | 518 | - `id` |
| 355 | 519 | - `qanchors` |
| 356 | 520 | - `enriched_tags` |
| 357 | 521 | - `enriched_attributes` |
| 522 | + - `enriched_taxonomy_attributes` | |
| 358 | 523 | - 可选 `error` |
| 359 | 524 | |
| 360 | 525 | 其中: |
| 361 | 526 | - `qanchors.{lang}` 为短语数组 |
| 362 | 527 | - `enriched_tags.{lang}` 为标签数组 |
| 363 | 528 | """ |
| 529 | + requested_enrichment_scopes = _normalize_enrichment_scopes(enrichment_scopes) | |
| 530 | + normalized_taxonomy_profile = _normalize_category_taxonomy_profile(category_taxonomy_profile) | |
| 364 | 531 | normalized_items = [_normalize_index_content_item(item) for item in items] |
| 365 | 532 | if not normalized_items: |
| 366 | 533 | return [] |
| ... | ... | @@ -371,32 +538,72 @@ def build_index_content_fields( |
| 371 | 538 | "qanchors": {}, |
| 372 | 539 | "enriched_tags": {}, |
| 373 | 540 | "enriched_attributes": [], |
| 541 | + "enriched_taxonomy_attributes": [], | |
| 374 | 542 | } |
| 375 | 543 | for item in normalized_items |
| 376 | 544 | } |
| 377 | 545 | |
| 378 | 546 | for lang in _CORE_INDEX_LANGUAGES: |
| 379 | - try: | |
| 380 | - rows = analyze_products( | |
| 381 | - products=normalized_items, | |
| 382 | - target_lang=lang, | |
| 383 | - batch_size=BATCH_SIZE, | |
| 384 | - tenant_id=tenant_id, | |
| 385 | - ) | |
| 386 | - except Exception as e: | |
| 387 | - logger.warning("build_index_content_fields failed for lang=%s: %s", lang, e) | |
| 388 | - for item in normalized_items: | |
| 389 | - results_by_id[item["id"]].setdefault("error", str(e)) | |
| 390 | - continue | |
| 391 | - | |
| 392 | - for row in rows or []: | |
| 393 | - item_id = str(row.get("id") or "").strip() | |
| 394 | - if not item_id or item_id not in results_by_id: | |
| 547 | + if "generic" in requested_enrichment_scopes: | |
| 548 | + try: | |
| 549 | + rows = analyze_products( | |
| 550 | + products=normalized_items, | |
| 551 | + target_lang=lang, | |
| 552 | + batch_size=BATCH_SIZE, | |
| 553 | + tenant_id=tenant_id, | |
| 554 | + analysis_kind="content", | |
| 555 | + category_taxonomy_profile=normalized_taxonomy_profile, | |
| 556 | + ) | |
| 557 | + except Exception as e: | |
| 558 | + logger.warning("build_index_content_fields content enrichment failed for lang=%s: %s", lang, e) | |
| 559 | + for item in normalized_items: | |
| 560 | + results_by_id[item["id"]].setdefault("error", str(e)) | |
| 395 | 561 | continue |
| 396 | - if row.get("error"): | |
| 397 | - results_by_id[item_id].setdefault("error", row["error"]) | |
| 562 | + | |
| 563 | + for row in rows or []: | |
| 564 | + item_id = str(row.get("id") or "").strip() | |
| 565 | + if not item_id or item_id not in results_by_id: | |
| 566 | + continue | |
| 567 | + if row.get("error"): | |
| 568 | + results_by_id[item_id].setdefault("error", row["error"]) | |
| 569 | + continue | |
| 570 | + _apply_index_content_row(results_by_id[item_id], row=row, lang=lang) | |
| 571 | + | |
| 572 | + if "category_taxonomy" in requested_enrichment_scopes: | |
| 573 | + for lang in _CORE_INDEX_LANGUAGES: | |
| 574 | + try: | |
| 575 | + taxonomy_rows = analyze_products( | |
| 576 | + products=normalized_items, | |
| 577 | + target_lang=lang, | |
| 578 | + batch_size=BATCH_SIZE, | |
| 579 | + tenant_id=tenant_id, | |
| 580 | + analysis_kind="taxonomy", | |
| 581 | + category_taxonomy_profile=normalized_taxonomy_profile, | |
| 582 | + ) | |
| 583 | + except Exception as e: | |
| 584 | + logger.warning( | |
| 585 | + "build_index_content_fields taxonomy enrichment failed for profile=%s lang=%s: %s", | |
| 586 | + normalized_taxonomy_profile, | |
| 587 | + lang, | |
| 588 | + e, | |
| 589 | + ) | |
| 590 | + for item in normalized_items: | |
| 591 | + results_by_id[item["id"]].setdefault("error", str(e)) | |
| 398 | 592 | continue |
| 399 | - _apply_index_content_row(results_by_id[item_id], row=row, lang=lang) | |
| 593 | + | |
| 594 | + for row in taxonomy_rows or []: | |
| 595 | + item_id = str(row.get("id") or "").strip() | |
| 596 | + if not item_id or item_id not in results_by_id: | |
| 597 | + continue | |
| 598 | + if row.get("error"): | |
| 599 | + results_by_id[item_id].setdefault("error", row["error"]) | |
| 600 | + continue | |
| 601 | + _apply_index_taxonomy_row( | |
| 602 | + results_by_id[item_id], | |
| 603 | + row=row, | |
| 604 | + lang=lang, | |
| 605 | + category_taxonomy_profile=normalized_taxonomy_profile, | |
| 606 | + ) | |
| 400 | 607 | |
| 401 | 608 | return [results_by_id[item["id"]] for item in normalized_items] |
| 402 | 609 | |
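Taken together with the docstring above, every returned item keeps the same skeleton whether or not a given enrichment scope ran or failed. A minimal standalone sketch of the per-item shape the function seeds before any LLM call (the ids are hypothetical; field names come from the diff):

```python
normalized_items = [{"id": "spu-1"}, {"id": "spu-2"}]  # after _normalize_index_content_item

# Seed one result skeleton per input item, keyed by id, so later
# per-language rows can be merged in without reordering the output.
results_by_id = {
    item["id"]: {
        "id": item["id"],
        "qanchors": {},
        "enriched_tags": {},
        "enriched_attributes": [],
        "enriched_taxonomy_attributes": [],
    }
    for item in normalized_items
}

# The output list is rebuilt from normalized_items, so it preserves the
# original input order even though merging is keyed by id.
output = [results_by_id[item["id"]] for item in normalized_items]
```

Because merging is keyed by `id` while the final list is rebuilt from `normalized_items`, output order always matches input order, even when enrichment passes complete out of order.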
| ... | ... | @@ -463,52 +670,129 @@ def _build_prompt_input_text(product: Dict[str, Any]) -> str: |
| 463 | 670 | return _truncate_by_words(candidate, PROMPT_INPUT_MAX_WORDS) |
| 464 | 671 | |
| 465 | 672 | |
| 466 | -def _make_anchor_cache_key( | |
| 673 | +def _make_analysis_cache_key( | |
| 467 | 674 | product: Dict[str, Any], |
| 468 | 675 | target_lang: str, |
| 676 | + analysis_kind: str, | |
| 677 | + category_taxonomy_profile: Optional[str] = None, | |
| 469 | 678 | ) -> str: |
| 470 | - """Build the cache key; determined solely by the actual prompt input text + target language.""" | |
| 679 | + """Build the cache key; determined solely by the analysis kind, the actual prompt input text, and the target language.""" | |
| 680 | + schema = _get_analysis_schema( | |
| 681 | + analysis_kind, | |
| 682 | + category_taxonomy_profile=category_taxonomy_profile, | |
| 683 | + ) | |
| 471 | 684 | prompt_input = _build_prompt_input_text(product) |
| 472 | 685 | h = hashlib.md5(prompt_input.encode("utf-8")).hexdigest() |
| 473 | - return f"{ANCHOR_CACHE_PREFIX}:{target_lang}:{prompt_input[:4]}{h}" | |
| 686 | + prompt_contract = { | |
| 687 | + "schema_name": schema.name, | |
| 688 | + "cache_version": schema.cache_version, | |
| 689 | + "system_message": SYSTEM_MESSAGE, | |
| 690 | + "user_instruction_template": USER_INSTRUCTION_TEMPLATE, | |
| 691 | + "shared_instruction": schema.shared_instruction, | |
| 692 | + "assistant_headers": schema.get_headers(target_lang), | |
| 693 | + "result_fields": schema.result_fields, | |
| 694 | + "meaningful_fields": schema.meaningful_fields, | |
| 695 | + "field_aliases": schema.field_aliases, | |
| 696 | + } | |
| 697 | + prompt_contract_hash = hashlib.md5( | |
| 698 | + json.dumps(prompt_contract, ensure_ascii=False, sort_keys=True).encode("utf-8") | |
| 699 | + ).hexdigest()[:12] | |
| 700 | + return ( | |
| 701 | + f"{ANCHOR_CACHE_PREFIX}:{analysis_kind}:{prompt_contract_hash}:" | |
| 702 | + f"{target_lang}:{prompt_input[:4]}{h}" | |
| 703 | + ) | |
| 474 | 704 | |
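The `prompt_contract_hash` above exists so that any change to the schema or prompt templates silently invalidates old cache entries instead of serving stale rows. A standalone sketch of the hashing step (the contract keys here are abbreviated; the real key set is listed in the diff):

```python
import hashlib
import json

def contract_hash(contract: dict) -> str:
    # sort_keys gives a canonical serialization, so the hash depends only on
    # the contract's contents, not on dict insertion order.
    payload = json.dumps(contract, ensure_ascii=False, sort_keys=True)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()[:12]

a = contract_hash({"schema_name": "content", "cache_version": 1})
b = contract_hash({"cache_version": 1, "schema_name": "content"})  # same contents, different order
c = contract_hash({"schema_name": "content", "cache_version": 2})  # bumped version
```

Bumping `cache_version` (or editing any template folded into the contract) changes the hash, so entries written under the old prompt contract can never be read back.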
| 475 | 705 | |
| 476 | -def _get_cached_anchor_result( | |
| 706 | +def _make_anchor_cache_key( | |
| 477 | 707 | product: Dict[str, Any], |
| 478 | 708 | target_lang: str, |
| 709 | +) -> str: | |
| 710 | + return _make_analysis_cache_key(product, target_lang, analysis_kind="content") | |
| 711 | + | |
| 712 | + | |
| 713 | +def _get_cached_analysis_result( | |
| 714 | + product: Dict[str, Any], | |
| 715 | + target_lang: str, | |
| 716 | + analysis_kind: str, | |
| 717 | + category_taxonomy_profile: Optional[str] = None, | |
| 479 | 718 | ) -> Optional[Dict[str, Any]]: |
| 480 | 719 | if not _anchor_redis: |
| 481 | 720 | return None |
| 721 | + schema = _get_analysis_schema( | |
| 722 | + analysis_kind, | |
| 723 | + category_taxonomy_profile=category_taxonomy_profile, | |
| 724 | + ) | |
| 482 | 725 | try: |
| 483 | - key = _make_anchor_cache_key(product, target_lang) | |
| 726 | + key = _make_analysis_cache_key( | |
| 727 | + product, | |
| 728 | + target_lang, | |
| 729 | + analysis_kind, | |
| 730 | + category_taxonomy_profile=category_taxonomy_profile, | |
| 731 | + ) | |
| 484 | 732 | raw = _anchor_redis.get(key) |
| 485 | 733 | if not raw: |
| 486 | 734 | return None |
| 487 | - result = _normalize_analysis_result(json.loads(raw), product=product, target_lang=target_lang) | |
| 488 | - if not _has_meaningful_analysis_content(result): | |
| 735 | + result = _normalize_analysis_result( | |
| 736 | + json.loads(raw), | |
| 737 | + product=product, | |
| 738 | + target_lang=target_lang, | |
| 739 | + schema=schema, | |
| 740 | + ) | |
| 741 | + if not _has_meaningful_analysis_content(result, schema): | |
| 489 | 742 | return None |
| 490 | 743 | return result |
| 491 | 744 | except Exception as e: |
| 492 | - logger.warning(f"Failed to get anchor cache: {e}") | |
| 745 | + logger.warning("Failed to get %s analysis cache: %s", analysis_kind, e) | |
| 493 | 746 | return None |
| 494 | 747 | |
| 495 | 748 | |
| 496 | -def _set_cached_anchor_result( | |
| 749 | +def _get_cached_anchor_result( | |
| 750 | + product: Dict[str, Any], | |
| 751 | + target_lang: str, | |
| 752 | +) -> Optional[Dict[str, Any]]: | |
| 753 | + return _get_cached_analysis_result(product, target_lang, analysis_kind="content") | |
| 754 | + | |
| 755 | + | |
| 756 | +def _set_cached_analysis_result( | |
| 497 | 757 | product: Dict[str, Any], |
| 498 | 758 | target_lang: str, |
| 499 | 759 | result: Dict[str, Any], |
| 760 | + analysis_kind: str, | |
| 761 | + category_taxonomy_profile: Optional[str] = None, | |
| 500 | 762 | ) -> None: |
| 501 | 763 | if not _anchor_redis: |
| 502 | 764 | return |
| 765 | + schema = _get_analysis_schema( | |
| 766 | + analysis_kind, | |
| 767 | + category_taxonomy_profile=category_taxonomy_profile, | |
| 768 | + ) | |
| 503 | 769 | try: |
| 504 | - normalized = _normalize_analysis_result(result, product=product, target_lang=target_lang) | |
| 505 | - if not _has_meaningful_analysis_content(normalized): | |
| 770 | + normalized = _normalize_analysis_result( | |
| 771 | + result, | |
| 772 | + product=product, | |
| 773 | + target_lang=target_lang, | |
| 774 | + schema=schema, | |
| 775 | + ) | |
| 776 | + if not _has_meaningful_analysis_content(normalized, schema): | |
| 506 | 777 | return |
| 507 | - key = _make_anchor_cache_key(product, target_lang) | |
| 778 | + key = _make_analysis_cache_key( | |
| 779 | + product, | |
| 780 | + target_lang, | |
| 781 | + analysis_kind, | |
| 782 | + category_taxonomy_profile=category_taxonomy_profile, | |
| 783 | + ) | |
| 508 | 784 | ttl = ANCHOR_CACHE_EXPIRE_DAYS * 24 * 3600 |
| 509 | 785 | _anchor_redis.setex(key, ttl, json.dumps(normalized, ensure_ascii=False)) |
| 510 | 786 | except Exception as e: |
| 511 | - logger.warning(f"Failed to set anchor cache: {e}") | |
| 787 | + logger.warning("Failed to set %s analysis cache: %s", analysis_kind, e) | |
| 788 | + | |
| 789 | + | |
| 790 | +def _set_cached_anchor_result( | |
| 791 | + product: Dict[str, Any], | |
| 792 | + target_lang: str, | |
| 793 | + result: Dict[str, Any], | |
| 794 | +) -> None: | |
| 795 | + _set_cached_analysis_result(product, target_lang, result, analysis_kind="content") | |
| 512 | 796 | |
| 513 | 797 | |
| 514 | 798 | def _build_assistant_prefix(headers: List[str]) -> str: |
| ... | ... | @@ -517,8 +801,8 @@ def _build_assistant_prefix(headers: List[str]) -> str: |
| 517 | 801 | return f"{header_line}\n{separator_line}\n" |
| 518 | 802 | |
| 519 | 803 | |
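Partial Mode works by seeding the assistant turn with the table header and separator rows, so the model can only continue with data rows. The construction of `header_line` and `separator_line` is elided from this hunk; the sketch below is an assumed reconstruction of a typical markdown-table prefix, not the actual implementation:

```python
def build_assistant_prefix(headers):
    # Assumed shape: a markdown header row followed by a '---' separator row.
    header_line = "| " + " | ".join(headers) + " |"
    separator_line = "| " + " | ".join("---" for _ in headers) + " |"
    return f"{header_line}\n{separator_line}\n"

prefix = build_assistant_prefix(["No.", "Product Type", "Fit"])
```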
| 520 | -def _build_shared_context(products: List[Dict[str, str]]) -> str: | |
| 521 | - shared_context = SHARED_ANALYSIS_INSTRUCTION | |
| 804 | +def _build_shared_context(products: List[Dict[str, str]], schema: AnalysisSchema) -> str: | |
| 805 | + shared_context = schema.shared_instruction | |
| 522 | 806 | for idx, product in enumerate(products, 1): |
| 523 | 807 | prompt_input = _build_prompt_input_text(product) |
| 524 | 808 | shared_context += f"{idx}. {prompt_input}\n" |
| ... | ... | @@ -550,16 +834,23 @@ def reset_logged_shared_context_keys() -> None: |
| 550 | 834 | def create_prompt( |
| 551 | 835 | products: List[Dict[str, str]], |
| 552 | 836 | target_lang: str = "zh", |
| 553 | -) -> Tuple[str, str, str]: | |
| 837 | + analysis_kind: str = "content", | |
| 838 | + category_taxonomy_profile: Optional[str] = None, | |
| 839 | +) -> Tuple[Optional[str], Optional[str], Optional[str]]: | |
| 554 | 840 | """Create the shared context, localized output requirements, and Partial Mode assistant prefix for the target language.""" |
| 555 | - markdown_table_headers = LANGUAGE_MARKDOWN_TABLE_HEADERS.get(target_lang) | |
| 841 | + schema = _get_analysis_schema( | |
| 842 | + analysis_kind, | |
| 843 | + category_taxonomy_profile=category_taxonomy_profile, | |
| 844 | + ) | |
| 845 | + markdown_table_headers = schema.get_headers(target_lang) | |
| 556 | 846 | if not markdown_table_headers: |
| 557 | 847 | logger.warning( |
| 558 | - "Unsupported target_lang for markdown table headers: %s", | |
| 848 | + "Unsupported target_lang for markdown table headers: kind=%s lang=%s", | |
| 849 | + analysis_kind, | |
| 559 | 850 | target_lang, |
| 560 | 851 | ) |
| 561 | 852 | return None, None, None |
| 562 | - shared_context = _build_shared_context(products) | |
| 853 | + shared_context = _build_shared_context(products, schema) | |
| 563 | 854 | language_label = SOURCE_LANG_CODE_MAP.get(target_lang, target_lang) |
| 564 | 855 | user_prompt = USER_INSTRUCTION_TEMPLATE.format(language=language_label).strip() |
| 565 | 856 | assistant_prefix = _build_assistant_prefix(markdown_table_headers) |
| ... | ... | @@ -592,6 +883,7 @@ def call_llm( |
| 592 | 883 | user_prompt: str, |
| 593 | 884 | assistant_prefix: str, |
| 594 | 885 | target_lang: str = "zh", |
| 886 | + analysis_kind: str = "content", | |
| 595 | 887 | ) -> Tuple[str, str]: |
| 596 | 888 | """Call the LLM API (with retries), using Partial Mode to force the markdown table prefix.""" |
| 597 | 889 | headers = { |
| ... | ... | @@ -631,8 +923,9 @@ def call_llm( |
| 631 | 923 | if _mark_shared_context_logged_once(shared_context_key): |
| 632 | 924 | logger.info(f"\n{'=' * 80}") |
| 633 | 925 | logger.info( |
| 634 | - "LLM Shared Context [model=%s, shared_key=%s, chars=%s] (logged once per process key)", | |
| 926 | + "LLM Shared Context [model=%s, kind=%s, shared_key=%s, chars=%s] (logged once per process key)", | |
| 635 | 927 | MODEL_NAME, |
| 928 | + analysis_kind, | |
| 636 | 929 | shared_context_key, |
| 637 | 930 | len(shared_context), |
| 638 | 931 | ) |
| ... | ... | @@ -641,8 +934,9 @@ def call_llm( |
| 641 | 934 | |
| 642 | 935 | verbose_logger.info(f"\n{'=' * 80}") |
| 643 | 936 | verbose_logger.info( |
| 644 | - "LLM Request [model=%s, lang=%s, shared_key=%s, tail_key=%s]:", | |
| 937 | + "LLM Request [model=%s, kind=%s, lang=%s, shared_key=%s, tail_key=%s]:", | |
| 645 | 938 | MODEL_NAME, |
| 939 | + analysis_kind, | |
| 646 | 940 | target_lang, |
| 647 | 941 | shared_context_key, |
| 648 | 942 | localized_tail_key, |
| ... | ... | @@ -654,7 +948,8 @@ def call_llm( |
| 654 | 948 | verbose_logger.info(f"\nAssistant Prefix:\n{assistant_prefix}") |
| 655 | 949 | |
| 656 | 950 | logger.info( |
| 657 | - "\nLLM Request Variant [lang=%s, shared_key=%s, tail_key=%s, prompt_chars=%s, prefix_chars=%s]", | |
| 951 | + "\nLLM Request Variant [kind=%s, lang=%s, shared_key=%s, tail_key=%s, prompt_chars=%s, prefix_chars=%s]", | |
| 952 | + analysis_kind, | |
| 658 | 953 | target_lang, |
| 659 | 954 | shared_context_key, |
| 660 | 955 | localized_tail_key, |
| ... | ... | @@ -685,8 +980,9 @@ def call_llm( |
| 685 | 980 | usage = result.get("usage") or {} |
| 686 | 981 | |
| 687 | 982 | verbose_logger.info( |
| 688 | - "\nLLM Response [model=%s, lang=%s, shared_key=%s, tail_key=%s]:", | |
| 983 | + "\nLLM Response [model=%s, kind=%s, lang=%s, shared_key=%s, tail_key=%s]:", | |
| 689 | 984 | MODEL_NAME, |
| 985 | + analysis_kind, | |
| 690 | 986 | target_lang, |
| 691 | 987 | shared_context_key, |
| 692 | 988 | localized_tail_key, |
| ... | ... | @@ -697,7 +993,8 @@ def call_llm( |
| 697 | 993 | full_markdown = _merge_partial_response(assistant_prefix, generated_content) |
| 698 | 994 | |
| 699 | 995 | logger.info( |
| 700 | - "\nLLM Response Summary [lang=%s, shared_key=%s, tail_key=%s, generated_chars=%s, completion_tokens=%s, prompt_tokens=%s, total_tokens=%s]", | |
| 996 | + "\nLLM Response Summary [kind=%s, lang=%s, shared_key=%s, tail_key=%s, generated_chars=%s, completion_tokens=%s, prompt_tokens=%s, total_tokens=%s]", | |
| 997 | + analysis_kind, | |
| 701 | 998 | target_lang, |
| 702 | 999 | shared_context_key, |
| 703 | 1000 | localized_tail_key, |
| ... | ... | @@ -742,8 +1039,16 @@ def call_llm( |
| 742 | 1039 | session.close() |
| 743 | 1040 | |
| 744 | 1041 | |
| 745 | -def parse_markdown_table(markdown_content: str) -> List[Dict[str, str]]: | |
| 1042 | +def parse_markdown_table( | |
| 1043 | + markdown_content: str, | |
| 1044 | + analysis_kind: str = "content", | |
| 1045 | + category_taxonomy_profile: Optional[str] = None, | |
| 1046 | +) -> List[Dict[str, str]]: | |
| 746 | 1047 | """Parse markdown table content.""" |
| 1048 | + schema = _get_analysis_schema( | |
| 1049 | + analysis_kind, | |
| 1050 | + category_taxonomy_profile=category_taxonomy_profile, | |
| 1051 | + ) | |
| 747 | 1052 | lines = markdown_content.strip().split("\n") |
| 748 | 1053 | data = [] |
| 749 | 1054 | data_started = False |
| ... | ... | @@ -768,22 +1073,16 @@ def parse_markdown_table(markdown_content: str) -> List[Dict[str, str]]: |
| 768 | 1073 | |
| 769 | 1074 | # Parse a data row |
| 770 | 1075 | parts = [p.strip() for p in line.split("|")] |
| 771 | - parts = [p for p in parts if p] # drop empty strings | |
| 1076 | + if parts and parts[0] == "": | |
| 1077 | + parts = parts[1:] | |
| 1078 | + if parts and parts[-1] == "": | |
| 1079 | + parts = parts[:-1] | |
| 772 | 1080 | |
| 773 | 1081 | if len(parts) >= 2: |
| 774 | - row = { | |
| 775 | - "seq_no": parts[0], | |
| 776 | - "title": parts[1], # product title (in target language) | |
| 777 | - "category_path": parts[2] if len(parts) > 2 else "", # category path | |
| 778 | - "tags": parts[3] if len(parts) > 3 else "", # fine-grained tags | |
| 779 | - "target_audience": parts[4] if len(parts) > 4 else "", # target audience | |
| 780 | - "usage_scene": parts[5] if len(parts) > 5 else "", # usage scene | |
| 781 | - "season": parts[6] if len(parts) > 6 else "", # applicable season | |
| 782 | - "key_attributes": parts[7] if len(parts) > 7 else "", # key attributes | |
| 783 | - "material": parts[8] if len(parts) > 8 else "", # material | |
| 784 | - "features": parts[9] if len(parts) > 9 else "", # features | |
| 785 | - "anchor_text": parts[10] if len(parts) > 10 else "", # anchor text | |
| 786 | - } | |
| 1082 | + row = {"seq_no": parts[0]} | |
| 1083 | + for field_index, field_name in enumerate(schema.result_fields, start=1): | |
| 1084 | + cell = parts[field_index] if len(parts) > field_index else "" | |
| 1085 | + row[field_name] = _normalize_markdown_table_cell(cell) | |
| 787 | 1086 | data.append(row) |
| 788 | 1087 | |
| 789 | 1088 | return data |
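The trimming change above is behavioral, not cosmetic: the removed `if p` filter dropped interior empty cells, shifting every later column left whenever the model legitimately left a cell blank. Stripping only the empties produced by the leading and trailing pipes preserves cell positions. A standalone sketch of the difference:

```python
def split_row(line: str) -> list:
    parts = [p.strip() for p in line.split("|")]
    # Drop only the empties produced by the leading and trailing pipes,
    # keeping genuinely blank interior cells in place.
    if parts and parts[0] == "":
        parts = parts[1:]
    if parts and parts[-1] == "":
        parts = parts[:-1]
    return parts

row = "| 1 | Red Dress |  | summer |"  # third cell intentionally blank
old_parts = [p for p in [p.strip() for p in row.split("|")] if p]  # old filter: cells shift left
new_parts = split_row(row)  # new logic: blank cell survives in place
```

With the old filter, `summer` would land in the blank column's slot and every field after it would be mislabeled; the new logic keeps each cell at its declared index.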
| ... | ... | @@ -794,31 +1093,49 @@ def _log_parsed_result_quality( |
| 794 | 1093 | parsed_results: List[Dict[str, str]], |
| 795 | 1094 | target_lang: str, |
| 796 | 1095 | batch_num: int, |
| 1096 | + analysis_kind: str, | |
| 1097 | + category_taxonomy_profile: Optional[str] = None, | |
| 797 | 1098 | ) -> None: |
| 1099 | + schema = _get_analysis_schema( | |
| 1100 | + analysis_kind, | |
| 1101 | + category_taxonomy_profile=category_taxonomy_profile, | |
| 1102 | + ) | |
| 798 | 1103 | expected = len(batch_data) |
| 799 | 1104 | actual = len(parsed_results) |
| 800 | 1105 | if actual != expected: |
| 801 | 1106 | logger.warning( |
| 802 | - "Parsed row count mismatch for batch=%s lang=%s: expected=%s actual=%s", | |
| 1107 | + "Parsed row count mismatch for kind=%s batch=%s lang=%s: expected=%s actual=%s", | |
| 1108 | + analysis_kind, | |
| 803 | 1109 | batch_num, |
| 804 | 1110 | target_lang, |
| 805 | 1111 | expected, |
| 806 | 1112 | actual, |
| 807 | 1113 | ) |
| 808 | 1114 | |
| 809 | - missing_anchor = sum(1 for item in parsed_results if not str(item.get("anchor_text") or "").strip()) | |
| 810 | - missing_category = sum(1 for item in parsed_results if not str(item.get("category_path") or "").strip()) | |
| 811 | - missing_title = sum(1 for item in parsed_results if not str(item.get("title") or "").strip()) | |
| 1115 | + if not schema.quality_fields: | |
| 1116 | + logger.info( | |
| 1117 | + "Parsed Quality Summary [kind=%s, batch=%s, lang=%s]: rows=%s/%s", | |
| 1118 | + analysis_kind, | |
| 1119 | + batch_num, | |
| 1120 | + target_lang, | |
| 1121 | + actual, | |
| 1122 | + expected, | |
| 1123 | + ) | |
| 1124 | + return | |
| 812 | 1125 | |
| 1126 | + missing_summary = ", ".join( | |
| 1127 | + f"missing_{field}=" | |
| 1128 | + f"{sum(1 for item in parsed_results if not str(item.get(field) or '').strip())}" | |
| 1129 | + for field in schema.quality_fields | |
| 1130 | + ) | |
| 813 | 1131 | logger.info( |
| 814 | - "Parsed Quality Summary [batch=%s, lang=%s]: rows=%s/%s, missing_title=%s, missing_category=%s, missing_anchor=%s", | |
| 1132 | + "Parsed Quality Summary [kind=%s, batch=%s, lang=%s]: rows=%s/%s, %s", | |
| 1133 | + analysis_kind, | |
| 815 | 1134 | batch_num, |
| 816 | 1135 | target_lang, |
| 817 | 1136 | actual, |
| 818 | 1137 | expected, |
| 819 | - missing_title, | |
| 820 | - missing_category, | |
| 821 | - missing_anchor, | |
| 1138 | + missing_summary, | |
| 822 | 1139 | ) |
| 823 | 1140 | |
| 824 | 1141 | |
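The generator expression above counts, per quality field, how many parsed rows came back effectively empty: a missing key, an empty string, and a whitespace-only string all count as missing. A standalone sketch of that counting rule:

```python
parsed_results = [
    {"anchor_text": "rc car"},  # present
    {"anchor_text": ""},        # empty string counts as missing
    {"anchor_text": "   "},     # whitespace-only counts as missing
    {},                         # absent key counts as missing
]
quality_fields = ["anchor_text"]

# Same expression as in _log_parsed_result_quality above.
missing_summary = ", ".join(
    f"missing_{field}="
    f"{sum(1 for item in parsed_results if not str(item.get(field) or '').strip())}"
    for field in quality_fields
)
```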
| ... | ... | @@ -826,29 +1143,44 @@ def process_batch( |
| 826 | 1143 | batch_data: List[Dict[str, str]], |
| 827 | 1144 | batch_num: int, |
| 828 | 1145 | target_lang: str = "zh", |
| 1146 | + analysis_kind: str = "content", | |
| 1147 | + category_taxonomy_profile: Optional[str] = None, | |
| 829 | 1148 | ) -> List[Dict[str, Any]]: |
| 830 | 1149 | """Process one batch of data.""" |
| 1150 | + schema = _get_analysis_schema( | |
| 1151 | + analysis_kind, | |
| 1152 | + category_taxonomy_profile=category_taxonomy_profile, | |
| 1153 | + ) | |
| 831 | 1154 | logger.info(f"\n{'#' * 80}") |
| 832 | - logger.info(f"Processing Batch {batch_num} ({len(batch_data)} items)") | |
| 1155 | + logger.info( | |
| 1156 | + "Processing Batch %s (%s items, kind=%s)", | |
| 1157 | + batch_num, | |
| 1158 | + len(batch_data), | |
| 1159 | + analysis_kind, | |
| 1160 | + ) | |
| 833 | 1161 | |
| 834 | 1162 | # Build the prompt |
| 835 | 1163 | shared_context, user_prompt, assistant_prefix = create_prompt( |
| 836 | 1164 | batch_data, |
| 837 | 1165 | target_lang=target_lang, |
| 1166 | + analysis_kind=analysis_kind, | |
| 1167 | + category_taxonomy_profile=category_taxonomy_profile, | |
| 838 | 1168 | ) |
| 839 | 1169 | |
| 840 | 1170 | # If prompt creation fails (e.g. unsupported target_lang), fail the whole batch and skip the LLM call |
| 841 | 1171 | if shared_context is None or user_prompt is None or assistant_prefix is None: |
| 842 | 1172 | logger.error( |
| 843 | - "Failed to create prompt for batch %s, target_lang=%s; " | |
| 1173 | + "Failed to create prompt for batch %s, kind=%s, target_lang=%s; " | |
| 844 | 1174 | "marking entire batch as failed without calling LLM", |
| 845 | 1175 | batch_num, |
| 1176 | + analysis_kind, | |
| 846 | 1177 | target_lang, |
| 847 | 1178 | ) |
| 848 | 1179 | return [ |
| 849 | 1180 | _make_empty_analysis_result( |
| 850 | 1181 | item, |
| 851 | 1182 | target_lang, |
| 1183 | + schema, | |
| 852 | 1184 | error=f"prompt_creation_failed: unsupported target_lang={target_lang}", |
| 853 | 1185 | ) |
| 854 | 1186 | for item in batch_data |
| ... | ... | @@ -861,11 +1193,23 @@ def process_batch( |
| 861 | 1193 | user_prompt, |
| 862 | 1194 | assistant_prefix, |
| 863 | 1195 | target_lang=target_lang, |
| 1196 | + analysis_kind=analysis_kind, | |
| 864 | 1197 | ) |
| 865 | 1198 | |
| 866 | 1199 | # Parse the result |
| 867 | - parsed_results = parse_markdown_table(raw_response) | |
| 868 | - _log_parsed_result_quality(batch_data, parsed_results, target_lang, batch_num) | |
| 1200 | + parsed_results = parse_markdown_table( | |
| 1201 | + raw_response, | |
| 1202 | + analysis_kind=analysis_kind, | |
| 1203 | + category_taxonomy_profile=category_taxonomy_profile, | |
| 1204 | + ) | |
| 1205 | + _log_parsed_result_quality( | |
| 1206 | + batch_data, | |
| 1207 | + parsed_results, | |
| 1208 | + target_lang, | |
| 1209 | + batch_num, | |
| 1210 | + analysis_kind, | |
| 1211 | + category_taxonomy_profile, | |
| 1212 | + ) | |
| 869 | 1213 | |
| 870 | 1214 | logger.info(f"\nParsed Results ({len(parsed_results)} items):") |
| 871 | 1215 | logger.info(json.dumps(parsed_results, ensure_ascii=False, indent=2)) |
| ... | ... | @@ -879,10 +1223,12 @@ def process_batch( |
| 879 | 1223 | parsed_item, |
| 880 | 1224 | product=source_product, |
| 881 | 1225 | target_lang=target_lang, |
| 1226 | + schema=schema, | |
| 882 | 1227 | ) |
| 883 | 1228 | results_with_ids.append(result) |
| 884 | 1229 | logger.info( |
| 885 | - "Mapped: seq=%s -> original_id=%s", | |
| 1230 | + "Mapped: kind=%s seq=%s -> original_id=%s", | |
| 1231 | + analysis_kind, | |
| 886 | 1232 | parsed_item.get("seq_no"), |
| 887 | 1233 | source_product.get("id"), |
| 888 | 1234 | ) |
| ... | ... | @@ -890,6 +1236,7 @@ def process_batch( |
| 890 | 1236 | # Save the batch JSON log to a standalone file |
| 891 | 1237 | batch_log = { |
| 892 | 1238 | "batch_num": batch_num, |
| 1239 | + "analysis_kind": analysis_kind, | |
| 893 | 1240 | "timestamp": datetime.now().isoformat(), |
| 894 | 1241 | "input_products": batch_data, |
| 895 | 1242 | "raw_response": raw_response, |
| ... | ... | @@ -900,7 +1247,10 @@ def process_batch( |
| 900 | 1247 | |
| 901 | 1248 | # Keep filenames unique so concurrent batch json log writes cannot overwrite each other |
| 902 | 1249 | batch_call_id = uuid.uuid4().hex[:12] |
| 903 | - batch_log_file = LOG_DIR / f"batch_{batch_num:04d}_{timestamp}_{batch_call_id}.json" | |
| 1250 | + batch_log_file = ( | |
| 1251 | + LOG_DIR | |
| 1252 | + / f"batch_{analysis_kind}_{batch_num:04d}_{timestamp}_{batch_call_id}.json" | |
| 1253 | + ) | |
| 904 | 1254 | with open(batch_log_file, "w", encoding="utf-8") as f: |
| 905 | 1255 | json.dump(batch_log, f, ensure_ascii=False, indent=2) |
| 906 | 1256 | |
| ... | ... | @@ -912,7 +1262,7 @@ def process_batch( |
| 912 | 1262 | logger.error(f"Error processing batch {batch_num}: {str(e)}", exc_info=True) |
| 913 | 1263 | # Return empty results, preserving the ID mapping |
| 914 | 1264 | return [ |
| 915 | - _make_empty_analysis_result(item, target_lang, error=str(e)) | |
| 1265 | + _make_empty_analysis_result(item, target_lang, schema, error=str(e)) | |
| 916 | 1266 | for item in batch_data |
| 917 | 1267 | ] |
| 918 | 1268 | |
| ... | ... | @@ -922,6 +1272,8 @@ def analyze_products( |
| 922 | 1272 | target_lang: str = "zh", |
| 923 | 1273 | batch_size: Optional[int] = None, |
| 924 | 1274 | tenant_id: Optional[str] = None, |
| 1275 | + analysis_kind: str = "content", | |
| 1276 | + category_taxonomy_profile: Optional[str] = None, | |
| 925 | 1277 | ) -> List[Dict[str, Any]]: |
| 926 | 1278 | """ |
| 927 | 1279 | Library entry point: returns anchor text and per-dimension info for the given inputs and language. |
| ... | ... | @@ -937,6 +1289,10 @@ def analyze_products( |
| 937 | 1289 | if not products: |
| 938 | 1290 | return [] |
| 939 | 1291 | |
| 1292 | + _get_analysis_schema( | |
| 1293 | + analysis_kind, | |
| 1294 | + category_taxonomy_profile=category_taxonomy_profile, | |
| 1295 | + ) | |
| 940 | 1296 | results_by_index: List[Optional[Dict[str, Any]]] = [None] * len(products) |
| 941 | 1297 | uncached_items: List[Tuple[int, Dict[str, str]]] = [] |
| 942 | 1298 | |
| ... | ... | @@ -946,11 +1302,16 @@ def analyze_products( |
| 946 | 1302 | uncached_items.append((idx, product)) |
| 947 | 1303 | continue |
| 948 | 1304 | |
| 949 | - cached = _get_cached_anchor_result(product, target_lang) | |
| 1305 | + cached = _get_cached_analysis_result( | |
| 1306 | + product, | |
| 1307 | + target_lang, | |
| 1308 | + analysis_kind, | |
| 1309 | + category_taxonomy_profile=category_taxonomy_profile, | |
| 1310 | + ) | |
| 950 | 1311 | if cached: |
| 951 | 1312 | logger.info( |
| 952 | 1313 | f"[analyze_products] Cache hit for title='{title[:50]}...', " |
| 953 | - f"lang={target_lang}" | |
| 1314 | + f"kind={analysis_kind}, lang={target_lang}" | |
| 954 | 1315 | ) |
| 955 | 1316 | results_by_index[idx] = cached |
| 956 | 1317 | continue |
| ... | ... | @@ -979,9 +1340,15 @@ def analyze_products( |
| 979 | 1340 | for batch_num, batch_slice, batch in batch_jobs: |
| 980 | 1341 | logger.info( |
| 981 | 1342 | f"[analyze_products] Processing batch {batch_num}/{total_batches}, " |
| 982 | - f"size={len(batch)}, target_lang={target_lang}" | |
| 1343 | + f"size={len(batch)}, kind={analysis_kind}, target_lang={target_lang}" | |
| 1344 | + ) | |
| 1345 | + batch_results = process_batch( | |
| 1346 | + batch, | |
| 1347 | + batch_num=batch_num, | |
| 1348 | + target_lang=target_lang, | |
| 1349 | + analysis_kind=analysis_kind, | |
| 1350 | + category_taxonomy_profile=category_taxonomy_profile, | |
| 983 | 1351 | ) |
| 984 | - batch_results = process_batch(batch, batch_num=batch_num, target_lang=target_lang) | |
| 985 | 1352 | |
| 986 | 1353 | for (original_idx, product), item in zip(batch_slice, batch_results): |
| 987 | 1354 | results_by_index[original_idx] = item |
| ... | ... | @@ -992,7 +1359,13 @@ def analyze_products( |
| 992 | 1359 | # Do not cache error results, to avoid amplifying transient failures |
| 993 | 1360 | continue |
| 994 | 1361 | try: |
| 995 | - _set_cached_anchor_result(product, target_lang, item) | |
| 1362 | + _set_cached_analysis_result( | |
| 1363 | + product, | |
| 1364 | + target_lang, | |
| 1365 | + item, | |
| 1366 | + analysis_kind, | |
| 1367 | + category_taxonomy_profile=category_taxonomy_profile, | |
| 1368 | + ) | |
| 996 | 1369 | except Exception: |
| 997 | 1370 | # A warning was already logged internally |
| 998 | 1371 | pass |
| ... | ... | @@ -1000,10 +1373,11 @@ def analyze_products( |
| 1000 | 1373 | max_workers = min(CONTENT_UNDERSTANDING_MAX_WORKERS, len(batch_jobs)) |
| 1001 | 1374 | logger.info( |
| 1002 | 1375 | "[analyze_products] Using ThreadPoolExecutor for uncached batches: " |
| 1003 | - "max_workers=%s, total_batches=%s, bs=%s, target_lang=%s", | |
| 1376 | + "max_workers=%s, total_batches=%s, bs=%s, kind=%s, target_lang=%s", | |
| 1004 | 1377 | max_workers, |
| 1005 | 1378 | total_batches, |
| 1006 | 1379 | bs, |
| 1380 | + analysis_kind, | |
| 1007 | 1381 | target_lang, |
| 1008 | 1382 | ) |
| 1009 | 1383 | |
| ... | ... | @@ -1013,7 +1387,12 @@ def analyze_products( |
| 1013 | 1387 | future_by_batch_num: Dict[int, Any] = {} |
| 1014 | 1388 | for batch_num, _batch_slice, batch in batch_jobs: |
| 1015 | 1389 | future_by_batch_num[batch_num] = executor.submit( |
| 1016 | - process_batch, batch, batch_num=batch_num, target_lang=target_lang | |
| 1390 | + process_batch, | |
| 1391 | + batch, | |
| 1392 | + batch_num=batch_num, | |
| 1393 | + target_lang=target_lang, | |
| 1394 | + analysis_kind=analysis_kind, | |
| 1395 | + category_taxonomy_profile=category_taxonomy_profile, | |
| 1017 | 1396 | ) |
| 1018 | 1397 | |
| 1019 | 1398 | # Backfill by batch_num to keep output stable (results_by_index maps by original input index) |
| ... | ... | @@ -1028,7 +1407,13 @@ def analyze_products( |
| 1028 | 1407 | # Do not cache error results, to avoid amplifying transient failures |
| 1029 | 1408 | continue |
| 1030 | 1409 | try: |
| 1031 | - _set_cached_anchor_result(product, target_lang, item) | |
| 1410 | + _set_cached_analysis_result( | |
| 1411 | + product, | |
| 1412 | + target_lang, | |
| 1413 | + item, | |
| 1414 | + analysis_kind, | |
| 1415 | + category_taxonomy_profile=category_taxonomy_profile, | |
| 1416 | + ) | |
| 1032 | 1417 | except Exception: |
| 1033 | 1418 | # A warning was already logged internally |
| 1034 | 1419 | pass |
indexer/product_enrich_prompts.py
| 1 | 1 | #!/usr/bin/env python3 |
| 2 | 2 | |
| 3 | -from typing import Any, Dict | |
| 3 | +from typing import Any, Dict, Optional, Tuple | |
| 4 | 4 | |
| 5 | 5 | SYSTEM_MESSAGE = ( |
| 6 | 6 | "You are an e-commerce product annotator. " |
| ... | ... | @@ -33,6 +33,337 @@ Input product list: |
| 33 | 33 | USER_INSTRUCTION_TEMPLATE = """Please strictly return a Markdown table following the given columns in the specified language. For any column containing multiple values, separate them with commas. Do not add any other explanation. |
| 34 | 34 | Language: {language}""" |
| 35 | 35 | |
| 36 | +def _taxonomy_field( | |
| 37 | + key: str, | |
| 38 | + label: str, | |
| 39 | + description: str, | |
| 40 | + zh_label: Optional[str] = None, | |
| 41 | +) -> Dict[str, str]: | |
| 42 | + return { | |
| 43 | + "key": key, | |
| 44 | + "label": label, | |
| 45 | + "description": description, | |
| 46 | + "zh_label": zh_label or label, | |
| 47 | + } | |
| 48 | + | |
| 49 | + | |
| 50 | +def _build_taxonomy_shared_instruction(profile_label: str, fields: Tuple[Dict[str, str], ...]) -> str: | |
| 51 | + lines = [ | |
| 52 | + f"Analyze each input product text and fill the columns below using a {profile_label} attribute taxonomy.", | |
| 53 | + "", | |
| 54 | + "Output columns:", | |
| 55 | + ] | |
| 56 | + for idx, field in enumerate(fields, start=1): | |
| 57 | + lines.append(f"{idx}. {field['label']}: {field['description']}") | |
| 58 | + lines.extend( | |
| 59 | + [ | |
| 60 | + "", | |
| 61 | + "Rules:", | |
| 62 | + "- Keep the same row order and row count as input.", | |
| 63 | + "- Leave blank if not applicable, unmentioned, or unsupported.", | |
| 64 | + "- Use concise, standardized ecommerce wording.", | |
| 65 | + "- If multiple values, separate with commas.", | |
| 66 | + "", | |
| 67 | + "Input product list:", | |
| 68 | + ] | |
| 69 | + ) | |
| 70 | + return "\n".join(lines) | |
| 71 | + | |
| 72 | + | |
| 73 | +def _make_taxonomy_profile( | |
| 74 | + profile_label: str, | |
| 75 | + fields: Tuple[Dict[str, str], ...], | |
| 76 | +) -> Dict[str, Any]: | |
| 77 | + headers = { | |
| 78 | + "en": ["No.", *[field["label"] for field in fields]], | |
| 79 | + "zh": ["序号", *[field["zh_label"] for field in fields]], | |
| 80 | + } | |
| 81 | + return { | |
| 82 | + "profile_label": profile_label, | |
| 83 | + "fields": fields, | |
| 84 | + "shared_instruction": _build_taxonomy_shared_instruction(profile_label, fields), | |
| 85 | + "markdown_table_headers": headers, | |
| 86 | + } | |
| 87 | + | |
| 88 | + | |
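The two helpers above can be exercised in isolation. A minimal, standalone sketch (not an import of this module) of how `_make_taxonomy_profile` derives the bilingual header rows from field definitions; the two fields below are illustrative:

```python
# Standalone sketch: field dicts mirror the _taxonomy_field shape.
def make_field(key, label, description, zh_label=None):
    return {"key": key, "label": label, "description": description,
            "zh_label": zh_label or label}

fields = (
    make_field("product_type", "Product Type", "concise category label", "品类"),
    make_field("color", "Color", "specific color name when available", "主颜色"),
)

# Header rows prepend the index column, then use English labels for "en"
# and zh_label values for "zh", as _make_taxonomy_profile does.
headers = {
    "en": ["No.", *[f["label"] for f in fields]],
    "zh": ["序号", *[f["zh_label"] for f in fields]],
}

print(headers["en"])  # ['No.', 'Product Type', 'Color']
print(headers["zh"])  # ['序号', '品类', '主颜色']
```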
| 89 | +APPAREL_TAXONOMY_FIELDS = ( | |
| 90 | + _taxonomy_field("product_type", "Product Type", "concise ecommerce apparel category label, not a full marketing title", "品类"), | |
| 91 | + _taxonomy_field("target_gender", "Target Gender", "intended gender only if clearly implied", "目标性别"), | |
| 92 | + _taxonomy_field("age_group", "Age Group", "only if clearly implied, e.g. adults, kids, teens, toddlers, babies", "年龄段"), | |
| 93 | + _taxonomy_field("season", "Season", "season(s) or all-season suitability only if supported", "适用季节"), | |
| 94 | + _taxonomy_field("fit", "Fit", "body closeness, e.g. slim, regular, relaxed, oversized, fitted", "版型"), | |
| 95 | + _taxonomy_field("silhouette", "Silhouette", "overall garment shape, e.g. straight, A-line, boxy, tapered, bodycon, wide-leg", "廓形"), | |
| 96 | + _taxonomy_field("neckline", "Neckline", "neckline type when applicable, e.g. crew neck, V-neck, hooded, collared, square neck", "领型"), | |
| 97 | + _taxonomy_field("sleeve_length_type", "Sleeve Length Type", "sleeve length only, e.g. sleeveless, short sleeve, long sleeve, three-quarter sleeve", "袖长类型"), | |
| 98 | + _taxonomy_field("sleeve_style", "Sleeve Style", "sleeve design only, e.g. puff sleeve, raglan sleeve, batwing sleeve, bell sleeve", "袖型"), | |
| 99 | + _taxonomy_field("strap_type", "Strap Type", "strap design when applicable, e.g. spaghetti strap, wide strap, halter strap, adjustable strap", "肩带设计"), | |
| 100 | + _taxonomy_field("rise_waistline", "Rise / Waistline", "waist placement when applicable, e.g. high rise, mid rise, low rise, empire waist", "腰型"), | |
| 101 | + _taxonomy_field("leg_shape", "Leg Shape", "for bottoms only, e.g. straight leg, wide leg, flare leg, tapered leg, skinny leg", "裤型"), | |
| 102 | + _taxonomy_field("skirt_shape", "Skirt Shape", "for skirts only, e.g. A-line, pleated, pencil, mermaid", "裙型"), | |
| 103 | + _taxonomy_field("length_type", "Length Type", "design length only, not size, e.g. cropped, regular, longline, mini, midi, maxi, ankle length, full length", "长度类型"), | |
| 104 | + _taxonomy_field("closure_type", "Closure Type", "fastening method when applicable, e.g. zipper, button, drawstring, elastic waist, hook-and-loop", "闭合方式"), | |
| 105 | + _taxonomy_field("design_details", "Design Details", "construction or visual details, e.g. ruched, ruffled, pleated, cut-out, layered, distressed, split hem", "设计细节"), | |
| 106 | + _taxonomy_field("fabric", "Fabric", "fabric type only, e.g. denim, knit, chiffon, jersey, fleece, cotton twill", "面料"), | |
| 107 | + _taxonomy_field("material_composition", "Material Composition", "fiber content or blend only if stated, e.g. cotton, polyester, spandex, linen blend, 95% cotton 5% elastane", "成分"), | |
| 108 | + _taxonomy_field("fabric_properties", "Fabric Properties", "inherent fabric traits, e.g. stretch, breathable, lightweight, soft-touch, water-resistant", "面料特性"), | |
| 109 | + _taxonomy_field("clothing_features", "Clothing Features", "product features, e.g. lined, reversible, hooded, packable, padded, pocketed", "服装特征"), | |
| 110 | + _taxonomy_field("functional_benefits", "Functional Benefits", "wearer benefits, e.g. moisture-wicking, thermal insulation, UV protection, easy care, supportive compression", "功能"), | |
| 111 | + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"), | |
| 112 | + _taxonomy_field("color_family", "Color Family", "normalized broad retail color group, e.g. black, white, blue, green, red, pink, beige, brown, gray", "色系"), | |
| 113 | + _taxonomy_field("print_pattern", "Print / Pattern", "surface pattern when applicable, e.g. solid, striped, plaid, floral, graphic, animal print", "印花 / 图案"), | |
| 114 | + _taxonomy_field("occasion_end_use", "Occasion / End Use", "likely use occasion only if supported, e.g. office, casual wear, streetwear, lounge, workout, outdoor", "适用场景"), | |
| 115 | + _taxonomy_field("style_aesthetic", "Style Aesthetic", "overall style only if supported, e.g. minimalist, streetwear, athleisure, smart casual, romantic, playful", "风格"), | |
| 116 | +) | |
| 117 | + | |
| 118 | +THREE_C_TAXONOMY_FIELDS = ( | |
| 119 | + _taxonomy_field("product_type", "Product Type", "concise 3C accessory or peripheral category label", "品类"), | |
| 120 | + _taxonomy_field("compatible_device", "Compatible Device / Model", "supported device family, series, model, or form factor when clearly stated", "适配设备 / 型号"), | |
| 121 | + _taxonomy_field("connectivity", "Connectivity", "connection method such as wired, wireless, Bluetooth, Wi-Fi, NFC, or 2.4G", "连接方式"), | |
| 122 | + _taxonomy_field("interface_port_type", "Interface / Port Type", "relevant connector or port, e.g. USB-C, Lightning, HDMI, AUX, RJ45", "接口 / 端口类型"), | |
| 123 | + _taxonomy_field("power_charging", "Power Source / Charging", "charging or power mode, e.g. battery powered, fast charging, rechargeable, plug-in", "供电 / 充电方式"), | |
| 124 | + _taxonomy_field("key_features", "Key Features", "primary hardware features such as noise cancelling, foldable, magnetic, backlit, waterproof", "关键特征"), | |
| 125 | + _taxonomy_field("material_finish", "Material / Finish", "main material or exterior finish when supported", "材质 / 表面处理"), | |
| 126 | + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"), | |
| 127 | + _taxonomy_field("pack_size", "Pack Size", "unit count or bundle size when stated", "包装规格"), | |
| 128 | + _taxonomy_field("use_case", "Use Case", "intended usage such as travel, office, gaming, car, charging, streaming", "使用场景"), | |
| 129 | +) | |
| 130 | + | |
| 131 | +BAGS_TAXONOMY_FIELDS = ( | |
| 132 | + _taxonomy_field("product_type", "Product Type", "concise bag category such as backpack, tote bag, crossbody bag, luggage, or wallet", "品类"), | |
| 133 | + _taxonomy_field("target_gender", "Target Gender", "intended gender only if clearly implied", "目标性别"), | |
| 134 | + _taxonomy_field("carry_style", "Carry Style", "how the bag is worn or carried, e.g. handheld, shoulder, crossbody, backpack", "携带方式"), | |
| 135 | + _taxonomy_field("size_capacity", "Size / Capacity", "size tier or capacity when supported, e.g. mini, large capacity, 20L", "尺寸 / 容量"), | |
| 136 | + _taxonomy_field("material", "Material", "main bag material such as leather, nylon, canvas, PU, straw", "材质"), | |
| 137 | + _taxonomy_field("closure_type", "Closure Type", "bag closure such as zipper, flap, buckle, drawstring, magnetic snap", "闭合方式"), | |
| 138 | + _taxonomy_field("structure_compartments", "Structure / Compartments", "organizational structure such as multi-pocket, laptop sleeve, card slots, expandable", "结构 / 分层"), | |
| 139 | + _taxonomy_field("strap_handle_type", "Strap / Handle Type", "strap or handle design such as chain strap, top handle, adjustable strap", "肩带 / 提手类型"), | |
| 140 | + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"), | |
| 141 | + _taxonomy_field("occasion_end_use", "Occasion / End Use", "likely use such as commute, travel, evening, school, casual", "适用场景"), | |
| 142 | +) | |
| 143 | + | |
| 144 | +PET_SUPPLIES_TAXONOMY_FIELDS = ( | |
| 145 | + _taxonomy_field("product_type", "Product Type", "concise pet supplies category label", "品类"), | |
| 146 | + _taxonomy_field("pet_type", "Pet Type", "target pet such as dog, cat, bird, fish, hamster", "宠物类型"), | |
| 147 | + _taxonomy_field("breed_size", "Breed Size", "pet size or breed size when stated, e.g. small breed, large dogs", "体型 / 品种大小"), | |
| 148 | + _taxonomy_field("life_stage", "Life Stage", "pet age stage when supported, e.g. puppy, kitten, adult, senior", "成长阶段"), | |
| 149 | + _taxonomy_field("material_ingredients", "Material / Ingredients", "main material or ingredient composition when supported", "材质 / 成分"), | |
| 150 | + _taxonomy_field("flavor_scent", "Flavor / Scent", "flavor or scent when applicable", "口味 / 气味"), | |
| 151 | + _taxonomy_field("key_features", "Key Features", "primary attributes such as interactive, leak-proof, orthopedic, washable, elevated", "关键特征"), | |
| 152 | + _taxonomy_field("functional_benefits", "Functional Benefits", "benefits such as dental care, calming, digestion support, joint support", "功能"), | |
| 153 | + _taxonomy_field("size_capacity", "Size / Capacity", "size, count, or net content when stated", "尺寸 / 容量"), | |
| 154 | + _taxonomy_field("use_scenario", "Use Scenario", "usage such as feeding, training, grooming, travel, indoor play", "使用场景"), | |
| 155 | +) | |
| 156 | + | |
| 157 | +ELECTRONICS_TAXONOMY_FIELDS = ( | |
| 158 | + _taxonomy_field("product_type", "Product Type", "concise electronics device or component category label", "品类"), | |
| 159 | + _taxonomy_field("device_category", "Device Category / Compatibility", "supported platform, component class, or compatible device family when stated", "设备类别 / 兼容性"), | |
| 160 | + _taxonomy_field("power_voltage", "Power / Voltage", "power, voltage, wattage, or battery spec when supported", "功率 / 电压"), | |
| 161 | + _taxonomy_field("connectivity", "Connectivity", "connection method such as wired, Bluetooth, Wi-Fi, RF, or smart app control", "连接方式"), | |
| 162 | + _taxonomy_field("interface_port_type", "Interface / Port Type", "relevant port or interface such as USB-C, AC plug type, HDMI, SATA", "接口 / 端口类型"), | |
| 163 | + _taxonomy_field("capacity_storage", "Capacity / Storage", "capacity or storage spec such as 256GB, 2TB, 5000mAh", "容量 / 存储"), | |
| 164 | + _taxonomy_field("key_features", "Key Features", "main product features such as touch control, HD display, noise reduction, smart control", "关键特征"), | |
| 165 | + _taxonomy_field("material_finish", "Material / Finish", "main housing material or finish when supported", "材质 / 表面处理"), | |
| 166 | + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"), | |
| 167 | + _taxonomy_field("use_case", "Use Case", "intended use such as home entertainment, office, charging, security, repair", "使用场景"), | |
| 168 | +) | |
| 169 | + | |
| 170 | +OUTDOOR_TAXONOMY_FIELDS = ( | |
| 171 | + _taxonomy_field("product_type", "Product Type", "concise outdoor gear category label", "品类"), | |
| 172 | + _taxonomy_field("activity_type", "Activity Type", "primary outdoor activity such as camping, hiking, fishing, climbing, travel", "活动类型"), | |
| 173 | + _taxonomy_field("season_weather", "Season / Weather", "season or weather suitability when supported", "适用季节 / 天气"), | |
| 174 | + _taxonomy_field("material", "Material", "main material such as aluminum, ripstop nylon, stainless steel, EVA", "材质"), | |
| 175 | + _taxonomy_field("capacity_size", "Capacity / Size", "size, length, or capacity when stated", "容量 / 尺寸"), | |
| 176 | + _taxonomy_field("protection_resistance", "Protection / Resistance", "resistance or protection such as waterproof, UV resistant, windproof", "防护 / 耐受性"), | |
| 177 | + _taxonomy_field("key_features", "Key Features", "primary gear attributes such as foldable, lightweight, insulated, non-slip", "关键特征"), | |
| 178 | + _taxonomy_field("portability_packability", "Portability / Packability", "carry or storage trait such as collapsible, compact, ultralight, packable", "便携 / 收纳性"), | |
| 179 | + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"), | |
| 180 | + _taxonomy_field("use_scenario", "Use Scenario", "likely use setting such as campsite, trail, survival kit, beach, picnic", "使用场景"), | |
| 181 | +) | |
| 182 | + | |
| 183 | +HOME_APPLIANCES_TAXONOMY_FIELDS = ( | |
| 184 | + _taxonomy_field("product_type", "Product Type", "concise home appliance category label", "品类"), | |
| 185 | + _taxonomy_field("appliance_category", "Appliance Category", "functional class such as kitchen appliance, cleaning appliance, personal care appliance", "家电类别"), | |
| 186 | + _taxonomy_field("power_voltage", "Power / Voltage", "wattage, voltage, plug type, or power supply when supported", "功率 / 电压"), | |
| 187 | + _taxonomy_field("capacity_coverage", "Capacity / Coverage", "capacity or coverage metric such as 1.5L, 20L, 40sqm", "容量 / 覆盖范围"), | |
| 188 | + _taxonomy_field("control_method", "Control Method", "operation method such as touch, knob, remote, app control", "控制方式"), | |
| 189 | + _taxonomy_field("installation_type", "Installation Type", "setup style such as countertop, handheld, portable, wall-mounted, built-in", "安装方式"), | |
| 190 | + _taxonomy_field("key_features", "Key Features", "main product features such as timer, steam, HEPA filter, self-cleaning", "关键特征"), | |
| 191 | + _taxonomy_field("material_finish", "Material / Finish", "main material or exterior finish when supported", "材质 / 表面处理"), | |
| 192 | + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"), | |
| 193 | + _taxonomy_field("use_scenario", "Use Scenario", "intended use such as cooking, cleaning, grooming, cooling, air treatment", "使用场景"), | |
| 194 | +) | |
| 195 | + | |
| 196 | +HOME_LIVING_TAXONOMY_FIELDS = ( | |
| 197 | + _taxonomy_field("product_type", "Product Type", "concise home and living category label", "品类"), | |
| 198 | + _taxonomy_field("room_placement", "Room / Placement", "intended room or placement such as bedroom, kitchen, bathroom, desktop", "适用空间 / 摆放位置"), | |
| 199 | + _taxonomy_field("material", "Material", "main material such as wood, ceramic, cotton, glass, metal", "材质"), | |
| 200 | + _taxonomy_field("style", "Style", "home style such as modern, farmhouse, minimalist, boho, Nordic", "风格"), | |
| 201 | + _taxonomy_field("size_dimensions", "Size / Dimensions", "size or dimensions when stated", "尺寸 / 规格"), | |
| 202 | + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"), | |
| 203 | + _taxonomy_field("pattern_finish", "Pattern / Finish", "surface pattern or finish such as solid, marble, matte, ribbed", "图案 / 表面处理"), | |
| 204 | + _taxonomy_field("key_features", "Key Features", "main product features such as stackable, washable, blackout, space-saving", "关键特征"), | |
| 205 | + _taxonomy_field("assembly_installation", "Assembly / Installation", "assembly or installation trait when supported", "组装 / 安装"), | |
| 206 | + _taxonomy_field("use_scenario", "Use Scenario", "intended use such as storage, dining, decor, sleep, organization", "使用场景"), | |
| 207 | +) | |
| 208 | + | |
| 209 | +WIGS_TAXONOMY_FIELDS = ( | |
| 210 | + _taxonomy_field("product_type", "Product Type", "concise wig or hairpiece category label", "品类"), | |
| 211 | + _taxonomy_field("hair_material", "Hair Material", "hair material such as human hair, synthetic fiber, heat-resistant fiber", "发丝材质"), | |
| 212 | + _taxonomy_field("hair_texture", "Hair Texture", "texture or curl pattern such as straight, body wave, curly, kinky", "发质纹理"), | |
| 213 | + _taxonomy_field("hair_length", "Hair Length", "hair length when stated", "发长"), | |
| 214 | + _taxonomy_field("hair_color", "Hair Color", "specific hair color or blend when available", "发色"), | |
| 215 | + _taxonomy_field("cap_construction", "Cap Construction", "cap type such as full lace, lace front, glueless, U part", "帽网结构"), | |
| 216 | + _taxonomy_field("lace_area_part_type", "Lace Area / Part Type", "lace size or part style such as 13x4 lace, middle part, T part", "蕾丝面积 / 分缝类型"), | |
| 217 | + _taxonomy_field("density_volume", "Density / Volume", "hair density or fullness when supported", "密度 / 发量"), | |
| 218 | + _taxonomy_field("style_bang_type", "Style / Bang Type", "style cue such as bob, pixie, layered, with bangs", "款式 / 刘海类型"), | |
| 219 | + _taxonomy_field("occasion_end_use", "Occasion / End Use", "intended use such as daily wear, cosplay, protective style, party", "适用场景"), | |
| 220 | +) | |
| 221 | + | |
| 222 | +BEAUTY_TAXONOMY_FIELDS = ( | |
| 223 | + _taxonomy_field("product_type", "Product Type", "concise beauty or cosmetics category label", "品类"), | |
| 224 | + _taxonomy_field("target_area", "Target Area", "target area such as face, lips, eyes, nails, hair, body", "适用部位"), | |
| 225 | + _taxonomy_field("skin_hair_type", "Skin Type / Hair Type", "suitable skin or hair type when supported", "肤质 / 发质"), | |
| 226 | + _taxonomy_field("finish_effect", "Finish / Effect", "cosmetic finish or effect such as matte, dewy, volumizing, brightening", "妆效 / 效果"), | |
| 227 | + _taxonomy_field("key_ingredients", "Key Ingredients", "notable ingredients when stated", "关键成分"), | |
| 228 | + _taxonomy_field("shade_color", "Shade / Color", "specific shade or color when available", "色号 / 颜色"), | |
| 229 | + _taxonomy_field("scent", "Scent", "fragrance or scent only when supported", "香味"), | |
| 230 | + _taxonomy_field("formulation", "Formulation", "product form such as cream, serum, powder, gel, stick", "剂型 / 形态"), | |
| 231 | + _taxonomy_field("functional_benefits", "Functional Benefits", "benefits such as hydration, anti-aging, long-wear, repair, sun protection", "功能"), | |
| 232 | + _taxonomy_field("use_scenario", "Use Scenario", "intended use such as daily routine, salon, travel, evening makeup", "使用场景"), | |
| 233 | +) | |
| 234 | + | |
| 235 | +ACCESSORIES_TAXONOMY_FIELDS = ( | |
| 236 | + _taxonomy_field("product_type", "Product Type", "concise accessory category label such as necklace, watch, belt, hat, or sunglasses", "品类"), | |
| 237 | + _taxonomy_field("target_gender", "Target Gender", "intended gender only if clearly implied", "目标性别"), | |
| 238 | + _taxonomy_field("material", "Material", "main material such as alloy, leather, stainless steel, acetate, fabric", "材质"), | |
| 239 | + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"), | |
| 240 | + _taxonomy_field("pattern_finish", "Pattern / Finish", "surface treatment or style finish such as polished, textured, braided, rhinestone", "图案 / 表面处理"), | |
| 241 | + _taxonomy_field("closure_fastening", "Closure / Fastening", "fastening method when applicable", "闭合 / 固定方式"), | |
| 242 | + _taxonomy_field("size_fit", "Size / Fit", "size or fit information such as adjustable, one size, 42mm", "尺寸 / 适配"), | |
| 243 | + _taxonomy_field("style", "Style", "style cue such as minimalist, vintage, statement, sporty", "风格"), | |
| 244 | + _taxonomy_field("occasion_end_use", "Occasion / End Use", "likely use such as daily wear, formal, party, travel, sun protection", "适用场景"), | |
| 245 | + _taxonomy_field("set_pack_size", "Set / Pack Size", "set count or pack size when stated", "套装 / 规格"), | |
| 246 | +) | |
| 247 | + | |
| 248 | +TOYS_TAXONOMY_FIELDS = ( | |
| 249 | + _taxonomy_field("product_type", "Product Type", "concise toy category label", "品类"), | |
| 250 | + _taxonomy_field("age_group", "Age Group", "intended age group when clearly implied", "年龄段"), | |
| 251 | + _taxonomy_field("character_theme", "Character / Theme", "licensed character, theme, or play theme when supported", "角色 / 主题"), | |
| 252 | + _taxonomy_field("material", "Material", "main toy material such as plush, plastic, wood, silicone", "材质"), | |
| 253 | + _taxonomy_field("power_source", "Power Source", "battery, rechargeable, wind-up, or non-powered when supported", "供电方式"), | |
| 254 | + _taxonomy_field("interactive_features", "Interactive Features", "interactive functions such as sound, lights, remote control, motion", "互动功能"), | |
| 255 | + _taxonomy_field("educational_play_value", "Educational / Play Value", "play value such as STEM, pretend play, sensory, puzzle solving", "教育 / 可玩性"), | |
| 256 | + _taxonomy_field("piece_count_size", "Piece Count / Size", "piece count or size when stated", "件数 / 尺寸"), | |
| 257 | + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"), | |
| 258 | + _taxonomy_field("use_scenario", "Use Scenario", "intended use such as indoor play, bath time, party favor, outdoor play", "使用场景"), | |
| 259 | +) | |
| 260 | + | |
| 261 | +SHOES_TAXONOMY_FIELDS = ( | |
| 262 | + _taxonomy_field("product_type", "Product Type", "concise footwear category label", "品类"), | |
| 263 | + _taxonomy_field("target_gender", "Target Gender", "intended gender only if clearly implied", "目标性别"), | |
| 264 | + _taxonomy_field("age_group", "Age Group", "only if clearly implied", "年龄段"), | |
| 265 | + _taxonomy_field("closure_type", "Closure Type", "fastening method such as lace-up, slip-on, buckle, hook-and-loop", "闭合方式"), | |
| 266 | + _taxonomy_field("toe_shape", "Toe Shape", "toe shape when applicable, e.g. round toe, pointed toe, open toe", "鞋头形状"), | |
| 267 | + _taxonomy_field("heel_sole_type", "Heel Height / Sole Type", "heel or sole profile such as flat, block heel, wedge, platform, thick sole", "跟高 / 鞋底类型"), | |
| 268 | + _taxonomy_field("upper_material", "Upper Material", "main upper material such as leather, knit, canvas, mesh", "鞋面材质"), | |
| 269 | + _taxonomy_field("lining_insole_material", "Lining / Insole Material", "lining or insole material when supported", "里料 / 鞋垫材质"), | |
| 270 | + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"), | |
| 271 | + _taxonomy_field("occasion_end_use", "Occasion / End Use", "likely use such as running, casual, office, hiking, formal", "适用场景"), | |
| 272 | +) | |
| 273 | + | |
| 274 | +SPORTS_TAXONOMY_FIELDS = ( | |
| 275 | + _taxonomy_field("product_type", "Product Type", "concise sports product category label", "品类"), | |
| 276 | + _taxonomy_field("sport_activity", "Sport / Activity", "primary sport or activity such as fitness, yoga, basketball, cycling, swimming", "运动 / 活动"), | |
| 277 | + _taxonomy_field("skill_level", "Skill Level", "target user level when supported, e.g. beginner, training, professional", "适用水平"), | |
| 278 | + _taxonomy_field("material", "Material", "main material such as EVA, carbon fiber, neoprene, latex", "材质"), | |
| 279 | + _taxonomy_field("size_capacity", "Size / Capacity", "size, weight, resistance level, or capacity when stated", "尺寸 / 容量"), | |
| 280 | + _taxonomy_field("protection_support", "Protection / Support", "support or protection function such as ankle support, shock absorption, impact protection", "防护 / 支撑"), | |
| 281 | + _taxonomy_field("key_features", "Key Features", "main features such as anti-slip, adjustable, foldable, quick-dry", "关键特征"), | |
| 282 | + _taxonomy_field("power_source", "Power Source", "battery, electric, or non-powered when applicable", "供电方式"), | |
| 283 | + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"), | |
| 284 | + _taxonomy_field("use_scenario", "Use Scenario", "intended use such as gym, home workout, field training, competition", "使用场景"), | |
| 285 | +) | |
| 286 | + | |
| 287 | +OTHERS_TAXONOMY_FIELDS = ( | |
| 288 | + _taxonomy_field("product_type", "Product Type", "concise product category label, not a full marketing title", "品类"), | |
| 289 | + _taxonomy_field("product_category", "Product Category", "broader retail grouping when the specific product type is narrow", "商品类别"), | |
| 290 | + _taxonomy_field("target_user", "Target User", "intended user, audience, or recipient when clearly implied", "适用人群"), | |
| 291 | + _taxonomy_field("material_ingredients", "Material / Ingredients", "main material or ingredients when supported", "材质 / 成分"), | |
| 292 | + _taxonomy_field("key_features", "Key Features", "primary product attributes or standout features", "关键特征"), | |
| 293 | + _taxonomy_field("functional_benefits", "Functional Benefits", "practical benefits or performance advantages when supported", "功能"), | |
| 294 | + _taxonomy_field("size_capacity", "Size / Capacity", "size, count, weight, or capacity when stated", "尺寸 / 容量"), | |
| 295 | + _taxonomy_field("color", "Color", "specific color name when available", "主颜色"), | |
| 296 | + _taxonomy_field("style_theme", "Style / Theme", "overall style, design theme, or visual direction when supported", "风格 / 主题"), | |
| 297 | + _taxonomy_field("use_scenario", "Use Scenario", "likely use occasion or application setting when supported", "使用场景"), | |
| 298 | +) | |
| 299 | + | |
| 300 | +CATEGORY_TAXONOMY_PROFILES: Dict[str, Dict[str, Any]] = { | |
| 301 | + "apparel": _make_taxonomy_profile( | |
| 302 | + "apparel", | |
| 303 | + APPAREL_TAXONOMY_FIELDS, | |
| 304 | + ), | |
| 305 | + "3c": _make_taxonomy_profile( | |
| 306 | + "3C", | |
| 307 | + THREE_C_TAXONOMY_FIELDS, | |
| 308 | + ), | |
| 309 | + "bags": _make_taxonomy_profile( | |
| 310 | + "bags", | |
| 311 | + BAGS_TAXONOMY_FIELDS, | |
| 312 | + ), | |
| 313 | + "pet_supplies": _make_taxonomy_profile( | |
| 314 | + "pet supplies", | |
| 315 | + PET_SUPPLIES_TAXONOMY_FIELDS, | |
| 316 | + ), | |
| 317 | + "electronics": _make_taxonomy_profile( | |
| 318 | + "electronics", | |
| 319 | + ELECTRONICS_TAXONOMY_FIELDS, | |
| 320 | + ), | |
| 321 | + "outdoor": _make_taxonomy_profile( | |
| 322 | + "outdoor products", | |
| 323 | + OUTDOOR_TAXONOMY_FIELDS, | |
| 324 | + ), | |
| 325 | + "home_appliances": _make_taxonomy_profile( | |
| 326 | + "home appliances", | |
| 327 | + HOME_APPLIANCES_TAXONOMY_FIELDS, | |
| 328 | + ), | |
| 329 | + "home_living": _make_taxonomy_profile( | |
| 330 | + "home and living", | |
| 331 | + HOME_LIVING_TAXONOMY_FIELDS, | |
| 332 | + ), | |
| 333 | + "wigs": _make_taxonomy_profile( | |
| 334 | + "wigs", | |
| 335 | + WIGS_TAXONOMY_FIELDS, | |
| 336 | + ), | |
| 337 | + "beauty": _make_taxonomy_profile( | |
| 338 | + "beauty and cosmetics", | |
| 339 | + BEAUTY_TAXONOMY_FIELDS, | |
| 340 | + ), | |
| 341 | + "accessories": _make_taxonomy_profile( | |
| 342 | + "accessories", | |
| 343 | + ACCESSORIES_TAXONOMY_FIELDS, | |
| 344 | + ), | |
| 345 | + "toys": _make_taxonomy_profile( | |
| 346 | + "toys", | |
| 347 | + TOYS_TAXONOMY_FIELDS, | |
| 348 | + ), | |
| 349 | + "shoes": _make_taxonomy_profile( | |
| 350 | + "shoes", | |
| 351 | + SHOES_TAXONOMY_FIELDS, | |
| 352 | + ), | |
| 353 | + "sports": _make_taxonomy_profile( | |
| 354 | + "sports products", | |
| 355 | + SPORTS_TAXONOMY_FIELDS, | |
| 356 | + ), | |
| 357 | + "others": _make_taxonomy_profile( | |
| 358 | + "general merchandise", | |
| 359 | + OTHERS_TAXONOMY_FIELDS, | |
| 360 | + ), | |
| 361 | +} | |
| 362 | + | |
| 363 | +TAXONOMY_SHARED_ANALYSIS_INSTRUCTION = CATEGORY_TAXONOMY_PROFILES["apparel"]["shared_instruction"] | |
| 364 | +TAXONOMY_MARKDOWN_TABLE_HEADERS_EN = CATEGORY_TAXONOMY_PROFILES["apparel"]["markdown_table_headers"]["en"] | |
| 365 | +TAXONOMY_LANGUAGE_MARKDOWN_TABLE_HEADERS: Dict[str, Dict[str, Any]] = CATEGORY_TAXONOMY_PROFILES["apparel"]["markdown_table_headers"] | |
| 366 | + | |
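A hedged sketch of how a caller might consume `CATEGORY_TAXONOMY_PROFILES`: default to `apparel` when no key is given, and fall back to `others` for unknown keys. The stand-in dict and the `resolve_profile` helper are illustrative, not part of the module:

```python
# Stand-in for CATEGORY_TAXONOMY_PROFILES (illustrative subset).
profiles = {
    "apparel": {"profile_label": "apparel"},
    "toys": {"profile_label": "toys"},
    "others": {"profile_label": "general merchandise"},
}

def resolve_profile(key=None):
    # No key: use the current default profile.
    if not key:
        return profiles["apparel"]
    # Unknown keys fall back to the catch-all "others" profile.
    return profiles.get(key.strip().lower(), profiles["others"])

print(resolve_profile("TOYS")["profile_label"])     # toys
print(resolve_profile("unknown")["profile_label"])  # general merchandise
```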
| 36 | 367 | LANGUAGE_MARKDOWN_TABLE_HEADERS: Dict[str, Dict[str, Any]] = { |
| 37 | 368 | "en": [ |
| 38 | 369 | "No.", | ... | ... |
| ... | ... | @@ -0,0 +1,173 @@ |
| 1 | +# Content Enrichment Module Overview | |
| 2 | + | |
| 3 | +This document describes the responsibilities, entry points, and output structure of the product content enrichment module, along with the design constraints of the current taxonomy profiles. | |
| 4 | + | |
| 5 | +## 1. Module Goals | |
| 6 | + | |
| 7 | +The content enrichment module calls an LLM on product text to generate the following index fields: | |
| 8 | + | |
| 9 | +- `qanchors` | |
| 10 | +- `enriched_tags` | |
| 11 | +- `enriched_attributes` | |
| 12 | +- `enriched_taxonomy_attributes` | |
| 13 | + | |
| 14 | +Design principles the module follows: | |
| 15 | + | |
| 16 | +- Single responsibility: handles content understanding and structured output only, not CSV reading/writing | |
| 17 | +- Output aligned with the ES mapping: the returned structure can be written directly into `search_products` | |
| 18 | +- Configuration-driven extension: taxonomy profiles are extended through data configuration rather than scattered conditional branches | |
| 19 | +- Lean code: targets normal usage only, avoiding patch logic stacked up to support unreasonable call patterns | |
| 20 | + | |
| 21 | +## 2. Key Files | |
| 22 | + | |
| 23 | +- [product_enrich.py](/data/saas-search/indexer/product_enrich.py) | |
| 24 | + Main runtime logic: batching, caching, prompt assembly, LLM calls, Markdown parsing, and output shaping | |
| 25 | +- [product_enrich_prompts.py](/data/saas-search/indexer/product_enrich_prompts.py) | |
| 26 | + Prompt templates and taxonomy profile configuration | |
| 27 | +- [document_transformer.py](/data/saas-search/indexer/document_transformer.py) | |
| 28 | + Calls the content enrichment module in the internal index-building pipeline and backfills the results into the ES doc | |
| 29 | +- [taxonomy.md](/data/saas-search/indexer/taxonomy.md) | |
| 30 | + Taxonomy design notes and field inventory | |
| 31 | + | |
| 32 | +## 3. Public Entry Points | |
| 33 | + | |
| 34 | +### 3.1 Python Entry Point | |
| 35 | + | |
| 36 | +Core entry point: | |
| 37 | + | |
| 38 | +```python | |
| 39 | +build_index_content_fields( | |
| 40 | + items, | |
| 41 | + tenant_id=None, | |
| 42 | + enrichment_scopes=None, | |
| 43 | + category_taxonomy_profile=None, | |
| 44 | +) | |
| 45 | +``` | |
| 46 | + | |
| 47 | +Minimum required input: | |
| 48 | + | |
| 49 | +- `id` 或 `spu_id` | |
| 50 | +- `title` | |
| 51 | + | |
| 52 | +Optional input: | |
| 53 | + | |
| 54 | +- `brief` | |
| 55 | +- `description` | |
| 56 | +- `image_url` | |
| 57 | + | |
| 58 | +Key parameters: | |
| 59 | + | |
| 60 | +- `enrichment_scopes` | |
| 61 | + Accepts `generic` and `category_taxonomy` | |
| 62 | +- `category_taxonomy_profile` | |
| 63 | + Taxonomy profile to use; defaults to `apparel` | |
| 64 | + | |
| 65 | +### 3.2 HTTP Entry Point | |
| 66 | + | |
| 67 | +API route: | |
| 68 | + | |
| 69 | +- `POST /indexer/enrich-content` | |
| 70 | + | |
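An illustrative request body for this route, assuming it mirrors the parameters of the Python entry point above; the item values and tenant id are hypothetical:

```json
{
  "items": [
    {"id": "223167", "title": "纯棉短袖T恤", "brief": "basic tee"}
  ],
  "tenant_id": "demo-tenant",
  "enrichment_scopes": ["generic", "category_taxonomy"],
  "category_taxonomy_profile": "apparel"
}
```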
| 71 | +Related documentation: | |
| 72 | + | |
| 73 | +- [搜索API对接指南-05-索引接口(Indexer)](/data/saas-search/docs/搜索API对接指南-05-索引接口(Indexer).md) | |
| 74 | +- [搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation)](/data/saas-search/docs/搜索API对接指南-07-微服务接口(Embedding-Reranker-Translation).md) | |
| 75 | + | |
| 76 | +## 4. Output Structure | |
| 77 | + | |
| 78 | +The result is aligned with the ES mapping: | |
| 79 | + | |
| 80 | +```json | |
| 81 | +{ | |
| 82 | + "id": "223167", | |
| 83 | + "qanchors": { | |
| 84 | + "zh": ["短袖T恤", "纯棉"], | |
| 85 | + "en": ["t-shirt", "cotton"] | |
| 86 | + }, | |
| 87 | + "enriched_tags": { | |
| 88 | + "zh": ["短袖", "纯棉"], | |
| 89 | + "en": ["short sleeve", "cotton"] | |
| 90 | + }, | |
| 91 | + "enriched_attributes": [ | |
| 92 | + { | |
| 93 | + "name": "enriched_tags", | |
| 94 | + "value": { | |
| 95 | + "zh": ["短袖", "纯棉"], | |
| 96 | + "en": ["short sleeve", "cotton"] | |
| 97 | + } | |
| 98 | + } | |
| 99 | + ], | |
| 100 | + "enriched_taxonomy_attributes": [ | |
| 101 | + { | |
| 102 | + "name": "Product Type", | |
| 103 | + "value": { | |
| 104 | + "zh": ["T恤"], | |
| 105 | + "en": ["t-shirt"] | |
| 106 | + } | |
| 107 | + } | |
| 108 | + ] | |
| 109 | +} | |
| 110 | +``` | |
| 111 | + | |
| 112 | +Notes: | |
| 113 | + | |
| 114 | +- The `generic` part always outputs the core index languages `zh` and `en` | |
| 115 | +- The `taxonomy` part likewise always outputs `zh` and `en` | |
| 116 | + | |
| 117 | +## 5. Taxonomy Profiles | |
| 118 | + | |
| 119 | +Currently supported: | |
| 120 | + | |
| 121 | +- `apparel` | |
| 122 | +- `3c` | |
| 123 | +- `bags` | |
| 124 | +- `pet_supplies` | |
| 125 | +- `electronics` | |
| 126 | +- `outdoor` | |
| 127 | +- `home_appliances` | |
| 128 | +- `home_living` | |
| 129 | +- `wigs` | |
| 130 | +- `beauty` | |
| 131 | +- `accessories` | |
| 132 | +- `toys` | |
| 133 | +- `shoes` | |
| 134 | +- `sports` | |
| 135 | +- `others` | |
| 136 | + | |
| 137 | +Shared constraints: | |
| 138 | + | |
| 139 | +- Every profile returns both `zh` and `en` | |
| 140 | +- A profile determines only the taxonomy field set; it no longer determines the output languages | |
| 141 | +- Every profile configures both Chinese and English field names, keeping the prompt/header structure consistent | |
| 142 | + | |
| 143 | +## 6. Current Constraints in the Internal Indexing Pipeline | |
| 144 | + | |
| 145 | +In the internal ES document-building pipeline, `document_transformer` currently passes a fixed taxonomy profile when calling content enrichment: | |
| 146 | + | |
| 147 | +```python | |
| 148 | +category_taxonomy_profile="apparel" | |
| 149 | +``` | |
| 150 | + | |
| 151 | +This is an explicit, controllable temporary strategy that keeps the code clean. | |
| 152 | + | |
| 153 | +A TODO is already left in the code: | |
| 154 | + | |
| 155 | +- Later, read the tenant's actual industry from the database | |
| 156 | +- Then use that industry in place of the fixed `apparel` | |
| 157 | + | |
| 158 | +For now there is no implicit "guess the profile from product category text" logic, which avoids redundant code and unnecessary uncertainty. | |
| 159 | + | |
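The direction of that TODO can be sketched as follows; `resolve_tenant_profile` and the industry-to-profile mapping are hypothetical, since the database lookup is not implemented yet:

```python
# Hypothetical sketch of the planned tenant -> profile resolution.
def resolve_tenant_profile(tenant_industry=None):
    # Illustrative mapping from a tenant's industry to a profile key.
    known = {
        "fashion": "apparel",
        "consumer_electronics": "3c",
        "pets": "pet_supplies",
    }
    # Unknown or missing industry keeps the current fixed default.
    return known.get(tenant_industry or "", "apparel")

print(resolve_tenant_profile(None))    # apparel
print(resolve_tenant_profile("pets"))  # pet_supplies
```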
| 160 | +## 7. Caching and Batching | |
| 161 | + | |
| 162 | +The cache key is derived jointly from: | |
| 163 | + | |
| 164 | +- `analysis_kind` | |
| 165 | +- `target_lang` | |
| 166 | +- Prompt/schema version fingerprint | |
| 167 | +- Actual prompt input text | |
| 168 | + | |
| 169 | +Batching rules: | |
| 170 | + | |
| 171 | +- At most 20 items per LLM call | |
| 172 | +- Callers may pass larger batches; the module splits them internally | |
| 173 | +- Uncached batches can run concurrently | ... | ...
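The cache-key and batch-splitting rules above can be sketched as follows; the hashing scheme and helper names are illustrative, not the module's actual implementation:

```python
import hashlib

# Stand-in for the prompt/schema version fingerprint listed above.
PROMPT_SCHEMA_VERSION = "v1"

def cache_key(analysis_kind, target_lang, prompt_text):
    # Combine all four documented components into one stable digest.
    raw = "|".join([analysis_kind, target_lang, PROMPT_SCHEMA_VERSION, prompt_text])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def split_batches(items, limit=20):
    # At most `limit` items per LLM call; larger inputs are chunked here.
    return [items[i:i + limit] for i in range(0, len(items), limit)]

chunks = split_batches(list(range(45)))
print([len(c) for c in chunks])  # [20, 20, 5]
```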
| ... | ... | @@ -0,0 +1,196 @@ |
| 1 | + | |
| 2 | +# Cross-Border E-commerce Core Categories 大类 | |
| 3 | + | |
| 4 | +## 1. 3C | |
| 5 | +Phone accessories, computer peripherals, smart wearables, audio & video, smart home, gaming gear. 手机配件、电脑周边、智能穿戴、影音娱乐、智能家居、游戏设备。 | |
| 6 | + | |
| 7 | +## 2. Bags 包 | |
| 8 | +Handbags, backpacks, wallets, luggage, crossbody bags, tote bags. 手提包、双肩包、钱包、行李箱、斜挎包、托特包。 | |
| 9 | + | |
| 10 | +## 3. Pet Supplies 宠物用品 | |
| 11 | +Pet food, pet toys, pet care products, pet grooming, pet clothing, smart pet devices. 宠物食品、宠物玩具、宠物护理用品、宠物美容、宠物服装、智能宠物设备。 | |
| 12 | + | |
| 13 | +## 4. Electronics 电子产品 | |
| 14 | +Consumer electronics, home appliances, digital devices, cables & chargers, batteries, electronic components. 消费电子产品、家用电器、数码设备、线材充电器、电池、电子元器件。 | |
| 15 | + | |
| 16 | +## 5. Clothing 服装 | |
| 17 | +Women's wear, men's wear, kids' wear, underwear, outerwear, activewear. 女装、男装、童装、内衣、外套、运动服装。 | |
| 18 | + | |
| 19 | +## 6. Outdoor 户外用品 | |
| 20 | +Camping gear, hiking equipment, fishing supplies, outdoor clothing, travel accessories, survival tools. 露营装备、徒步用品、渔具、户外服装、旅行配件、求生工具。 | |
| 21 | + | |
| 22 | +## 7. Home Appliances 家电/电器 | |
| 23 | +Kitchen appliances, cleaning appliances, personal care appliances, heating & cooling, smart home devices. 厨房电器、清洁电器、个护电器、冷暖设备、智能家居设备。 | |
| 24 | + | |
| 25 | +## 8. Home & Living 家居 | |
| 26 | +Furniture, home textiles, lighting, kitchenware, storage, home decor. 家具、家纺、灯具、厨具、收纳、家居装饰。 | |
| 27 | + | |
| 28 | +## 9. Wigs 假发 | |
| 29 | + | |
| 30 | +## 10. Beauty & Cosmetics 美容美妆 | |
| 31 | +Skincare, makeup, nail care, beauty tools, hair care, fragrances. 护肤品、彩妆、美甲、美容工具、护发、香水。 | |
| 32 | + | |
| 33 | +## 11. Accessories 配饰 | |
| 34 | +Jewelry, watches, belts, scarves, hats, sunglasses, hair accessories. 珠宝、手表、腰带、围巾、帽子、太阳镜、发饰。 | |
| 35 | + | |
| 36 | +## 12. Toys 玩具 | |
| 37 | +Educational toys, plush toys, action figures, puzzles, outdoor toys, DIY toys. 益智玩具、毛绒玩具、可动人偶、拼图、户外玩具、DIY玩具。 | |
| 38 | + | |
| 39 | +## 13. Shoes 鞋子 | |
| 40 | +Sneakers, boots, sandals, heels, flats, sports shoes. 运动鞋、靴子、凉鞋、高跟鞋、平底鞋、球鞋。 | |
| 41 | + | |
| 42 | +## 14. Sports 运动产品 | |
| 43 | +Fitness equipment, sports gear, team sports, racquet sports, water sports, cycling. 健身器材、运动装备、团队运动、球拍运动、水上运动、骑行。 | |
| 44 | + | |
| 45 | +## 15. Others 其他 | |
| 46 | + | |
| 47 | +# Taxonomy for each core category | |
| 48 | +## 1. Clothing & Apparel 服装 | |
| 49 | + | |
| 50 | +### A. Product Classification | |
| 51 | + | |
| 52 | +| Level-1 Group | Chinese Column Name | English Column Name | | |
| 53 | +| ------------------------- | ---- | ------------------- | | |
| 54 | +| A. Product Classification | 品类 | Product Type | | |
| 55 | +| A. Product Classification | 目标性别 | Target Gender | | |
| 56 | +| A. Product Classification | 年龄段 | Age Group | | |
| 57 | +| A. Product Classification | 适用季节 | Season | | |
| 58 | + | |
| 59 | +### B. Garment Design | |
| 60 | + | |
| 61 | +| Level-1 Group | Chinese Column Name | English Column Name | | |
| 62 | +| ----------------- | ---- | ------------------- | | |
| 63 | +| B. Garment Design | 版型 | Fit | | |
| 64 | +| B. Garment Design | 廓形 | Silhouette | | |
| 65 | +| B. Garment Design | 领型 | Neckline | | |
| | +| B. Garment Design | 袖长 | Sleeve Length Type | | |
| 66 | +| B. Garment Design | 袖型 | Sleeve Style | | |
| 67 | +| B. Garment Design | 肩带设计 | Strap Type | | |
| 68 | +| B. Garment Design | 腰型 | Rise / Waistline | | |
| 69 | +| B. Garment Design | 裤型 | Leg Shape | | |
| 70 | +| B. Garment Design | 裙型 | Skirt Shape | | |
| 71 | +| B. Garment Design | 长度 | Length Type | | |
| 72 | +| B. Garment Design | 闭合方式 | Closure Type | | |
| 73 | +| B. Garment Design | 设计细节 | Design Details | | |
| 74 | + | |
| 75 | +### C. Material & Performance | |
| 76 | + | |
| 77 | +| Level-1 Group | Chinese Column Name | English Column Name | | |
| 78 | +| ------------------------- | ----------- | -------------------- | | |
| 79 | +| C. Material & Performance | 面料 | Fabric | | |
| 80 | +| C. Material & Performance | 成分 | Material Composition | | |
| 81 | +| C. Material & Performance | 面料特性 | Fabric Properties | | |
| 82 | +| C. Material & Performance | 服装特征 / 功能细节 | Clothing Features | | |
| 83 | +| C. Material & Performance | 功能 | Functional Benefits | | |
| 84 | + | |
| 85 | +### D. Merchandising Attributes | |
| 86 | + | |
| 87 | +| Level-1 Group | Chinese Column Name | English Column Name | | |
| 88 | +| --------------------------- | ------- | ------------------- | | |
| 89 | +| D. Merchandising Attributes | 主颜色 | Color | | |
| 90 | +| D. Merchandising Attributes | 色系 | Color Family | | |
| 91 | +| D. Merchandising Attributes | 印花 / 图案 | Print / Pattern | | |
| 92 | +| D. Merchandising Attributes | 适用场景 | Occasion / End Use | | |
| 93 | +| D. Merchandising Attributes | 风格 | Style Aesthetic | | |
| 94 | + | |
| 95 | + | |
| 96 | + | |
| 97 | +This produces the following `enriched_taxonomy_attributes` columns: | |
| 99 | + | |
| 100 | +```text | |
| 101 | +Product Type | |
| 102 | +Target Gender | |
| 103 | +Age Group | |
| 104 | +Season | |
| 105 | +Fit | |
| 106 | +Silhouette | |
| 107 | +Neckline | |
| 108 | +Sleeve Length Type | |
| 109 | +Sleeve Style | |
| 110 | +Strap Type | |
| 111 | +Rise / Waistline | |
| 112 | +Leg Shape | |
| 113 | +Skirt Shape | |
| 114 | +Length Type | |
| 115 | +Closure Type | |
| 116 | +Design Details | |
| 117 | +Fabric | |
| 118 | +Material Composition | |
| 119 | +Fabric Properties | |
| 120 | +Clothing Features | |
| 121 | +Functional Benefits | |
| 122 | +Color | |
| 123 | +Color Family | |
| 124 | +Print / Pattern | |
| 125 | +Occasion / End Use | |
| 126 | +Style Aesthetic | |
| 127 | +``` | |
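As a sketch of how per-column outputs in these two languages might be folded into the nested `enriched_taxonomy_attributes` document shape shown earlier — the helper name and the dict-based input shape (column name mapped to a list of values) are assumptions for illustration:

```python
def to_enriched_attributes(row_zh: dict, row_en: dict) -> list:
    """Fold per-column values (column name -> list of values) into the
    nested {"name": ..., "value": {"zh": [...], "en": [...]}} shape."""
    attrs = []
    for name in sorted(set(row_zh) | set(row_en)):
        zh_vals = row_zh.get(name, [])
        en_vals = row_en.get(name, [])
        if not zh_vals and not en_vals:
            continue  # columns left blank by the model are dropped
        attrs.append({"name": name, "value": {"zh": zh_vals, "en": en_vals}})
    return attrs
```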
| 128 | + | |
| 129 | +Prompt: | |
| 130 | + | |
| 131 | +```python | |
| 132 | +SHARED_ANALYSIS_INSTRUCTION = """ | |
| 133 | +Analyze each input product text and fill the columns below using an apparel attribute taxonomy. | |
| 134 | + | |
| 135 | +Output columns: | |
| 136 | +1. Product Type: concise ecommerce apparel category label, not a full marketing title | |
| 137 | +2. Target Gender: intended gender only if clearly implied | |
| 138 | +3. Age Group: only if clearly implied, e.g. adults, kids, teens, toddlers, babies | |
| 139 | +4. Season: season(s) or all-season suitability only if supported | |
| 140 | +5. Fit: body closeness, e.g. slim, regular, relaxed, oversized, fitted | |
| 141 | +6. Silhouette: overall garment shape, e.g. straight, A-line, boxy, tapered, bodycon, wide-leg | |
| 142 | +7. Neckline: neckline type when applicable, e.g. crew neck, V-neck, hooded, collared, square neck | |
| 143 | +8. Sleeve Length Type: sleeve length only, e.g. sleeveless, short sleeve, long sleeve, three-quarter sleeve | |
| 144 | +9. Sleeve Style: sleeve design only, e.g. puff sleeve, raglan sleeve, batwing sleeve, bell sleeve | |
| 145 | +10. Strap Type: strap design when applicable, e.g. spaghetti strap, wide strap, halter strap, adjustable strap | |
| 146 | +11. Rise / Waistline: waist placement when applicable, e.g. high rise, mid rise, low rise, empire waist | |
| 147 | +12. Leg Shape: for bottoms only, e.g. straight leg, wide leg, flare leg, tapered leg, skinny leg | |
| 148 | +13. Skirt Shape: for skirts only, e.g. A-line, pleated, pencil, mermaid | |
| 149 | +14. Length Type: design length only, not size, e.g. cropped, regular, longline, mini, midi, maxi, ankle length, full length | |
| 150 | +15. Closure Type: fastening method when applicable, e.g. zipper, button, drawstring, elastic waist, hook-and-loop | |
| 151 | +16. Design Details: construction or visual details, e.g. ruched, ruffled, pleated, cut-out, layered, distressed, split hem | |
| 152 | +17. Fabric: fabric type only, e.g. denim, knit, chiffon, jersey, fleece, cotton twill | |
| 153 | +18. Material Composition: fiber content or blend only if stated, e.g. cotton, polyester, spandex, linen blend, 95% cotton 5% elastane | |
| 154 | +19. Fabric Properties: inherent fabric traits, e.g. stretch, breathable, lightweight, soft-touch, water-resistant | |
| 155 | +20. Clothing Features: product features, e.g. lined, reversible, hooded, packable, padded, pocketed | |
| 156 | +21. Functional Benefits: wearer benefits, e.g. moisture-wicking, thermal insulation, UV protection, easy care, supportive compression | |
| 157 | +22. Color: specific color name when available | |
| 158 | +23. Color Family: normalized broad retail color group, e.g. black, white, blue, green, red, pink, beige, brown, gray | |
| 159 | +24. Print / Pattern: surface pattern when applicable, e.g. solid, striped, plaid, floral, graphic, animal print | |
| 160 | +25. Occasion / End Use: likely use occasion only if supported, e.g. office, casual wear, streetwear, lounge, workout, outdoor | |
| 161 | +26. Style Aesthetic: overall style only if supported, e.g. minimalist, streetwear, athleisure, smart casual, romantic, playful | |
| 162 | + | |
| 163 | +Rules: | |
| 164 | +- Keep the same row order and row count as input. | |
| 165 | +- Infer only from the provided product text. | |
| 166 | +- Leave blank if not applicable or not reasonably supported. | |
| 167 | +- Use concise, standardized English ecommerce wording. | |
| 168 | +- Do not combine different attribute dimensions in one field. | |
| 169 | +- If multiple values are needed, use the delimiter required by the localization setting. | |
| 170 | + | |
| 171 | +Input product list: | |
| 172 | +""" | |
| 173 | +``` | |
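The instruction ends with `Input product list:`, so the caller appends the batched rows after it. A minimal sketch of that assembly (the function name and signature are hypothetical) that keeps row order and row count aligned with the input, per the rules above:

```python
def build_prompt(instruction: str, products: list) -> str:
    """Append a numbered product list after the shared instruction so the
    model can keep the same row order and row count as the input."""
    rows = [f"{i}. {text}" for i, text in enumerate(products, start=1)]
    return instruction.rstrip() + "\n" + "\n".join(rows) + "\n"
```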
| 174 | + | |
| 175 | +## 2. Other taxonomy profiles | |
| 176 | + | |
| 177 | +Notes: | |
| 178 | +- All profiles uniformly return `zh` + `en`. | |
| 179 | +- The profile slugs in code match the table below. | |
| 180 | + | |
| 181 | +| Profile | Core columns (`en`) | | |
| 182 | +| --- | --- | | |
| 183 | +| `3c` | Product Type, Compatible Device / Model, Connectivity, Interface / Port Type, Power Source / Charging, Key Features, Material / Finish, Color, Pack Size, Use Case | | |
| 184 | +| `bags` | Product Type, Target Gender, Carry Style, Size / Capacity, Material, Closure Type, Structure / Compartments, Strap / Handle Type, Color, Occasion / End Use | | |
| 185 | +| `pet_supplies` | Product Type, Pet Type, Breed Size, Life Stage, Material / Ingredients, Flavor / Scent, Key Features, Functional Benefits, Size / Capacity, Use Scenario | | |
| 186 | +| `electronics` | Product Type, Device Category / Compatibility, Power / Voltage, Connectivity, Interface / Port Type, Capacity / Storage, Key Features, Material / Finish, Color, Use Case | | |
| 187 | +| `outdoor` | Product Type, Activity Type, Season / Weather, Material, Capacity / Size, Protection / Resistance, Key Features, Portability / Packability, Color, Use Scenario | | |
| 188 | +| `home_appliances` | Product Type, Appliance Category, Power / Voltage, Capacity / Coverage, Control Method, Installation Type, Key Features, Material / Finish, Color, Use Scenario | | |
| 189 | +| `home_living` | Product Type, Room / Placement, Material, Style, Size / Dimensions, Color, Pattern / Finish, Key Features, Assembly / Installation, Use Scenario | | |
| 190 | +| `wigs` | Product Type, Hair Material, Hair Texture, Hair Length, Hair Color, Cap Construction, Lace Area / Part Type, Density / Volume, Style / Bang Type, Occasion / End Use | | |
| 191 | +| `beauty` | Product Type, Target Area, Skin Type / Hair Type, Finish / Effect, Key Ingredients, Shade / Color, Scent, Formulation, Functional Benefits, Use Scenario | | |
| 192 | +| `accessories` | Product Type, Target Gender, Material, Color, Pattern / Finish, Closure / Fastening, Size / Fit, Style, Occasion / End Use, Set / Pack Size | | |
| 193 | +| `toys` | Product Type, Age Group, Character / Theme, Material, Power Source, Interactive Features, Educational / Play Value, Piece Count / Size, Color, Use Scenario | | |
| 194 | +| `shoes` | Product Type, Target Gender, Age Group, Closure Type, Toe Shape, Heel Height / Sole Type, Upper Material, Lining / Insole Material, Color, Occasion / End Use | | |
| 195 | +| `sports` | Product Type, Sport / Activity, Skill Level, Material, Size / Capacity, Protection / Support, Key Features, Power Source, Color, Use Scenario | | |
| 196 | +| `others` | Product Type, Product Category, Target User, Material / Ingredients, Key Features, Functional Benefits, Size / Capacity, Color, Style / Theme, Use Scenario | | ... | ... |
mappings/README.md
mappings/generate_search_products_mapping.py
| ... | ... | @@ -214,6 +214,11 @@ FIELD_SPECS = [ |
| 214 | 214 | scalar_field("name", "keyword"), |
| 215 | 215 | text_field("value", "core_language_text_with_keyword"), |
| 216 | 216 | ), |
| 217 | + nested_field( | |
| 218 | + "enriched_taxonomy_attributes", | |
| 219 | + scalar_field("name", "keyword"), | |
| 220 | + text_field("value", "core_language_text_with_keyword"), | |
| 221 | + ), | |
| 217 | 222 | scalar_field("option1_name", "keyword"), |
| 218 | 223 | scalar_field("option2_name", "keyword"), |
| 219 | 224 | scalar_field("option3_name", "keyword"), | ... | ... |
mappings/search_products.json
| ... | ... | @@ -2116,6 +2116,40 @@ |
| 2116 | 2116 | } |
| 2117 | 2117 | } |
| 2118 | 2118 | }, |
| 2119 | + "enriched_taxonomy_attributes": { | |
| 2120 | + "type": "nested", | |
| 2121 | + "properties": { | |
| 2122 | + "name": { | |
| 2123 | + "type": "keyword" | |
| 2124 | + }, | |
| 2125 | + "value": { | |
| 2126 | + "type": "object", | |
| 2127 | + "properties": { | |
| 2128 | + "zh": { | |
| 2129 | + "type": "text", | |
| 2130 | + "analyzer": "index_ik", | |
| 2131 | + "search_analyzer": "query_ik", | |
| 2132 | + "fields": { | |
| 2133 | + "keyword": { | |
| 2134 | + "type": "keyword", | |
| 2135 | + "normalizer": "lowercase" | |
| 2136 | + } | |
| 2137 | + } | |
| 2138 | + }, | |
| 2139 | + "en": { | |
| 2140 | + "type": "text", | |
| 2141 | + "analyzer": "english", | |
| 2142 | + "fields": { | |
| 2143 | + "keyword": { | |
| 2144 | + "type": "keyword", | |
| 2145 | + "normalizer": "lowercase" | |
| 2146 | + } | |
| 2147 | + } | |
| 2148 | + } | |
| 2149 | + } | |
| 2150 | + } | |
| 2151 | + } | |
| 2152 | + }, | |
| 2119 | 2153 | "option1_name": { |
| 2120 | 2154 | "type": "keyword" |
| 2121 | 2155 | }, | ... | ... |
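To query the new field end to end, a nested query against it could look like the following sketch. The field paths follow the mapping fragment above; the helper name and example values are illustrative, not existing code:

```python
def enriched_attr_query(name: str, text: str, lang: str = "en") -> dict:
    """Build an Elasticsearch nested query that filters on the attribute
    name and full-text matches the localized attribute value."""
    path = "enriched_taxonomy_attributes"
    return {
        "nested": {
            "path": path,
            "query": {
                "bool": {
                    "filter": [{"term": {f"{path}.name": name}}],
                    "must": [{"match": {f"{path}.value.{lang}": text}}],
                }
            },
        }
    }
```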
perf_reports/20260311/reranker_1000docs/report.md
perf_reports/20260317/translation_local_models/README.md
| 1 | 1 | # Local Translation Model Benchmark Report |
| 2 | 2 | |
| 3 | -Test script: [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py) | |
| 3 | +Test script: [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py) | |
| 4 | 4 | |
| 5 | 5 | Test time: `2026-03-17` |
| 6 | 6 | |
| ... | ... | @@ -67,7 +67,7 @@ To model online search query translation, we reran NLLB with `batch_size=1`. In |
| 67 | 67 | Command used: |
| 68 | 68 | |
| 69 | 69 | ```bash |
| 70 | -./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \ | |
| 70 | +./.venv-translator/bin/python benchmarks/translation/benchmark_translation_local_models.py \ | |
| 71 | 71 | --single \ |
| 72 | 72 | --model nllb-200-distilled-600m \ |
| 73 | 73 | --source-lang zh \ | ... | ... |
perf_reports/20260318/nllb_t4_product_names_ct2/README.md
| 1 | 1 | # NLLB T4 Product-Name Tuning Summary |
| 2 | 2 | |
| 3 | 3 | Test script: |
| 4 | -- [`scripts/benchmark_nllb_t4_tuning.py`](/data/saas-search/scripts/benchmark_nllb_t4_tuning.py) | |
| 4 | +- [`benchmarks/translation/benchmark_nllb_t4_tuning.py`](/data/saas-search/benchmarks/translation/benchmark_nllb_t4_tuning.py) | |
| 5 | 5 | |
| 6 | 6 | Report for this run: |
| 7 | 7 | - Markdown:[`nllb_t4_tuning_003608.md`](/data/saas-search/perf_reports/20260318/nllb_t4_product_names_ct2/nllb_t4_tuning_003608.md) | ... | ... |
perf_reports/20260318/translation_local_models/README.md
| 1 | 1 | # Local Translation Model Benchmark Report |
| 2 | 2 | |
| 3 | 3 | Test script: |
| 4 | -- [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py) | |
| 4 | +- [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py) | |
| 5 | 5 | |
| 6 | 6 | Full results: |
| 7 | 7 | - Markdown:[`translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md) |
| ... | ... | @@ -39,7 +39,7 @@ |
| 39 | 39 | |
| 40 | 40 | ```bash |
| 41 | 41 | cd /data/saas-search |
| 42 | -./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \ | |
| 42 | +./.venv-translator/bin/python benchmarks/translation/benchmark_translation_local_models.py \ | |
| 43 | 43 | --suite extended \ |
| 44 | 44 | --disable-cache \ |
| 45 | 45 | --serial-items-per-case 256 \ | ... | ... |
perf_reports/20260318/translation_local_models_ct2/README.md
| 1 | 1 | # Local Translation Model Benchmark Report (CTranslate2) |
| 2 | 2 | |
| 3 | 3 | Test script: |
| 4 | -- [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py) | |
| 4 | +- [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py) | |
| 5 | 5 | |
| 6 | 6 | CT2 results for this run: |
| 7 | 7 | - Markdown:[`translation_local_models_ct2_extended_233253.md`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2/translation_local_models_ct2_extended_233253.md) |
| ... | ... | @@ -46,7 +46,7 @@ from datetime import datetime |
| 46 | 46 | from pathlib import Path |
| 47 | 47 | from types import SimpleNamespace |
| 48 | 48 | |
| 49 | -from scripts.benchmark_translation_local_models import ( | |
| 49 | +from benchmarks.translation.benchmark_translation_local_models import ( | |
| 50 | 50 | SCENARIOS, |
| 51 | 51 | benchmark_extended_scenario, |
| 52 | 52 | build_environment_info, | ... | ... |
perf_reports/20260318/translation_local_models_ct2_focus/README.md
| 1 | 1 | # Local Translation Model Focused T4 Tuning |
| 2 | 2 | |
| 3 | 3 | Test script: |
| 4 | -- [`scripts/benchmark_translation_local_models_focus.py`](/data/saas-search/scripts/benchmark_translation_local_models_focus.py) | |
| 4 | +- [`benchmarks/translation/benchmark_translation_local_models_focus.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models_focus.py) | |
| 5 | 5 | |
| 6 | 6 | Focused results for this run: |
| 7 | 7 | - Markdown:[`translation_local_models_focus_235018.md`](/data/saas-search/perf_reports/20260318/translation_local_models_ct2_focus/translation_local_models_focus_235018.md) | ... | ... |
perf_reports/README.md
| ... | ... | @@ -4,7 +4,7 @@ |
| 4 | 4 | |
| 5 | 5 | | Script | Purpose | |
| 6 | 6 | |------|------| |
| 7 | -| `scripts/perf_api_benchmark.py` | Load testing for the search backend, embedding, translation, and rerank HTTP APIs; supports `--embed-text-priority` / `--embed-image-priority` and `scripts/perf_cases.json.example` | | |
| 7 | +| `benchmarks/perf_api_benchmark.py` | Load testing for the search backend, embedding, translation, and rerank HTTP APIs; supports `--embed-text-priority` / `--embed-image-priority` and `benchmarks/perf_cases.json.example` | | |
| 8 | 8 | |
| 9 | 9 | Historical matrix example (concurrency sweep): |
| 10 | 10 | |
| ... | ... | @@ -25,10 +25,10 @@ |
| 25 | 25 | |
| 26 | 26 | ```bash |
| 27 | 27 | source activate.sh |
| 28 | -python scripts/perf_api_benchmark.py --scenario embed_text --duration 8 --concurrency 10 --timeout 30 --output perf_reports/2026-03-20_embed_text_p0.json | |
| 29 | -python scripts/perf_api_benchmark.py --scenario embed_text --duration 8 --concurrency 10 --embed-text-priority 1 --output perf_reports/2026-03-20_embed_text_p1.json | |
| 30 | -python scripts/perf_api_benchmark.py --scenario embed_image --duration 8 --concurrency 5 --timeout 60 --output perf_reports/2026-03-20_embed_image_p0.json | |
| 31 | -python scripts/perf_api_benchmark.py --scenario embed_image --duration 8 --concurrency 5 --embed-image-priority 1 --output perf_reports/2026-03-20_embed_image_p1.json | |
| 28 | +python benchmarks/perf_api_benchmark.py --scenario embed_text --duration 8 --concurrency 10 --timeout 30 --output perf_reports/2026-03-20_embed_text_p0.json | |
| 29 | +python benchmarks/perf_api_benchmark.py --scenario embed_text --duration 8 --concurrency 10 --embed-text-priority 1 --output perf_reports/2026-03-20_embed_text_p1.json | |
| 30 | +python benchmarks/perf_api_benchmark.py --scenario embed_image --duration 8 --concurrency 5 --timeout 60 --output perf_reports/2026-03-20_embed_image_p0.json | |
| 31 | +python benchmarks/perf_api_benchmark.py --scenario embed_image --duration 8 --concurrency 5 --embed-image-priority 1 --output perf_reports/2026-03-20_embed_image_p1.json | |
| 32 | 32 | ``` |
| 33 | 33 | |
| 34 | 34 | Note: this run is an **8-second smoke**; its duration/concurrency cannot be compared directly with the `2026-03-12` matrix. It only verifies that the services still return 200 with the `priority` parameters and that payload validation passes. |
perf_reports/reranker_vllm_instruction/2026-03-25/RESULTS.md
| ... | ... | @@ -25,7 +25,7 @@ Shared across both backends for this run: |
| 25 | 25 | |
| 26 | 26 | ## Methodology |
| 27 | 27 | |
| 28 | -- Script: `python scripts/benchmark_reranker_random_titles.py 100,200,400,600,800,1000 --repeat 5` with **`--seed 99`** (see note below), **`--quiet-runs`**, **`--timeout 360`**. | |
| 28 | +- Script: `python benchmarks/reranker/benchmark_reranker_random_titles.py 100,200,400,600,800,1000 --repeat 5` with **`--seed 99`** (see note below), **`--quiet-runs`**, **`--timeout 360`**. | |
| 29 | 29 | - Titles: default file `/home/ubuntu/rerank_test/titles.1.8w` (one title per line). |
| 30 | 30 | - Query: default `健身女生T恤短袖`. |
| 31 | 31 | - Each scenario: **3 warm-up** requests at `n=400` (not timed), then **5 timed** runs per `n`. |
| ... | ... | @@ -56,9 +56,9 @@ JSON aggregates (means, stdev, raw `values_ms`): same directory, `qwen3_vllm_{co |
| 56 | 56 | ## Tooling added / changed |
| 57 | 57 | |
| 58 | 58 | - `reranker/server.py`: `/health` includes `instruction_format` when the active backend sets `_instruction_format`. |
| 59 | -- `scripts/benchmark_reranker_random_titles.py`: `--tag`, `--json-summary-out`, `--quiet-runs`. | |
| 60 | -- `scripts/patch_rerank_vllm_benchmark_config.py`: surgical YAML patch (preserves newlines). | |
| 61 | -- `scripts/run_reranker_vllm_instruction_benchmark.sh`: full matrix driver (continues if a benchmark exits non-zero; uses `--timeout 360`). | |
| 59 | +- `benchmarks/reranker/benchmark_reranker_random_titles.py`: `--tag`, `--json-summary-out`, `--quiet-runs`. | |
| 60 | +- `benchmarks/reranker/patch_rerank_vllm_benchmark_config.py`: surgical YAML patch (preserves newlines). | |
| 61 | +- `benchmarks/reranker/run_reranker_vllm_instruction_benchmark.sh`: full matrix driver (continues if a benchmark exits non-zero; uses `--timeout 360`). | |
| 62 | 62 | |
| 63 | 63 | --- |
| 64 | 64 | |
| ... | ... | @@ -73,7 +73,7 @@ JSON aggregates (means, stdev, raw `values_ms`): same directory, `qwen3_vllm_{co |
| 73 | 73 | | Attention | Backend forced / steered attention on T4 (e.g. `TRITON_ATTN` path) | **No** `attention_config` in `LLM(...)`; vLLM **auto** — on this T4 run, logs show **`FLASHINFER`** | |
| 74 | 74 | | Config surface | `vllm_attention_backend` / `RERANK_VLLM_ATTENTION_BACKEND`, etc. | **Removed** (fewer YAML/env-var branches; logic consolidated) |
| 75 | 75 | | Code default `instruction_format` | `qwen3_vllm_score` defaulted to `standard` | Aligned with `qwen3_vllm` as **`compact`** (`standard` can still be set in YAML) |
| 76 | -| Smoke / startup | — | `scripts/smoke_qwen3_vllm_score_backend.py`; `scripts/start_reranker.sh` puts the **venv `bin` on `PATH`** (FLASHINFER JIT depends on `ninja` inside the venv) | | |
| 76 | +| Smoke / startup | — | `benchmarks/reranker/smoke_qwen3_vllm_score_backend.py`; `scripts/start_reranker.sh` puts the **venv `bin` on `PATH`** (FLASHINFER JIT depends on `ninja` inside the venv) | | |
| 77 | 77 | |
| 78 | 78 | Micro-benchmark (same machine, isolated): **~927.5 ms → ~673.1 ms** at **n=400** docs on `LLM.score()` steady state (~**28%**), after removing the forced attention path and letting vLLM pick **FLASHINFER**. |
| 79 | 79 | ... | ... |
requirements_translator_service.txt
| ... | ... | @@ -13,7 +13,8 @@ httpx>=0.24.0 |
| 13 | 13 | tqdm>=4.65.0 |
| 14 | 14 | |
| 15 | 15 | torch>=2.0.0 |
| 16 | -transformers>=4.30.0 | |
| 16 | +# Keep translator conversions on the last verified NLLB-compatible release line. | |
| 17 | +transformers>=4.51.0,<4.52.0 | |
| 17 | 18 | ctranslate2>=4.7.0 |
| 18 | 19 | sentencepiece>=0.2.0 |
| 19 | 20 | sacremoses>=0.1.1 | ... | ... |
reranker/DEPLOYMENT_AND_TUNING.md
reranker/GGUF_0_6B_INSTALL_AND_TUNING.md
| ... | ... | @@ -144,7 +144,7 @@ qwen3_gguf_06b: |
| 144 | 144 | |
| 145 | 145 | ```bash |
| 146 | 146 | PYTHONPATH=/data/saas-search ./.venv-reranker-gguf/bin/python \ |
| 147 | - scripts/benchmark_reranker_gguf_local.py --backend-name qwen3_gguf_06b --docs 400 | |
| 147 | + benchmarks/reranker/benchmark_reranker_gguf_local.py --backend-name qwen3_gguf_06b --docs 400 | |
| 148 | 148 | ``` |
| 149 | 149 | |
| 150 | 150 | Start it as a service: |
reranker/GGUF_INSTALL_AND_TUNING.md
| ... | ... | @@ -117,7 +117,7 @@ HF_HUB_DISABLE_XET=1 |
| 117 | 117 | |
| 118 | 118 | ```bash |
| 119 | 119 | PYTHONPATH=/data/saas-search ./.venv-reranker-gguf/bin/python \ |
| 120 | - scripts/benchmark_reranker_gguf_local.py --docs 64 --repeat 1 | |
| 120 | + benchmarks/reranker/benchmark_reranker_gguf_local.py --docs 64 --repeat 1 | |
| 121 | 121 | ``` |
| 122 | 122 | |
| 123 | 123 | It instantiates the GGUF backend directly and prints: |
| ... | ... | @@ -134,7 +134,7 @@ PYTHONPATH=/data/saas-search ./.venv-reranker-gguf/bin/python \ |
| 134 | 134 | |
| 135 | 135 | - Query: `白色oversized T-shirt` |
| 136 | 136 | - Docs: `64` product titles |
| 137 | -- Local script: `scripts/benchmark_reranker_gguf_local.py` | |
| 137 | +- Local script: `benchmarks/reranker/benchmark_reranker_gguf_local.py` | |
| 138 | 138 | - One run per configuration; the focus is on relative trends |
| 139 | 139 | |
| 140 | 140 | Results: |
| ... | ... | @@ -195,7 +195,7 @@ n_gpu_layers=999 |
| 195 | 195 | |
| 196 | 196 | ```bash |
| 197 | 197 | RERANK_BASE=http://127.0.0.1:6007 \ |
| 198 | - ./.venv/bin/python scripts/benchmark_reranker_random_titles.py 64 --repeat 1 --query '白色oversized T-shirt' | |
| 198 | + ./.venv/bin/python benchmarks/reranker/benchmark_reranker_random_titles.py 64 --repeat 1 --query '白色oversized T-shirt' | |
| 199 | 199 | ``` |
| 200 | 200 | |
| 201 | 201 | Result: |
| ... | ... | @@ -206,7 +206,7 @@ RERANK_BASE=http://127.0.0.1:6007 \ |
| 206 | 206 | |
| 207 | 207 | ```bash |
| 208 | 208 | RERANK_BASE=http://127.0.0.1:6007 \ |
| 209 | - ./.venv/bin/python scripts/benchmark_reranker_random_titles.py 153 --repeat 1 --query '白色oversized T-shirt' | |
| 209 | + ./.venv/bin/python benchmarks/reranker/benchmark_reranker_random_titles.py 153 --repeat 1 --query '白色oversized T-shirt' | |
| 210 | 210 | ``` |
| 211 | 211 | |
| 212 | 212 | Result: |
| ... | ... | @@ -276,5 +276,5 @@ offload_kqv: true |
| 276 | 276 | - `config/config.yaml` |
| 277 | 277 | - `scripts/setup_reranker_venv.sh` |
| 278 | 278 | - `scripts/start_reranker.sh` |
| 279 | -- `scripts/benchmark_reranker_gguf_local.py` | |
| 279 | +- `benchmarks/reranker/benchmark_reranker_gguf_local.py` | |
| 280 | 280 | - `reranker/GGUF_INSTALL_AND_TUNING.md` | ... | ... |
reranker/README.md
| ... | ... | @@ -46,9 +46,9 @@ Reranker 服务提供统一的 `/rerank` API,支持可插拔后端(BGE、Jin |
| 46 | 46 | - `backends/dashscope_rerank.py`: DashScope cloud rerank backend |
| 47 | 47 | - `scripts/setup_reranker_venv.sh`: creates a separate venv per backend |
| 48 | 48 | - `scripts/start_reranker.sh`: starts the reranker service |
| 49 | -- `scripts/smoke_qwen3_vllm_score_backend.py`: local smoke test for `qwen3_vllm_score` | |
| 50 | -- `scripts/benchmark_reranker_random_titles.py`: random-title load-test script | |
| 51 | -- `scripts/run_reranker_vllm_instruction_benchmark.sh`: historical matrix script | |
| 49 | +- `benchmarks/reranker/smoke_qwen3_vllm_score_backend.py`: local smoke test for `qwen3_vllm_score` | |
| 50 | +- `benchmarks/reranker/benchmark_reranker_random_titles.py`: random-title load-test script | |
| 51 | +- `benchmarks/reranker/run_reranker_vllm_instruction_benchmark.sh`: historical matrix script | |
| 52 | 52 | |
| 53 | 53 | ## Environment baseline |
| 54 | 54 | |
| ... | ... | @@ -118,7 +118,7 @@ nvidia-smi |
| 118 | 118 | ### 4. Smoke |
| 119 | 119 | |
| 120 | 120 | ```bash |
| 121 | -PYTHONPATH=. ./.venv-reranker-score/bin/python scripts/smoke_qwen3_vllm_score_backend.py --gpu-memory-utilization 0.2 | |
| 121 | +PYTHONPATH=. ./.venv-reranker-score/bin/python benchmarks/reranker/smoke_qwen3_vllm_score_backend.py --gpu-memory-utilization 0.2 | |
| 122 | 122 | ``` |
| 123 | 123 | |
| 124 | 124 | ## `jina_reranker_v3` | ... | ... |
| ... | ... | @@ -0,0 +1,59 @@ |
| 1 | +# Scripts | |
| 2 | + | |
| 3 | +`scripts/` now keeps only the run, ops, environment-setup, and data-processing scripts that remain valid under the current architecture, split into stable subdirectories by responsibility instead of continuing to pile everything flat in the root. | |
| 4 | + | |
| 5 | +## Current layout | |
| 6 | + | |
| 7 | +- Service orchestration | |
| 8 | + - `service_ctl.sh` | |
| 9 | + - `start_backend.sh` | |
| 10 | + - `start_indexer.sh` | |
| 11 | + - `start_frontend.sh` | |
| 12 | + - `start_eval_web.sh` | |
| 13 | + - `start_embedding_service.sh` | |
| 14 | + - `start_embedding_text_service.sh` | |
| 15 | + - `start_embedding_image_service.sh` | |
| 16 | + - `start_reranker.sh` | |
| 17 | + - `start_translator.sh` | |
| 18 | + - `start_tei_service.sh` | |
| 19 | + - `start_cnclip_service.sh` | |
| 20 | + - `stop.sh` | |
| 21 | + - `stop_tei_service.sh` | |
| 22 | + - `stop_cnclip_service.sh` | |
| 23 | + - `frontend/` | |
| 24 | + - `ops/` | |
| 25 | + | |
| 26 | +- Environment setup | |
| 27 | + - `create_venv.sh` | |
| 28 | + - `init_env.sh` | |
| 29 | + - `setup_embedding_venv.sh` | |
| 30 | + - `setup_reranker_venv.sh` | |
| 31 | + - `setup_translator_venv.sh` | |
| 32 | + - `setup_cnclip_venv.sh` | |
| 33 | + | |
| 34 | +- Data & indexing | |
| 35 | + - `create_tenant_index.sh` | |
| 36 | + - `build_suggestions.sh` | |
| 37 | + - `mock_data.sh` | |
| 38 | + - `data_import/` | |
| 39 | + - `inspect/` | |
| 40 | + - `maintenance/` | |
| 41 | + | |
| 42 | +- Evaluation & special-purpose tools | |
| 43 | + - `evaluation/` | |
| 44 | + - `redis/` | |
| 45 | + - `debug/` | |
| 46 | + - `translation/` | |
| 47 | + | |
| 48 | +## Migrated | |
| 49 | + | |
| 50 | +- Benchmark and smoke scripts: moved to `benchmarks/` | |
| 51 | +- Manual API trial scripts: moved to `tests/manual/` | |
| 52 | + | |
| 53 | +## Cleaned up | |
| 54 | + | |
| 55 | +- Historical backup directory: `indexer__old_2025_11/` | |
| 56 | +- Obsolete wrapper script: `start.sh` | |
| 57 | +- Conda-era leftover: `install_server_deps.sh` | |
| 58 | + | |
| 59 | +Going forward, new scripts should land in a clearly named subdirectory; do not drop benchmarks, manual scripts, or historical backups back into the root of `scripts/`. | |
| ... | ... | @@ -0,0 +1,13 @@ |
| 1 | +# Data Import Scripts | |
| 2 | + | |
| 3 | +These scripts convert external product data or CSV/XLSX samples into the Shoplazza import format. | |
| 4 | + | |
| 5 | +- `amazon_xlsx_to_shoplazza_xlsx.py` | |
| 6 | +- `competitor_xlsx_to_shoplazza_xlsx.py` | |
| 7 | +- `csv_to_excel.py` | |
| 8 | +- `csv_to_excel_multi_variant.py` | |
| 9 | +- `shoplazza_excel_template.py` | |
| 10 | +- `shoplazza_import_template.py` | |
| 11 | +- `tenant3_csv_to_shoplazza_xlsx.sh` | |
| 12 | + | |
| 13 | +These are offline data-conversion tools, not entry points for online service operations. | |
scripts/amazon_xlsx_to_shoplazza_xlsx.py renamed to scripts/data_import/amazon_xlsx_to_shoplazza_xlsx.py
| ... | ... | @@ -35,9 +35,10 @@ from pathlib import Path |
| 35 | 35 | |
| 36 | 36 | from openpyxl import load_workbook |
| 37 | 37 | |
| 38 | -# Allow running as `python scripts/xxx.py` without installing as a package | |
| 39 | -sys.path.insert(0, str(Path(__file__).resolve().parent)) | |
| 40 | -from shoplazza_excel_template import create_excel_from_template_fast | |
| 38 | +REPO_ROOT = Path(__file__).resolve().parents[2] | |
| 39 | +sys.path.insert(0, str(REPO_ROOT)) | |
| 40 | + | |
| 41 | +from scripts.data_import.shoplazza_excel_template import create_excel_from_template_fast | |
| 41 | 42 | |
| 42 | 43 | |
| 43 | 44 | PREFERRED_OPTION_KEYS = [ |
| ... | ... | @@ -612,4 +613,3 @@ def main(): |
| 612 | 613 | if __name__ == "__main__": |
| 613 | 614 | main() |
| 614 | 615 | |
| 615 | - | ... | ... |
scripts/competitor_xlsx_to_shoplazza_xlsx.py renamed to scripts/data_import/competitor_xlsx_to_shoplazza_xlsx.py
| ... | ... | @@ -6,7 +6,7 @@ The input `data/mai_jia_jing_ling/products_data/*.xlsx` files are Amazon-format |
| 6 | 6 | (Parent/Child ASIN), not “competitor data”. |
| 7 | 7 | |
| 8 | 8 | Please use: |
| 9 | - - `scripts/amazon_xlsx_to_shoplazza_xlsx.py` | |
| 9 | + - `scripts/data_import/amazon_xlsx_to_shoplazza_xlsx.py` | |
| 10 | 10 | |
| 11 | 11 | This wrapper simply forwards all CLI args to the correctly named script, so you |
| 12 | 12 | automatically get the latest performance improvements (fast read/write). |
| ... | ... | @@ -15,13 +15,12 @@ automatically get the latest performance improvements (fast read/write). |
| 15 | 15 | import sys |
| 16 | 16 | from pathlib import Path |
| 17 | 17 | |
| 18 | -# Allow running as `python scripts/xxx.py` without installing as a package | |
| 19 | -sys.path.insert(0, str(Path(__file__).resolve().parent)) | |
| 18 | +REPO_ROOT = Path(__file__).resolve().parents[2] | |
| 19 | +sys.path.insert(0, str(REPO_ROOT)) | |
| 20 | 20 | |
| 21 | -from amazon_xlsx_to_shoplazza_xlsx import main as amazon_main | |
| 21 | +from scripts.data_import.amazon_xlsx_to_shoplazza_xlsx import main as amazon_main | |
| 22 | 22 | |
| 23 | 23 | |
| 24 | 24 | if __name__ == "__main__": |
| 25 | 25 | amazon_main() |
| 26 | 26 | |
| 27 | - | ... | ... |
scripts/csv_to_excel.py renamed to scripts/data_import/csv_to_excel.py
| ... | ... | @@ -22,12 +22,12 @@ from openpyxl import load_workbook |
| 22 | 22 | from openpyxl.styles import Font, Alignment |
| 23 | 23 | from openpyxl.utils import get_column_letter |
| 24 | 24 | |
| 25 | -# Shared helpers (keeps template writing consistent across scripts) | |
| 26 | -from scripts.shoplazza_import_template import create_excel_from_template as _create_excel_from_template_shared | |
| 27 | -from scripts.shoplazza_import_template import generate_handle as _generate_handle_shared | |
| 25 | +REPO_ROOT = Path(__file__).resolve().parents[2] | |
| 26 | +sys.path.insert(0, str(REPO_ROOT)) | |
| 28 | 27 | |
| 29 | -# Add parent directory to path | |
| 30 | -sys.path.insert(0, str(Path(__file__).parent.parent)) | |
| 28 | +# Shared helpers (keeps template writing consistent across scripts) | |
| 29 | +from scripts.data_import.shoplazza_import_template import create_excel_from_template as _create_excel_from_template_shared | |
| 30 | +from scripts.data_import.shoplazza_import_template import generate_handle as _generate_handle_shared | |
| 31 | 31 | |
| 32 | 32 | |
| 33 | 33 | def clean_value(value): |
| ... | ... | @@ -299,4 +299,3 @@ def main(): |
| 299 | 299 | |
| 300 | 300 | if __name__ == '__main__': |
| 301 | 301 | main() |
| 302 | - | ... | ... |
scripts/csv_to_excel_multi_variant.py renamed to scripts/data_import/csv_to_excel_multi_variant.py
| ... | ... | @@ -22,12 +22,12 @@ import itertools |
| 22 | 22 | from openpyxl import load_workbook |
| 23 | 23 | from openpyxl.styles import Alignment |
| 24 | 24 | |
| 25 | -# Shared helpers (keeps template writing consistent across scripts) | |
| 26 | -from scripts.shoplazza_import_template import create_excel_from_template as _create_excel_from_template_shared | |
| 27 | -from scripts.shoplazza_import_template import generate_handle as _generate_handle_shared | |
| 25 | +REPO_ROOT = Path(__file__).resolve().parents[2] | |
| 26 | +sys.path.insert(0, str(REPO_ROOT)) | |
| 28 | 27 | |
| 29 | -# Add parent directory to path | |
| 30 | -sys.path.insert(0, str(Path(__file__).parent.parent)) | |
| 28 | +# Shared helpers (keeps template writing consistent across scripts) | |
| 29 | +from scripts.data_import.shoplazza_import_template import create_excel_from_template as _create_excel_from_template_shared | |
| 30 | +from scripts.data_import.shoplazza_import_template import generate_handle as _generate_handle_shared | |
| 31 | 31 | |
| 32 | 32 | # Color definitions |
| 33 | 33 | COLORS = [ |
| ... | ... | @@ -562,4 +562,3 @@ def main(): |
| 562 | 562 | |
| 563 | 563 | if __name__ == '__main__': |
| 564 | 564 | main() |
| 565 | - | ... | ... |
scripts/shoplazza_excel_template.py renamed to scripts/data_import/shoplazza_excel_template.py
scripts/shoplazza_import_template.py renamed to scripts/data_import/shoplazza_import_template.py
scripts/tenant3__csv_to_shoplazza_xlsx.sh renamed to scripts/data_import/tenant3_csv_to_shoplazza_xlsx.sh
| ... | ... | @@ -5,16 +5,16 @@ cd "$(dirname "$0")/.." |
| 5 | 5 | source ./activate.sh |
| 6 | 6 | |
| 7 | 7 | # # Basic usage (generate all data) |
| 8 | -# python scripts/csv_to_excel.py | |
| 8 | +# python scripts/data_import/csv_to_excel.py | |
| 9 | 9 | |
| 10 | 10 | # # Specify the output file |
| 11 | -# python scripts/csv_to_excel.py --output tenant3_imports.xlsx | |
| 11 | +# python scripts/data_import/csv_to_excel.py --output tenant3_imports.xlsx | |
| 12 | 12 | |
| 13 | 13 | # # Limit the number of rows processed (for testing) |
| 14 | -# python scripts/csv_to_excel.py --limit 100 | |
| 14 | +# python scripts/data_import/csv_to_excel.py --limit 100 | |
| 15 | 15 | |
| 16 | 16 | # Specify the CSV file and the template file |
| 17 | -python scripts/csv_to_excel.py \ | |
| 17 | +python scripts/data_import/csv_to_excel.py \ | |
| 18 | 18 | --csv-file data/customer1/goods_with_pic.5years_congku.csv.shuf.1w \ |
| 19 | 19 | --template docs/商品导入模板.xlsx \ |
| 20 | - --output tenant3_imports.xlsx | |
| 21 | 20 | \ No newline at end of file |
| 21 | + --output tenant3_imports.xlsx | ... | ... |
scripts/trace_indexer_calls.sh renamed to scripts/debug/trace_indexer_calls.sh
scripts/download_translation_models.py
| 1 | 1 | #!/usr/bin/env python3 |
| 2 | -"""Download local translation models declared in services.translation.capabilities.""" | |
| 2 | +"""Backward-compatible entrypoint for translation model downloads.""" | |
| 3 | 3 | |
| 4 | 4 | from __future__ import annotations |
| 5 | 5 | |
| 6 | -import argparse | |
| 7 | -import os | |
| 6 | +import runpy | |
| 8 | 7 | from pathlib import Path |
| 9 | -import shutil | |
| 10 | -import subprocess | |
| 11 | -import sys | |
| 12 | -from typing import Iterable | |
| 13 | - | |
| 14 | -from huggingface_hub import snapshot_download | |
| 15 | - | |
| 16 | -PROJECT_ROOT = Path(__file__).resolve().parent.parent | |
| 17 | -if str(PROJECT_ROOT) not in sys.path: | |
| 18 | - sys.path.insert(0, str(PROJECT_ROOT)) | |
| 19 | -os.environ.setdefault("HF_HUB_DISABLE_XET", "1") | |
| 20 | - | |
| 21 | -from config.services_config import get_translation_config | |
| 22 | - | |
| 23 | - | |
| 24 | -LOCAL_BACKENDS = {"local_nllb", "local_marian"} | |
| 25 | - | |
| 26 | - | |
| 27 | -def iter_local_capabilities(selected: set[str] | None = None) -> Iterable[tuple[str, dict]]: | |
| 28 | - cfg = get_translation_config() | |
| 29 | - capabilities = cfg.get("capabilities", {}) if isinstance(cfg, dict) else {} | |
| 30 | - for name, capability in capabilities.items(): | |
| 31 | - backend = str(capability.get("backend") or "").strip().lower() | |
| 32 | - if backend not in LOCAL_BACKENDS: | |
| 33 | - continue | |
| 34 | - if selected and name not in selected: | |
| 35 | - continue | |
| 36 | - yield name, capability | |
| 37 | - | |
| 38 | - | |
| 39 | -def _compute_ct2_output_dir(capability: dict) -> Path: | |
| 40 | - custom = str(capability.get("ct2_model_dir") or "").strip() | |
| 41 | - if custom: | |
| 42 | - return Path(custom).expanduser() | |
| 43 | - model_dir = Path(str(capability.get("model_dir") or "")).expanduser() | |
| 44 | - compute_type = str(capability.get("ct2_compute_type") or capability.get("torch_dtype") or "default").strip().lower() | |
| 45 | - normalized = compute_type.replace("_", "-") | |
| 46 | - return model_dir / f"ctranslate2-{normalized}" | |
| 47 | - | |
| 48 | - | |
| 49 | -def _resolve_converter_binary() -> str: | |
| 50 | - candidate = shutil.which("ct2-transformers-converter") | |
| 51 | - if candidate: | |
| 52 | - return candidate | |
| 53 | - venv_candidate = Path(sys.executable).absolute().parent / "ct2-transformers-converter" | |
| 54 | - if venv_candidate.exists(): | |
| 55 | - return str(venv_candidate) | |
| 56 | - raise RuntimeError( | |
| 57 | - "ct2-transformers-converter was not found. " | |
| 58 | - "Install ctranslate2 in the active Python environment first." | |
| 59 | - ) | |
| 60 | - | |
| 61 | - | |
| 62 | -def convert_to_ctranslate2(name: str, capability: dict) -> None: | |
| 63 | - model_id = str(capability.get("model_id") or "").strip() | |
| 64 | - model_dir = Path(str(capability.get("model_dir") or "")).expanduser() | |
| 65 | - model_source = str(model_dir if model_dir.exists() else model_id) | |
| 66 | - output_dir = _compute_ct2_output_dir(capability) | |
| 67 | - if (output_dir / "model.bin").exists(): | |
| 68 | - print(f"[skip-convert] {name} -> {output_dir}") | |
| 69 | - return | |
| 70 | - quantization = str( | |
| 71 | - capability.get("ct2_conversion_quantization") | |
| 72 | - or capability.get("ct2_compute_type") | |
| 73 | - or capability.get("torch_dtype") | |
| 74 | - or "default" | |
| 75 | - ).strip() | |
| 76 | - output_dir.parent.mkdir(parents=True, exist_ok=True) | |
| 77 | - print(f"[convert] {name} -> {output_dir} ({quantization})") | |
| 78 | - subprocess.run( | |
| 79 | - [ | |
| 80 | - _resolve_converter_binary(), | |
| 81 | - "--model", | |
| 82 | - model_source, | |
| 83 | - "--output_dir", | |
| 84 | - str(output_dir), | |
| 85 | - "--quantization", | |
| 86 | - quantization, | |
| 87 | - ], | |
| 88 | - check=True, | |
| 89 | - ) | |
| 90 | - print(f"[converted] {name}") | |
| 91 | - | |
| 92 | - | |
| 93 | -def main() -> None: | |
| 94 | - parser = argparse.ArgumentParser(description="Download local translation models") | |
| 95 | - parser.add_argument("--all-local", action="store_true", help="Download all configured local translation models") | |
| 96 | - parser.add_argument("--models", nargs="*", default=[], help="Specific capability names to download") | |
| 97 | - parser.add_argument( | |
| 98 | - "--convert-ctranslate2", | |
| 99 | - action="store_true", | |
| 100 | - help="Also convert the downloaded Hugging Face models into CTranslate2 format", | |
| 101 | - ) | |
| 102 | - args = parser.parse_args() | |
| 103 | - | |
| 104 | - selected = {item.strip().lower() for item in args.models if item.strip()} or None | |
| 105 | - if not args.all_local and not selected: | |
| 106 | - parser.error("pass --all-local or --models <name> ...") | |
| 107 | - | |
| 108 | - for name, capability in iter_local_capabilities(selected): | |
| 109 | - model_id = str(capability.get("model_id") or "").strip() | |
| 110 | - model_dir = Path(str(capability.get("model_dir") or "")).expanduser() | |
| 111 | - if not model_id or not model_dir: | |
| 112 | - raise ValueError(f"Capability '{name}' must define model_id and model_dir") | |
| 113 | - model_dir.parent.mkdir(parents=True, exist_ok=True) | |
| 114 | - print(f"[download] {name} -> {model_dir} ({model_id})") | |
| 115 | - snapshot_download( | |
| 116 | - repo_id=model_id, | |
| 117 | - local_dir=str(model_dir), | |
| 118 | - ) | |
| 119 | - print(f"[done] {name}") | |
| 120 | - if args.convert_ctranslate2: | |
| 121 | - convert_to_ctranslate2(name, capability) | |
| 122 | 8 | |
| 123 | 9 | |
| 124 | 10 | if __name__ == "__main__": |
| 125 | - main() | |
| 11 | + target = Path(__file__).resolve().parent / "translation" / "download_translation_models.py" | |
| 12 | + runpy.run_path(str(target), run_name="__main__") | ... | ... |
scripts/evaluation/README.md
| ... | ... | @@ -127,8 +127,8 @@ This framework now follows graded ranking evaluation closer to e-commerce best p |
| 127 | 127 | - **Composite tuning score: `Primary_Metric_Score`** |
| 128 | 128 | For experiment ranking we compute the mean of the primary scorecard after normalizing `Avg_Grade@10` by the max grade (`3`). |
| 129 | 129 | - **Gain scheme** |
| 130 | - `Fully Relevant=7`, `Mostly Relevant=3`, `Weakly Relevant=1`, `Irrelevant=0` | |
| 131 | - The gains come from rel grades `3/2/1/0` with `gain = 2^rel - 1`, a standard `NDCG` setup. | |
| 130 | + `Fully Relevant=3`, `Mostly Relevant=2`, `Weakly Relevant=1`, `Irrelevant=0` | |
| 131 | + We keep the rel grades `3/2/1/0`, but the current implementation uses the grade values directly as gains, so the exact/high gap is less aggressive. | |
| 132 | 132 | - **Why this is better** |
| 133 | 133 | `NDCG` differentiates “exact”, “strong substitute”, and “weak substitute”, so swapping a `Fully Relevant` item with a `Weakly Relevant` one is penalized more than swapping `Mostly Relevant` with `Weakly Relevant`. |
| 134 | 134 | |
| ... | ... | @@ -174,6 +174,22 @@ Features: query list from `queries.txt`, single-query and batch evaluation, batc |
| 174 | 174 | |
| 175 | 175 | Each run stores aggregate and per-query metrics, label distribution, timestamp, metric context (including gain scheme and primary metric), and an `/admin/config` snapshot, as Markdown and JSON under `batch_reports/`. |
| 176 | 176 | |
| 177 | +To make later case analysis reproducible without digging through backend logs, each per-query record in the batch JSON now also includes: | |
| 178 | + | |
| 179 | +- `request_id` — the exact `X-Request-ID` sent by the evaluator for that live search call | |
| 180 | +- `top_label_sequence_top10` / `top_label_sequence_top20` — compact label sequence strings such as `1:L3 | 2:L1 | 3:L2` | |
| 181 | +- `top_results` — a lightweight top-20 snapshot with `rank`, `spu_id`, `label`, title fields, and `relevance_score` | |
| 182 | + | |
| 183 | +The Markdown report now surfaces the same case context in a lighter human-readable form: | |
| 184 | + | |
| 185 | +- request id | |
| 186 | +- top-10 / top-20 label sequence | |
| 187 | +- top 5 result snapshot for quick scanning | |
| 188 | + | |
| 189 | +This means a bad case can usually be reconstructed directly from the batch artifact itself, without replaying logs or joining SQLite tables by hand. | |
| 190 | + | |
| 191 | +The web history endpoint intentionally returns a compact summary only (aggregate metrics plus query count), so adding richer per-query snapshots to the batch payload does not bloat the history list UI. | |
| 192 | + | |
| 177 | 193 | ## Ranking debug and LTR prep |
| 178 | 194 | |
| 179 | 195 | `debug_info` now exposes two extra layers that are useful for tuning and future learning-to-rank work: | ... | ... |
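Since the gain scheme above now uses the grades `3/2/1/0` directly as gains, a small standalone sketch can make the effect on `NDCG` concrete. This is a simplified illustration with hypothetical helper names, not the framework's actual `compute_query_metrics` implementation:

```python
import math

# Linear gains: grade values 3/2/1/0 are used directly, instead of 2^rel - 1.
GRADES = {"Fully Relevant": 3, "Mostly Relevant": 2, "Weakly Relevant": 1, "Irrelevant": 0}

def dcg(gains):
    # Standard position discount: gain_i / log2(rank + 1), ranks are 1-based.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(labels, k=10):
    gains = [GRADES.get(label, 0) for label in labels[:k]]
    ideal = sorted((GRADES.get(label, 0) for label in labels), reverse=True)[:k]
    best = dcg(ideal)
    return dcg(gains) / best if best > 0 else 0.0

# Swapping ranks 1 and 2 of an otherwise ideal list costs relatively little
# under linear gains; 2^rel - 1 gains would widen the exact/high gap.
print(round(ndcg(["Mostly Relevant", "Fully Relevant", "Irrelevant", "Weakly Relevant"]), 4))  # -> 0.9079
```

With exponential gains (`7/3/1/0`) the same rank-1/rank-2 swap scores noticeably lower, which is exactly the aggressiveness the current scheme dials back.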
scripts/evaluation/eval_framework/__init__.py
| ... | ... | @@ -14,10 +14,10 @@ from .constants import ( # noqa: E402 |
| 14 | 14 | DEFAULT_ARTIFACT_ROOT, |
| 15 | 15 | DEFAULT_QUERY_FILE, |
| 16 | 16 | PROJECT_ROOT, |
| 17 | - RELEVANCE_EXACT, | |
| 18 | - RELEVANCE_HIGH, | |
| 19 | - RELEVANCE_IRRELEVANT, | |
| 20 | - RELEVANCE_LOW, | |
| 17 | + RELEVANCE_LV0, | |
| 18 | + RELEVANCE_LV1, | |
| 19 | + RELEVANCE_LV2, | |
| 20 | + RELEVANCE_LV3, | |
| 21 | 21 | RELEVANCE_NON_IRRELEVANT, |
| 22 | 22 | VALID_LABELS, |
| 23 | 23 | ) |
| ... | ... | @@ -39,10 +39,10 @@ __all__ = [ |
| 39 | 39 | "EvalStore", |
| 40 | 40 | "PROJECT_ROOT", |
| 41 | 41 | "QueryBuildResult", |
| 42 | - "RELEVANCE_EXACT", | |
| 43 | - "RELEVANCE_HIGH", | |
| 44 | - "RELEVANCE_IRRELEVANT", | |
| 45 | - "RELEVANCE_LOW", | |
| 42 | + "RELEVANCE_LV0", | |
| 43 | + "RELEVANCE_LV1", | |
| 44 | + "RELEVANCE_LV2", | |
| 45 | + "RELEVANCE_LV3", | |
| 46 | 46 | "RELEVANCE_NON_IRRELEVANT", |
| 47 | 47 | "SearchEvaluationFramework", |
| 48 | 48 | "VALID_LABELS", | ... | ... |
scripts/evaluation/eval_framework/clients.py
| ... | ... | @@ -157,6 +157,7 @@ class SearchServiceClient: |
| 157 | 157 | return self._request_json("GET", path, timeout=timeout) |
| 158 | 158 | |
| 159 | 159 | def search(self, query: str, size: int, from_: int = 0, language: str = "en", *, debug: bool = False) -> Dict[str, Any]: |
| 160 | + request_id = uuid.uuid4().hex[:8] | |
| 160 | 161 | payload: Dict[str, Any] = { |
| 161 | 162 | "query": query, |
| 162 | 163 | "size": size, |
| ... | ... | @@ -165,13 +166,19 @@ class SearchServiceClient: |
| 165 | 166 | } |
| 166 | 167 | if debug: |
| 167 | 168 | payload["debug"] = True |
| 168 | - return self._request_json( | |
| 169 | + response = self._request_json( | |
| 169 | 170 | "POST", |
| 170 | 171 | "/search/", |
| 171 | 172 | timeout=120, |
| 172 | - headers={"Content-Type": "application/json", "X-Tenant-ID": self.tenant_id}, | |
| 173 | + headers={ | |
| 174 | + "Content-Type": "application/json", | |
| 175 | + "X-Tenant-ID": self.tenant_id, | |
| 176 | + "X-Request-ID": request_id, | |
| 177 | + }, | |
| 173 | 178 | json_payload=payload, |
| 174 | 179 | ) |
| 180 | + response["_eval_request_id"] = request_id | |
| 181 | + return response | |
| 175 | 182 | |
| 176 | 183 | |
| 177 | 184 | class RerankServiceClient: | ... | ... |
scripts/evaluation/eval_framework/constants.py
| ... | ... | @@ -7,24 +7,24 @@ _SCRIPTS_EVAL_DIR = _PKG_DIR.parent |
| 7 | 7 | PROJECT_ROOT = _SCRIPTS_EVAL_DIR.parents[1] |
| 8 | 8 | |
| 9 | 9 | # Canonical English labels (must match LLM prompt output in prompts._CLASSIFY_TEMPLATE_EN) |
| 10 | -RELEVANCE_EXACT = "Fully Relevant" | |
| 11 | -RELEVANCE_HIGH = "Mostly Relevant" | |
| 12 | -RELEVANCE_LOW = "Weakly Relevant" | |
| 13 | -RELEVANCE_IRRELEVANT = "Irrelevant" | |
| 10 | +RELEVANCE_LV3 = "Fully Relevant" | |
| 11 | +RELEVANCE_LV2 = "Mostly Relevant" | |
| 12 | +RELEVANCE_LV1 = "Weakly Relevant" | |
| 13 | +RELEVANCE_LV0 = "Irrelevant" | |
| 14 | 14 | |
| 15 | -VALID_LABELS = frozenset({RELEVANCE_EXACT, RELEVANCE_HIGH, RELEVANCE_LOW, RELEVANCE_IRRELEVANT}) | |
| 15 | +VALID_LABELS = frozenset({RELEVANCE_LV3, RELEVANCE_LV2, RELEVANCE_LV1, RELEVANCE_LV0}) | |
| 16 | 16 | |
| 17 | 17 | # Useful label sets for binary diagnostic slices layered on top of graded ranking metrics. |
| 18 | -RELEVANCE_NON_IRRELEVANT = frozenset({RELEVANCE_EXACT, RELEVANCE_HIGH, RELEVANCE_LOW}) | |
| 19 | -RELEVANCE_STRONG = frozenset({RELEVANCE_EXACT, RELEVANCE_HIGH}) | |
| 18 | +RELEVANCE_NON_IRRELEVANT = frozenset({RELEVANCE_LV3, RELEVANCE_LV2, RELEVANCE_LV1}) | |
| 19 | +RELEVANCE_STRONG = frozenset({RELEVANCE_LV3, RELEVANCE_LV2}) | |
| 20 | 20 | |
| 21 | 21 | # Graded relevance for ranking evaluation. |
| 22 | 22 | # We use rel grades 3/2/1/0 and gain = 2^rel - 1, which is standard for NDCG-style metrics. |
| 23 | 23 | RELEVANCE_GRADE_MAP = { |
| 24 | - RELEVANCE_EXACT: 3, | |
| 25 | - RELEVANCE_HIGH: 2, | |
| 26 | - RELEVANCE_LOW: 1, | |
| 27 | - RELEVANCE_IRRELEVANT: 0, | |
| 24 | + RELEVANCE_LV3: 3, | |
| 25 | + RELEVANCE_LV2: 2, | |
| 26 | + RELEVANCE_LV1: 1, | |
| 27 | + RELEVANCE_LV0: 0, | |
| 28 | 28 | } |
| 29 | 29 | # Standard gain computation: 2^rel - 1 |
| 30 | 30 | # But because annotation quality is not especially precise, we deliberately reduce the separation between exact and high |
| ... | ... | @@ -35,11 +35,12 @@ RELEVANCE_GAIN_MAP = { |
| 35 | 35 | } |
| 36 | 36 | |
| 37 | 37 | # P(stop | relevance) for ERR (Expected Reciprocal Rank); cascade model (Chapelle et al., 2009). |
| 38 | +# p(t) = (2^t - 1) / 2^{max_grade} | |
| 38 | 39 | STOP_PROB_MAP = { |
| 39 | - RELEVANCE_EXACT: 0.99, | |
| 40 | - RELEVANCE_HIGH: 0.8, | |
| 41 | - RELEVANCE_LOW: 0.1, | |
| 42 | - RELEVANCE_IRRELEVANT: 0.0, | |
| 40 | + RELEVANCE_LV3: 0.875, | |
| 41 | + RELEVANCE_LV2: 0.375, | |
| 42 | + RELEVANCE_LV1: 0.125, | |
| 43 | + RELEVANCE_LV0: 0.0, | |
| 43 | 44 | } |
| 44 | 45 | |
| 45 | 46 | DEFAULT_ARTIFACT_ROOT = PROJECT_ROOT / "artifacts" / "search_evaluation" |
| ... | ... | @@ -78,7 +79,7 @@ DEFAULT_REBUILD_MAX_LLM_BATCHES = 40 |
| 78 | 79 | # A batch is "bad" when **both** hold (strict inequalities; see ``framework._annotate_rebuild_batches``): |
| 79 | 80 | # - irrelevant_ratio > DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO (default 93.9%), |
| 80 | 81 | # - (Irrelevant + Weakly Relevant) / n > DEFAULT_REBUILD_IRREL_LOW_COMBINED_STOP_RATIO (default 95.9%). |
| 81 | -# ``irrelevant_ratio`` = Irrelevant count / n; weak relevance is ``RELEVANCE_LOW`` ("Weakly Relevant"). | |
| 82 | +# ``irrelevant_ratio`` = Irrelevant count / n; weak relevance is ``RELEVANCE_LV1`` ("Weakly Relevant"). | |
| 82 | 83 | # Increment streak on consecutive bad batches; reset on any non-bad batch. Stop when streak |
| 83 | 84 | # reaches ``DEFAULT_REBUILD_IRRELEVANT_STOP_STREAK`` (default 3). |
| 84 | 85 | DEFAULT_REBUILD_IRRELEVANT_STOP_RATIO = 0.799 | ... | ... |
scripts/evaluation/eval_framework/framework.py
| ... | ... | @@ -25,14 +25,14 @@ from .constants import ( |
| 25 | 25 | DEFAULT_RERANK_HIGH_SKIP_COUNT, |
| 26 | 26 | DEFAULT_RERANK_HIGH_THRESHOLD, |
| 27 | 27 | DEFAULT_SEARCH_RECALL_TOP_K, |
| 28 | - RELEVANCE_EXACT, | |
| 29 | 28 | RELEVANCE_GAIN_MAP, |
| 30 | - RELEVANCE_HIGH, | |
| 31 | - STOP_PROB_MAP, | |
| 32 | - RELEVANCE_IRRELEVANT, | |
| 33 | - RELEVANCE_LOW, | |
| 29 | + RELEVANCE_LV0, | |
| 30 | + RELEVANCE_LV1, | |
| 31 | + RELEVANCE_LV2, | |
| 32 | + RELEVANCE_LV3, | |
| 34 | 33 | RELEVANCE_NON_IRRELEVANT, |
| 35 | 34 | VALID_LABELS, |
| 35 | + STOP_PROB_MAP, | |
| 36 | 36 | ) |
| 37 | 37 | from .metrics import ( |
| 38 | 38 | PRIMARY_METRIC_GRADE_NORMALIZER, |
| ... | ... | @@ -96,6 +96,16 @@ def _zh_titles_from_debug_per_result(debug_info: Any) -> Dict[str, str]: |
| 96 | 96 | return out |
| 97 | 97 | |
| 98 | 98 | |
| 99 | +def _encode_label_sequence(items: Sequence[Dict[str, Any]], limit: int) -> str: | |
| 100 | + parts: List[str] = [] | |
| 101 | + for item in items[:limit]: | |
| 102 | + rank = int(item.get("rank") or 0) | |
| 103 | + label = str(item.get("label") or "") | |
| 104 | + grade = RELEVANCE_GAIN_MAP.get(label) | |
| 105 | + parts.append(f"{rank}:L{grade}" if grade is not None else f"{rank}:?") | |
| 106 | + return " | ".join(parts) | |
| 107 | + | |
| 108 | + | |
| 99 | 109 | class SearchEvaluationFramework: |
| 100 | 110 | def __init__( |
| 101 | 111 | self, |
| ... | ... | @@ -168,7 +178,7 @@ class SearchEvaluationFramework: |
| 168 | 178 | ) -> Dict[str, Any]: |
| 169 | 179 | live = self.evaluate_live_query(query=query, top_k=top_k, auto_annotate=auto_annotate, language=language) |
| 170 | 180 | labels = [ |
| 171 | - item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT | |
| 181 | + item["label"] if item["label"] in VALID_LABELS else RELEVANCE_LV0 | |
| 172 | 182 | for item in live["results"] |
| 173 | 183 | ] |
| 174 | 184 | return { |
| ... | ... | @@ -432,7 +442,7 @@ class SearchEvaluationFramework: |
| 432 | 442 | |
| 433 | 443 | - ``#(Irrelevant)/n > irrelevant_stop_ratio`` (default 0.939), and |
| 434 | 444 | - ``( #(Irrelevant) + #(Weakly Relevant) ) / n > irrelevant_low_combined_stop_ratio`` |
| 435 | - (default 0.959; weak relevance = ``RELEVANCE_LOW``). | |
| 445 | + (default 0.959; weak relevance = ``RELEVANCE_LV1``). | |
| 436 | 446 | |
| 437 | 447 | Maintain a streak of consecutive *bad* batches; any non-bad batch resets the streak to 0. |
| 438 | 448 | Stop labeling when ``streak >= stop_streak`` (default 3) or when ``max_batches`` is reached |
| ... | ... | @@ -474,9 +484,9 @@ class SearchEvaluationFramework: |
| 474 | 484 | time.sleep(0.1) |
| 475 | 485 | |
| 476 | 486 | n = len(batch_docs) |
| 477 | - exact_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_EXACT) | |
| 478 | - irrel_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_IRRELEVANT) | |
| 479 | - low_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_LOW) | |
| 487 | + exact_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_LV3) | |
| 488 | + irrel_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_LV0) | |
| 489 | + low_n = sum(1 for doc in batch_docs if labels.get(str(doc.get("spu_id"))) == RELEVANCE_LV1) | |
| 480 | 490 | exact_ratio = exact_n / n if n else 0.0 |
| 481 | 491 | irrelevant_ratio = irrel_n / n if n else 0.0 |
| 482 | 492 | low_ratio = low_n / n if n else 0.0 |
| ... | ... | @@ -633,7 +643,7 @@ class SearchEvaluationFramework: |
| 633 | 643 | ) |
| 634 | 644 | |
| 635 | 645 | top100_labels = [ |
| 636 | - item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT | |
| 646 | + item["label"] if item["label"] in VALID_LABELS else RELEVANCE_LV0 | |
| 637 | 647 | for item in search_labeled_results[:100] |
| 638 | 648 | ] |
| 639 | 649 | metrics = compute_query_metrics(top100_labels, ideal_labels=list(labels.values())) |
| ... | ... | @@ -843,7 +853,7 @@ class SearchEvaluationFramework: |
| 843 | 853 | ) |
| 844 | 854 | |
| 845 | 855 | top100_labels = [ |
| 846 | - item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT | |
| 856 | + item["label"] if item["label"] in VALID_LABELS else RELEVANCE_LV0 | |
| 847 | 857 | for item in search_labeled_results[:100] |
| 848 | 858 | ] |
| 849 | 859 | metrics = compute_query_metrics(top100_labels, ideal_labels=list(labels.values())) |
| ... | ... | @@ -920,16 +930,17 @@ class SearchEvaluationFramework: |
| 920 | 930 | "title_zh": title_zh if title_zh and title_zh != primary_title else "", |
| 921 | 931 | "image_url": doc.get("image_url"), |
| 922 | 932 | "label": label, |
| 933 | + "relevance_score": doc.get("relevance_score"), | |
| 923 | 934 | "option_values": list(compact_option_values(doc.get("skus") or [])), |
| 924 | 935 | "product": compact_product_payload(doc), |
| 925 | 936 | } |
| 926 | 937 | ) |
| 927 | 938 | metric_labels = [ |
| 928 | - item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT | |
| 939 | + item["label"] if item["label"] in VALID_LABELS else RELEVANCE_LV0 | |
| 929 | 940 | for item in labeled |
| 930 | 941 | ] |
| 931 | 942 | ideal_labels = [ |
| 932 | - label if label in VALID_LABELS else RELEVANCE_IRRELEVANT | |
| 943 | + label if label in VALID_LABELS else RELEVANCE_LV0 | |
| 933 | 944 | for label in labels.values() |
| 934 | 945 | ] |
| 935 | 946 | label_stats = self.store.get_query_label_stats(self.tenant_id, query) |
| ... | ... | @@ -960,10 +971,10 @@ class SearchEvaluationFramework: |
| 960 | 971 | } |
| 961 | 972 | ) |
| 962 | 973 | label_order = { |
| 963 | - RELEVANCE_EXACT: 0, | |
| 964 | - RELEVANCE_HIGH: 1, | |
| 965 | - RELEVANCE_LOW: 2, | |
| 966 | - RELEVANCE_IRRELEVANT: 3, | |
| 974 | + RELEVANCE_LV3: 0, | |
| 975 | + RELEVANCE_LV2: 1, | |
| 976 | + RELEVANCE_LV1: 2, | |
| 977 | + RELEVANCE_LV0: 3, | |
| 967 | 978 | } |
| 968 | 979 | missing_relevant.sort( |
| 969 | 980 | key=lambda item: ( |
| ... | ... | @@ -989,6 +1000,7 @@ class SearchEvaluationFramework: |
| 989 | 1000 | "top_k": top_k, |
| 990 | 1001 | "metrics": compute_query_metrics(metric_labels, ideal_labels=ideal_labels), |
| 991 | 1002 | "metric_context": _metric_context_payload(), |
| 1003 | + "request_id": str(search_payload.get("_eval_request_id") or ""), | |
| 992 | 1004 | "results": labeled, |
| 993 | 1005 | "missing_relevant": missing_relevant, |
| 994 | 1006 | "label_stats": { |
| ... | ... | @@ -996,9 +1008,9 @@ class SearchEvaluationFramework: |
| 996 | 1008 | "unlabeled_hits_treated_irrelevant": unlabeled_hits, |
| 997 | 1009 | "recalled_hits": len(labeled), |
| 998 | 1010 | "missing_relevant_count": len(missing_relevant), |
| 999 | - "missing_exact_count": sum(1 for item in missing_relevant if item["label"] == RELEVANCE_EXACT), | |
| 1000 | - "missing_high_count": sum(1 for item in missing_relevant if item["label"] == RELEVANCE_HIGH), | |
| 1001 | - "missing_low_count": sum(1 for item in missing_relevant if item["label"] == RELEVANCE_LOW), | |
| 1011 | + "missing_exact_count": sum(1 for item in missing_relevant if item["label"] == RELEVANCE_LV3), | |
| 1012 | + "missing_high_count": sum(1 for item in missing_relevant if item["label"] == RELEVANCE_LV2), | |
| 1013 | + "missing_low_count": sum(1 for item in missing_relevant if item["label"] == RELEVANCE_LV1), | |
| 1002 | 1014 | }, |
| 1003 | 1015 | "tips": tips, |
| 1004 | 1016 | "total": int(search_payload.get("total") or 0), |
| ... | ... | @@ -1014,6 +1026,7 @@ class SearchEvaluationFramework: |
| 1014 | 1026 | force_refresh_labels: bool = False, |
| 1015 | 1027 | ) -> Dict[str, Any]: |
| 1016 | 1028 | per_query = [] |
| 1029 | + case_snapshot_top_n = min(max(int(top_k), 1), 20) | |
| 1017 | 1030 | total_q = len(queries) |
| 1018 | 1031 | _log.info("[batch-eval] starting %s queries top_k=%s auto_annotate=%s", total_q, top_k, auto_annotate) |
| 1019 | 1032 | for q_index, query in enumerate(queries, start=1): |
| ... | ... | @@ -1025,7 +1038,7 @@ class SearchEvaluationFramework: |
| 1025 | 1038 | force_refresh_labels=force_refresh_labels, |
| 1026 | 1039 | ) |
| 1027 | 1040 | labels = [ |
| 1028 | - item["label"] if item["label"] in VALID_LABELS else RELEVANCE_IRRELEVANT | |
| 1041 | + item["label"] if item["label"] in VALID_LABELS else RELEVANCE_LV0 | |
| 1029 | 1042 | for item in live["results"] |
| 1030 | 1043 | ] |
| 1031 | 1044 | per_query.append( |
| ... | ... | @@ -1036,6 +1049,21 @@ class SearchEvaluationFramework: |
| 1036 | 1049 | "metrics": live["metrics"], |
| 1037 | 1050 | "distribution": label_distribution(labels), |
| 1038 | 1051 | "total": live["total"], |
| 1052 | + "request_id": live.get("request_id") or "", | |
| 1053 | + "case_snapshot_top_n": case_snapshot_top_n, | |
| 1054 | + "top_label_sequence_top10": _encode_label_sequence(live["results"], 10), | |
| 1055 | + "top_label_sequence_top20": _encode_label_sequence(live["results"], case_snapshot_top_n), | |
| 1056 | + "top_results": [ | |
| 1057 | + { | |
| 1058 | + "rank": int(item.get("rank") or 0), | |
| 1059 | + "spu_id": str(item.get("spu_id") or ""), | |
| 1060 | + "label": item.get("label"), | |
| 1061 | + "title": item.get("title"), | |
| 1062 | + "title_zh": item.get("title_zh"), | |
| 1063 | + "relevance_score": item.get("relevance_score"), | |
| 1064 | + } | |
| 1065 | + for item in live["results"][:case_snapshot_top_n] | |
| 1066 | + ], | |
| 1039 | 1067 | } |
| 1040 | 1068 | ) |
| 1041 | 1069 | m = live["metrics"] |
| ... | ... | @@ -1055,10 +1083,10 @@ class SearchEvaluationFramework: |
| 1055 | 1083 | ) |
| 1056 | 1084 | aggregate = aggregate_metrics([item["metrics"] for item in per_query]) |
| 1057 | 1085 | aggregate_distribution = { |
| 1058 | - RELEVANCE_EXACT: sum(item["distribution"][RELEVANCE_EXACT] for item in per_query), | |
| 1059 | - RELEVANCE_HIGH: sum(item["distribution"][RELEVANCE_HIGH] for item in per_query), | |
| 1060 | - RELEVANCE_LOW: sum(item["distribution"][RELEVANCE_LOW] for item in per_query), | |
| 1061 | - RELEVANCE_IRRELEVANT: sum(item["distribution"][RELEVANCE_IRRELEVANT] for item in per_query), | |
| 1086 | + RELEVANCE_LV3: sum(item["distribution"][RELEVANCE_LV3] for item in per_query), | |
| 1087 | + RELEVANCE_LV2: sum(item["distribution"][RELEVANCE_LV2] for item in per_query), | |
| 1088 | + RELEVANCE_LV1: sum(item["distribution"][RELEVANCE_LV1] for item in per_query), | |
| 1089 | + RELEVANCE_LV0: sum(item["distribution"][RELEVANCE_LV0] for item in per_query), | |
| 1062 | 1090 | } |
| 1063 | 1091 | batch_id = f"batch_{utc_timestamp()}_{sha1_text(self.tenant_id + '|' + '|'.join(queries))[:10]}" |
| 1064 | 1092 | report_dir = ensure_dir(self.artifact_root / "batch_reports") | ... | ... |
scripts/evaluation/eval_framework/metrics.py
| ... | ... | @@ -6,12 +6,12 @@ import math |
| 6 | 6 | from typing import Dict, Iterable, Sequence |
| 7 | 7 | |
| 8 | 8 | from .constants import ( |
| 9 | - RELEVANCE_EXACT, | |
| 10 | 9 | RELEVANCE_GAIN_MAP, |
| 11 | 10 | RELEVANCE_GRADE_MAP, |
| 12 | - RELEVANCE_HIGH, | |
| 13 | - RELEVANCE_IRRELEVANT, | |
| 14 | - RELEVANCE_LOW, | |
| 11 | + RELEVANCE_LV0, | |
| 12 | + RELEVANCE_LV1, | |
| 13 | + RELEVANCE_LV2, | |
| 14 | + RELEVANCE_LV3, | |
| 15 | 15 | RELEVANCE_NON_IRRELEVANT, |
| 16 | 16 | RELEVANCE_STRONG, |
| 17 | 17 | STOP_PROB_MAP, |
| ... | ... | @@ -33,7 +33,7 @@ PRIMARY_METRIC_GRADE_NORMALIZER = float(max(RELEVANCE_GRADE_MAP.values()) or 1.0 |
| 33 | 33 | def _normalize_label(label: str) -> str: |
| 34 | 34 | if label in RELEVANCE_GRADE_MAP: |
| 35 | 35 | return label |
| 36 | - return RELEVANCE_IRRELEVANT | |
| 36 | + return RELEVANCE_LV0 | |
| 37 | 37 | |
| 38 | 38 | |
| 39 | 39 | def _gains_for_labels(labels: Sequence[str]) -> list[float]: |
| ... | ... | @@ -135,7 +135,7 @@ def compute_query_metrics( |
| 135 | 135 | ideal = list(ideal_labels) if ideal_labels is not None else list(labels) |
| 136 | 136 | metrics: Dict[str, float] = {} |
| 137 | 137 | |
| 138 | - exact_hits = _binary_hits(labels, [RELEVANCE_EXACT]) | |
| 138 | + exact_hits = _binary_hits(labels, [RELEVANCE_LV3]) | |
| 139 | 139 | strong_hits = _binary_hits(labels, RELEVANCE_STRONG) |
| 140 | 140 | useful_hits = _binary_hits(labels, RELEVANCE_NON_IRRELEVANT) |
| 141 | 141 | |
| ... | ... | @@ -183,8 +183,8 @@ def aggregate_metrics(metric_items: Sequence[Dict[str, float]]) -> Dict[str, flo |
| 183 | 183 | |
| 184 | 184 | def label_distribution(labels: Sequence[str]) -> Dict[str, int]: |
| 185 | 185 | return { |
| 186 | - RELEVANCE_EXACT: sum(1 for label in labels if label == RELEVANCE_EXACT), | |
| 187 | - RELEVANCE_HIGH: sum(1 for label in labels if label == RELEVANCE_HIGH), | |
| 188 | - RELEVANCE_LOW: sum(1 for label in labels if label == RELEVANCE_LOW), | |
| 189 | - RELEVANCE_IRRELEVANT: sum(1 for label in labels if label == RELEVANCE_IRRELEVANT), | |
| 186 | + RELEVANCE_LV3: sum(1 for label in labels if label == RELEVANCE_LV3), | |
| 187 | + RELEVANCE_LV2: sum(1 for label in labels if label == RELEVANCE_LV2), | |
| 188 | + RELEVANCE_LV1: sum(1 for label in labels if label == RELEVANCE_LV1), | |
| 189 | + RELEVANCE_LV0: sum(1 for label in labels if label == RELEVANCE_LV0), | |
| 190 | 190 | } | ... | ... |
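The rename above replaces the descriptive label constants with graded `RELEVANCE_LV0`..`RELEVANCE_LV3` names. A minimal sketch of the new `label_distribution` shape, using hypothetical string values in place of the real constants from `eval_framework.constants`:

```python
from collections import Counter

# Hypothetical string values for the renamed constants; the real ones live
# in eval_framework.constants as RELEVANCE_LV0..RELEVANCE_LV3.
RELEVANCE_LV3, RELEVANCE_LV2, RELEVANCE_LV1, RELEVANCE_LV0 = "lv3", "lv2", "lv1", "lv0"
ALL_LEVELS = (RELEVANCE_LV3, RELEVANCE_LV2, RELEVANCE_LV1, RELEVANCE_LV0)

def label_distribution(labels):
    # Count every level explicitly so absent levels report 0 instead of being
    # missing from the dict, matching the per-level sums in metrics.py.
    counts = Counter(labels)
    return {level: counts.get(level, 0) for level in ALL_LEVELS}

dist = label_distribution(["lv3", "lv1", "lv3", "lv0"])
```

Reporting a zero count for unused levels keeps downstream report rendering stable even when a batch never produces a given grade.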
scripts/evaluation/eval_framework/reports.py
| ... | ... | @@ -4,7 +4,7 @@ from __future__ import annotations |
| 4 | 4 | |
| 5 | 5 | from typing import Any, Dict |
| 6 | 6 | |
| 7 | -from .constants import RELEVANCE_EXACT, RELEVANCE_HIGH, RELEVANCE_IRRELEVANT, RELEVANCE_LOW | |
| 7 | +from .constants import RELEVANCE_GAIN_MAP, RELEVANCE_LV0, RELEVANCE_LV1, RELEVANCE_LV2, RELEVANCE_LV3 | |
| 8 | 8 | from .metrics import PRIMARY_METRIC_KEYS |
| 9 | 9 | |
| 10 | 10 | |
| ... | ... | @@ -25,6 +25,38 @@ def _append_metric_block(lines: list[str], metrics: Dict[str, Any]) -> None: |
| 25 | 25 | lines.append(f"- {key}: {value}") |
| 26 | 26 | |
| 27 | 27 | |
| 28 | +def _label_level_code(label: str) -> str: | |
| 29 | + grade = RELEVANCE_GAIN_MAP.get(label) | |
| 30 | + return f"L{grade}" if grade is not None else "?" | |
| 31 | + | |
| 32 | + | |
| 33 | +def _append_case_snapshot(lines: list[str], item: Dict[str, Any]) -> None: | |
| 34 | + request_id = str(item.get("request_id") or "").strip() | |
| 35 | + if request_id: | |
| 36 | + lines.append(f"- Request ID: `{request_id}`") | |
| 37 | + seq10 = str(item.get("top_label_sequence_top10") or "").strip() | |
| 38 | + if seq10: | |
| 39 | + lines.append(f"- Top-10 Labels: `{seq10}`") | |
| 40 | + seq20 = str(item.get("top_label_sequence_top20") or "").strip() | |
| 41 | + if seq20 and seq20 != seq10: | |
| 42 | + lines.append(f"- Top-20 Labels: `{seq20}`") | |
| 43 | + top_results = item.get("top_results") or [] | |
| 44 | + if not top_results: | |
| 45 | + return | |
| 46 | + lines.append("- Case Snapshot:") | |
| 47 | + for result in top_results[:5]: | |
| 48 | + rank = int(result.get("rank") or 0) | |
| 49 | + label = _label_level_code(str(result.get("label") or "")) | |
| 50 | + spu_id = str(result.get("spu_id") or "") | |
| 51 | + title = str(result.get("title") or "") | |
| 52 | + title_zh = str(result.get("title_zh") or "") | |
| 53 | + relevance_score = result.get("relevance_score") | |
| 54 | + score_suffix = f" (rel={relevance_score})" if relevance_score not in (None, "") else "" | |
| 55 | + lines.append(f" - #{rank} [{label}] spu={spu_id} {title}{score_suffix}") | |
| 56 | + if title_zh: | |
| 57 | + lines.append(f" zh: {title_zh}") | |
| 58 | + | |
| 59 | + | |
| 28 | 60 | def render_batch_report_markdown(payload: Dict[str, Any]) -> str: |
| 29 | 61 | lines = [ |
| 30 | 62 | "# Search Batch Evaluation", |
| ... | ... | @@ -56,10 +88,10 @@ def render_batch_report_markdown(payload: Dict[str, Any]) -> str: |
| 56 | 88 | "", |
| 57 | 89 | "## Label Distribution", |
| 58 | 90 | "", |
| 59 | - f"- Fully Relevant: {distribution.get(RELEVANCE_EXACT, 0)}", | |
| 60 | - f"- Mostly Relevant: {distribution.get(RELEVANCE_HIGH, 0)}", | |
| 61 | - f"- Weakly Relevant: {distribution.get(RELEVANCE_LOW, 0)}", | |
| 62 | - f"- Irrelevant: {distribution.get(RELEVANCE_IRRELEVANT, 0)}", | |
| 91 | + f"- Fully Relevant: {distribution.get(RELEVANCE_LV3, 0)}", | |
| 92 | + f"- Mostly Relevant: {distribution.get(RELEVANCE_LV2, 0)}", | |
| 93 | + f"- Weakly Relevant: {distribution.get(RELEVANCE_LV1, 0)}", | |
| 94 | + f"- Irrelevant: {distribution.get(RELEVANCE_LV0, 0)}", | |
| 63 | 95 | ] |
| 64 | 96 | ) |
| 65 | 97 | lines.extend(["", "## Per Query", ""]) |
| ... | ... | @@ -68,9 +100,10 @@ def render_batch_report_markdown(payload: Dict[str, Any]) -> str: |
| 68 | 100 | lines.append("") |
| 69 | 101 | _append_metric_block(lines, item.get("metrics") or {}) |
| 70 | 102 | distribution = item.get("distribution") or {} |
| 71 | - lines.append(f"- Fully Relevant: {distribution.get(RELEVANCE_EXACT, 0)}") | |
| 72 | - lines.append(f"- Mostly Relevant: {distribution.get(RELEVANCE_HIGH, 0)}") | |
| 73 | - lines.append(f"- Weakly Relevant: {distribution.get(RELEVANCE_LOW, 0)}") | |
| 74 | - lines.append(f"- Irrelevant: {distribution.get(RELEVANCE_IRRELEVANT, 0)}") | |
| 103 | + lines.append(f"- Fully Relevant: {distribution.get(RELEVANCE_LV3, 0)}") | |
| 104 | + lines.append(f"- Mostly Relevant: {distribution.get(RELEVANCE_LV2, 0)}") | |
| 105 | + lines.append(f"- Weakly Relevant: {distribution.get(RELEVANCE_LV1, 0)}") | |
| 106 | + lines.append(f"- Irrelevant: {distribution.get(RELEVANCE_LV0, 0)}") | |
| 107 | + _append_case_snapshot(lines, item) | |
| 75 | 108 | lines.append("") |
| 76 | 109 | return "\n".join(lines) | ... | ... |
scripts/evaluation/eval_framework/static/eval_web.js
| ... | ... | @@ -190,7 +190,7 @@ async function loadQueries() { |
| 190 | 190 | |
| 191 | 191 | function historySummaryHtml(meta) { |
| 192 | 192 | const m = meta && meta.aggregate_metrics; |
| 193 | - const nq = (meta && meta.queries && meta.queries.length) || (meta && meta.per_query && meta.per_query.length) || null; | |
| 193 | + const nq = (meta && meta.query_count) || (meta && meta.queries && meta.queries.length) || (meta && meta.per_query && meta.per_query.length) || null; | |
| 194 | 194 | const parts = []; |
| 195 | 195 | if (nq != null) parts.push(`<span>Queries</span> ${nq}`); |
| 196 | 196 | if (m && m["Primary_Metric_Score"] != null) parts.push(`<span>Primary</span> ${fmtNumber(m["Primary_Metric_Score"])}`); | ... | ... |
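The JS change prefers a precomputed `query_count` before falling back to list lengths. The same fallback chain, sketched in Python for clarity (field names as in the diff; behavior assumed from the JS expression):

```python
def history_query_count(meta):
    # Prefer the precomputed query_count from compacted metadata, then fall
    # back to the full lists that older history entries still carry.
    if not meta:
        return None
    if meta.get("query_count"):
        return meta["query_count"]
    for key in ("queries", "per_query"):
        items = meta.get(key)
        if items:
            return len(items)
    return None
```

The fallback keeps old, uncompacted history entries rendering correctly while new entries only need the compact summary.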
scripts/evaluation/eval_framework/store.py
| ... | ... | @@ -23,6 +23,18 @@ class QueryBuildResult: |
| 23 | 23 | output_json_path: Path |
| 24 | 24 | |
| 25 | 25 | |
| 26 | +def _compact_batch_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]: | |
| 27 | + return { | |
| 28 | + "batch_id": metadata.get("batch_id"), | |
| 29 | + "created_at": metadata.get("created_at"), | |
| 30 | + "tenant_id": metadata.get("tenant_id"), | |
| 31 | + "top_k": metadata.get("top_k"), | |
| 32 | + "query_count": len(metadata.get("queries") or []), | |
| 33 | + "aggregate_metrics": dict(metadata.get("aggregate_metrics") or {}), | |
| 34 | + "metric_context": dict(metadata.get("metric_context") or {}), | |
| 35 | + } | |
| 36 | + | |
| 37 | + | |
| 26 | 38 | class EvalStore: |
| 27 | 39 | def __init__(self, db_path: Path): |
| 28 | 40 | self.db_path = db_path |
| ... | ... | @@ -339,6 +351,7 @@ class EvalStore: |
| 339 | 351 | ).fetchall() |
| 340 | 352 | items: List[Dict[str, Any]] = [] |
| 341 | 353 | for row in rows: |
| 354 | + metadata = json.loads(row["metadata_json"]) | |
| 342 | 355 | items.append( |
| 343 | 356 | { |
| 344 | 357 | "batch_id": row["batch_id"], |
| ... | ... | @@ -346,7 +359,7 @@ class EvalStore: |
| 346 | 359 | "output_json_path": row["output_json_path"], |
| 347 | 360 | "report_markdown_path": row["report_markdown_path"], |
| 348 | 361 | "config_snapshot_path": row["config_snapshot_path"], |
| 349 | - "metadata": json.loads(row["metadata_json"]), | |
| 362 | + "metadata": _compact_batch_metadata(metadata), | |
| 350 | 363 | "created_at": row["created_at"], |
| 351 | 364 | } |
| 352 | 365 | ) | ... | ... |
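`_compact_batch_metadata` exists so the history listing returns a summary rather than the full per-query payload for every batch. A self-contained sketch of the compaction, assuming the metadata shape shown in the diff:

```python
def _compact_batch_metadata(metadata: dict) -> dict:
    # Keep summary fields only; the full query list collapses to its length so
    # the history endpoint does not ship per-query payloads for every batch.
    return {
        "batch_id": metadata.get("batch_id"),
        "query_count": len(metadata.get("queries") or []),
        "aggregate_metrics": dict(metadata.get("aggregate_metrics") or {}),
    }

full = {
    "batch_id": "b1",
    "queries": ["遥控车", "玩具车"],
    "aggregate_metrics": {"Primary_Metric_Score": 0.7},
    "per_query": [{"metrics": {}}, {"metrics": {}}],  # dropped by compaction
}
compact = _compact_batch_metadata(full)
```

Copying `aggregate_metrics` with `dict(...)` also detaches the compact view from the parsed JSON, so later mutation of one cannot leak into the other.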
scripts/evaluation/offline_ltr_fit.py
| ... | ... | @@ -23,11 +23,11 @@ if str(PROJECT_ROOT) not in sys.path: |
| 23 | 23 | |
| 24 | 24 | from scripts.evaluation.eval_framework.constants import ( |
| 25 | 25 | DEFAULT_ARTIFACT_ROOT, |
| 26 | - RELEVANCE_EXACT, | |
| 27 | 26 | RELEVANCE_GRADE_MAP, |
| 28 | - RELEVANCE_HIGH, | |
| 29 | - RELEVANCE_IRRELEVANT, | |
| 30 | - RELEVANCE_LOW, | |
| 27 | + RELEVANCE_LV0, | |
| 28 | + RELEVANCE_LV1, | |
| 29 | + RELEVANCE_LV2, | |
| 30 | + RELEVANCE_LV3, | |
| 31 | 31 | ) |
| 32 | 32 | from scripts.evaluation.eval_framework.metrics import aggregate_metrics, compute_query_metrics |
| 33 | 33 | from scripts.evaluation.eval_framework.store import EvalStore |
| ... | ... | @@ -35,10 +35,10 @@ from scripts.evaluation.eval_framework.utils import ensure_dir, utc_timestamp |
| 35 | 35 | |
| 36 | 36 | |
| 37 | 37 | LABELS_BY_GRADE = { |
| 38 | - 3: RELEVANCE_EXACT, | |
| 39 | - 2: RELEVANCE_HIGH, | |
| 40 | - 1: RELEVANCE_LOW, | |
| 41 | - 0: RELEVANCE_IRRELEVANT, | |
| 38 | + 3: RELEVANCE_LV3, | |
| 39 | + 2: RELEVANCE_LV2, | |
| 40 | + 1: RELEVANCE_LV1, | |
| 41 | + 0: RELEVANCE_LV0, | |
| 42 | 42 | } |
| 43 | 43 | |
| 44 | 44 | ... | ... |
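`LABELS_BY_GRADE` is the inverse of the grade map, and the rename makes that relationship explicit. A sketch (hypothetical label strings; the real module uses `RELEVANCE_LV0`..`LV3` and `RELEVANCE_GRADE_MAP`) showing how the inverse can be derived instead of hand-written, so the two mappings cannot drift apart:

```python
# Hypothetical stand-ins for the constants in eval_framework.constants.
RELEVANCE_GRADE_MAP = {"lv3": 3, "lv2": 2, "lv1": 1, "lv0": 0}

# Derive the inverse mapping rather than maintaining it by hand.
LABELS_BY_GRADE = {grade: label for label, grade in RELEVANCE_GRADE_MAP.items()}

# Round-trip check: label -> grade -> label must be the identity.
for label, grade in RELEVANCE_GRADE_MAP.items():
    assert LABELS_BY_GRADE[grade] == label
```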
| ... | ... | @@ -0,0 +1,278 @@ |
| 1 | +#!/usr/bin/env python3 | |
| 2 | +""" | |
| 3 | +Simple HTTP server for saas-search frontend. | |
| 4 | +""" | |
| 5 | + | |
| 6 | +import http.server | |
| 7 | +import socketserver | |
| 8 | +import os | |
| 9 | +import sys | |
| 10 | +import logging | |
| 11 | +import time | |
| 12 | +import urllib.request | |
| 13 | +import urllib.error | |
| 14 | +from collections import defaultdict, deque | |
| 15 | +from pathlib import Path | |
| 16 | +from dotenv import load_dotenv | |
| 17 | + | |
| 18 | +# Load .env file | |
| 19 | +project_root = Path(__file__).resolve().parents[2] | |
| 20 | +load_dotenv(project_root / '.env') | |
| 21 | + | |
| 22 | +# Get API_BASE_URL from environment (not injected by default, so a stale .env cannot override the same-origin setup) | |
| 23 | +# window.API_BASE_URL is injected only when FRONTEND_INJECT_API_BASE_URL=1 is set explicitly. | |
| 24 | +API_BASE_URL = os.getenv('API_BASE_URL') or None | |
| 25 | +INJECT_API_BASE_URL = os.getenv('FRONTEND_INJECT_API_BASE_URL', '0') == '1' | |
| 26 | +# Backend proxy target for same-origin API forwarding | |
| 27 | +BACKEND_PROXY_URL = os.getenv('BACKEND_PROXY_URL', 'http://127.0.0.1:6002').rstrip('/') | |
| 28 | + | |
| 29 | +# Change to frontend directory | |
| 30 | +frontend_dir = os.path.join(project_root, 'frontend') | |
| 31 | +os.chdir(frontend_dir) | |
| 32 | + | |
| 33 | +# FRONTEND_PORT is the canonical config; keep PORT as a secondary fallback. | |
| 34 | +PORT = int(os.getenv('FRONTEND_PORT', os.getenv('PORT', 6003))) | |
| 35 | + | |
| 36 | +# Configure logging to suppress scanner noise | |
| 37 | +logging.basicConfig(level=logging.ERROR, format='%(asctime)s - %(levelname)s - %(message)s') | |
| 38 | + | |
| 39 | +class RateLimitingMixin: | |
| 40 | + """Mixin for rate limiting requests by IP address.""" | |
| 41 | + request_counts = defaultdict(deque) | |
| 42 | + rate_limit = 100 # requests per minute | |
| 43 | + window = 60 # seconds | |
| 44 | + | |
| 45 | + @classmethod | |
| 46 | + def is_rate_limited(cls, ip): | |
| 47 | + now = time.time() | |
| 48 | + | |
| 49 | + # Clean old requests | |
| 50 | + while cls.request_counts[ip] and cls.request_counts[ip][0] < now - cls.window: | |
| 51 | + cls.request_counts[ip].popleft() | |
| 52 | + | |
| 53 | + # Check rate limit | |
| 54 | + if len(cls.request_counts[ip]) > cls.rate_limit: | |
| 55 | + return True | |
| 56 | + | |
| 57 | + cls.request_counts[ip].append(now) | |
| 58 | + return False | |
| 59 | + | |
| 60 | +class MyHTTPRequestHandler(http.server.SimpleHTTPRequestHandler, RateLimitingMixin): | |
| 61 | + """Custom request handler with CORS support and robust error handling.""" | |
| 62 | + | |
| 63 | + _ALLOWED_CORS_HEADERS = "Content-Type, X-Tenant-ID, X-Request-ID, Referer" | |
| 64 | + | |
| 65 | + def _is_proxy_path(self, path: str) -> bool: | |
| 66 | + """Return True for API paths that should be forwarded to backend service.""" | |
| 67 | + return path.startswith('/search/') or path.startswith('/admin/') or path.startswith('/indexer/') | |
| 68 | + | |
| 69 | + def _proxy_to_backend(self): | |
| 70 | + """Proxy current request to backend service on the GPU server.""" | |
| 71 | + target_url = f"{BACKEND_PROXY_URL}{self.path}" | |
| 72 | + method = self.command.upper() | |
| 73 | + | |
| 74 | + try: | |
| 75 | + content_length = int(self.headers.get('Content-Length', '0')) | |
| 76 | + except ValueError: | |
| 77 | + content_length = 0 | |
| 78 | + body = self.rfile.read(content_length) if content_length > 0 else None | |
| 79 | + | |
| 80 | + forward_headers = {} | |
| 81 | + for key, value in self.headers.items(): | |
| 82 | + lk = key.lower() | |
| 83 | + if lk in ('host', 'content-length', 'connection'): | |
| 84 | + continue | |
| 85 | + forward_headers[key] = value | |
| 86 | + | |
| 87 | + req = urllib.request.Request( | |
| 88 | + target_url, | |
| 89 | + data=body, | |
| 90 | + headers=forward_headers, | |
| 91 | + method=method, | |
| 92 | + ) | |
| 93 | + | |
| 94 | + try: | |
| 95 | + with urllib.request.urlopen(req, timeout=30) as resp: | |
| 96 | + resp_body = resp.read() | |
| 97 | + self.send_response(resp.getcode()) | |
| 98 | + for header, value in resp.getheaders(): | |
| 99 | + lh = header.lower() | |
| 100 | + if lh in ('transfer-encoding', 'connection', 'content-length'): | |
| 101 | + continue | |
| 102 | + self.send_header(header, value) | |
| 103 | + self.end_headers() | |
| 104 | + self.wfile.write(resp_body) | |
| 105 | + except urllib.error.HTTPError as e: | |
| 106 | + err_body = e.read() if hasattr(e, 'read') else b'' | |
| 107 | + self.send_response(e.code) | |
| 108 | + if e.headers: | |
| 109 | + for header, value in e.headers.items(): | |
| 110 | + lh = header.lower() | |
| 111 | + if lh in ('transfer-encoding', 'connection', 'content-length'): | |
| 112 | + continue | |
| 113 | + self.send_header(header, value) | |
| 114 | + self.end_headers() | |
| 115 | + if err_body: | |
| 116 | + self.wfile.write(err_body) | |
| 117 | + except Exception as e: | |
| 118 | + logging.error(f"Backend proxy error for {method} {self.path}: {e}") | |
| 119 | + self.send_response(502) | |
| 120 | + self.send_header('Content-Type', 'application/json; charset=utf-8') | |
| 121 | + self.end_headers() | |
| 122 | + self.wfile.write(b'{"error":"Bad Gateway: backend proxy failed"}') | |
| 123 | + | |
| 124 | + def do_GET(self): | |
| 125 | + """Handle GET requests with API config injection.""" | |
| 126 | + path = self.path.split('?')[0] | |
| 127 | + | |
| 128 | + # Proxy API paths to backend first | |
| 129 | + if self._is_proxy_path(path): | |
| 130 | + self._proxy_to_backend() | |
| 131 | + return | |
| 132 | + | |
| 133 | + # Route / to index.html | |
| 134 | + if path == '/' or path == '': | |
| 135 | + self.path = '/index.html' + ('?' + self.path.split('?', 1)[1] if '?' in self.path else '') | |
| 136 | + | |
| 137 | + # Inject API config for HTML files | |
| 138 | + if self.path.endswith('.html'): | |
| 139 | + self._serve_html_with_config() | |
| 140 | + else: | |
| 141 | + super().do_GET() | |
| 142 | + | |
| 143 | + def _serve_html_with_config(self): | |
| 144 | + """Serve HTML with optional API_BASE_URL injected.""" | |
| 145 | + try: | |
| 146 | + file_path = self.path.lstrip('/') | |
| 147 | + if not os.path.exists(file_path): | |
| 148 | + self.send_error(404) | |
| 149 | + return | |
| 150 | + | |
| 151 | + with open(file_path, 'r', encoding='utf-8') as f: | |
| 152 | + html = f.read() | |
| 153 | + | |
| 154 | + # Do not inject API_BASE_URL by default, so a legacy .env (e.g. http://xx:6002) cannot override same-origin calls. | |
| 155 | + # Inject only when FRONTEND_INJECT_API_BASE_URL=1 and API_BASE_URL is non-empty. | |
| 156 | + if INJECT_API_BASE_URL and API_BASE_URL: | |
| 157 | + config_script = f'<script>window.API_BASE_URL="{API_BASE_URL}";</script>\n ' | |
| 158 | + html = html.replace('<script src="/static/js/app.js', config_script + '<script src="/static/js/app.js', 1) | |
| 159 | + | |
| 160 | + self.send_response(200) | |
| 161 | + self.send_header('Content-Type', 'text/html; charset=utf-8') | |
| 162 | + self.end_headers() | |
| 163 | + self.wfile.write(html.encode('utf-8')) | |
| 164 | + except Exception as e: | |
| 165 | + logging.error(f"Error serving HTML: {e}") | |
| 166 | + self.send_error(500) | |
| 167 | + | |
| 168 | + def do_POST(self): | |
| 169 | + """Handle POST requests. Proxy API requests to backend.""" | |
| 170 | + path = self.path.split('?')[0] | |
| 171 | + if self._is_proxy_path(path): | |
| 172 | + self._proxy_to_backend() | |
| 173 | + return | |
| 174 | + self.send_error(405, "Method Not Allowed") | |
| 175 | + | |
| 176 | + def setup(self): | |
| 177 | + """Setup with error handling.""" | |
| 178 | + try: | |
| 179 | + super().setup() | |
| 180 | + except Exception: | |
| 181 | + pass # Silently handle setup errors from scanners | |
| 182 | + | |
| 183 | + def handle_one_request(self): | |
| 184 | + """Handle single request with error catching.""" | |
| 185 | + try: | |
| 186 | + # Check rate limiting | |
| 187 | + client_ip = self.client_address[0] | |
| 188 | + if self.is_rate_limited(client_ip): | |
| 189 | + logging.warning(f"Rate limiting IP: {client_ip}") | |
| 190 | + self.send_error(429, "Too Many Requests") | |
| 191 | + return | |
| 192 | + | |
| 193 | + super().handle_one_request() | |
| 194 | + except (ConnectionResetError, BrokenPipeError): | |
| 195 | + # Client disconnected prematurely - common with scanners | |
| 196 | + pass | |
| 197 | + except UnicodeDecodeError: | |
| 198 | + # Binary data received - not HTTP | |
| 199 | + pass | |
| 200 | + except Exception as e: | |
| 201 | + # Log unexpected errors but don't crash | |
| 202 | + logging.debug(f"Request handling error: {e}") | |
| 203 | + | |
| 204 | + def log_message(self, format, *args): | |
| 205 | + """Suppress logging for malformed requests from scanners.""" | |
| 206 | + message = format % args | |
| 207 | + # Filter out scanner noise | |
| 208 | + noise_patterns = [ | |
| 209 | + "code 400", | |
| 210 | + "Bad request", | |
| 211 | + "Bad request version", | |
| 212 | + "Bad HTTP/0.9 request type", | |
| 213 | + "Bad request syntax" | |
| 214 | + ] | |
| 215 | + if any(pattern in message for pattern in noise_patterns): | |
| 216 | + return | |
| 217 | + # Only log legitimate requests | |
| 218 | + if message and not message.startswith(" ") and len(message) > 10: | |
| 219 | + super().log_message(format, *args) | |
| 220 | + | |
| 221 | + def end_headers(self): | |
| 222 | + # Add CORS headers | |
| 223 | + self.send_header('Access-Control-Allow-Origin', '*') | |
| 224 | + self.send_header('Access-Control-Allow-Methods', 'GET, POST, OPTIONS') | |
| 225 | + self.send_header('Access-Control-Allow-Headers', self._ALLOWED_CORS_HEADERS) | |
| 226 | + # Add security headers | |
| 227 | + self.send_header('X-Content-Type-Options', 'nosniff') | |
| 228 | + self.send_header('X-Frame-Options', 'DENY') | |
| 229 | + self.send_header('X-XSS-Protection', '1; mode=block') | |
| 230 | + super().end_headers() | |
| 231 | + | |
| 232 | + def do_OPTIONS(self): | |
| 233 | + """Handle OPTIONS requests.""" | |
| 234 | + try: | |
| 235 | + path = self.path.split('?')[0] | |
| 236 | + if self._is_proxy_path(path): | |
| 237 | + self.send_response(204) | |
| 238 | + self.end_headers() | |
| 239 | + return | |
| 240 | + self.send_response(200) | |
| 241 | + self.end_headers() | |
| 242 | + except Exception: | |
| 243 | + pass | |
| 244 | + | |
| 245 | +class ThreadedTCPServer(socketserver.ThreadingMixIn, socketserver.TCPServer): | |
| 246 | + """Threaded TCP server with better error handling.""" | |
| 247 | + allow_reuse_address = True | |
| 248 | + daemon_threads = True | |
| 249 | + | |
| 250 | +if __name__ == '__main__': | |
| 251 | + # Check if port is already in use | |
| 252 | + import socket | |
| 253 | + sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) | |
| 254 | + try: | |
| 255 | + sock.bind(("", PORT)) | |
| 256 | + sock.close() | |
| 257 | + except OSError: | |
| 258 | + print(f"ERROR: Port {PORT} is already in use.") | |
| 259 | + print("Please stop the existing server or use a different port.") | |
| 260 | + print(f"To stop existing server: kill $(lsof -t -i:{PORT})") | |
| 261 | + sys.exit(1) | |
| 262 | + | |
| 263 | + # Create threaded server for better concurrency | |
| 264 | + with ThreadedTCPServer(("", PORT), MyHTTPRequestHandler) as httpd: | |
| 265 | + print(f"Frontend server started at http://localhost:{PORT}") | |
| 266 | + print(f"Serving files from: {os.getcwd()}") | |
| 267 | + print("\nPress Ctrl+C to stop the server") | |
| 268 | + | |
| 269 | + try: | |
| 270 | + httpd.serve_forever() | |
| 271 | + except KeyboardInterrupt: | |
| 272 | + print("\nShutting down server...") | |
| 273 | + httpd.shutdown() | |
| 274 | + print("Server stopped") | |
| 275 | + sys.exit(0) | |
| 276 | + except Exception as e: | |
| 277 | + print(f"Server error: {e}") | |
| 278 | + sys.exit(1) | ... | ... |
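The sliding-window limiter in `RateLimitingMixin` above can be exercised in isolation. A minimal sketch with an injectable clock for deterministic testing; note that the diff's check uses `>`, which admits one request beyond the limit, while this sketch uses `>=`:

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Per-IP sliding-window limiter, mirroring RateLimitingMixin.

    limit=2 / window=60 here for illustration; the server uses 100/60.
    """
    def __init__(self, limit=2, window=60.0, clock=time.time):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.hits = defaultdict(deque)

    def is_rate_limited(self, ip: str) -> bool:
        now = self.clock()
        q = self.hits[ip]
        # Evict timestamps that have fallen out of the window.
        while q and q[0] < now - self.window:
            q.popleft()
        if len(q) >= self.limit:  # the diff uses '>', admitting one extra request
            return True
        q.append(now)
        return False

limiter = SlidingWindowLimiter(limit=2, window=60.0, clock=lambda: 100.0)
results = [limiter.is_rate_limited("1.2.3.4") for _ in range(3)]  # → [False, False, True]
```

A frozen clock makes the window behavior testable without sleeping; in production the default `time.time` is used.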
| 1 | 1 | #!/usr/bin/env python3 |
| 2 | -""" | |
| 3 | -Simple HTTP server for saas-search frontend. | |
| 4 | -""" | |
| 2 | +"""Backward-compatible frontend server entrypoint.""" | |
| 5 | 3 | |
| 6 | -import http.server | |
| 7 | -import socketserver | |
| 8 | -import os | |
| 9 | -import sys | |
| 10 | -import logging | |
| 11 | -import time | |
| 12 | -import urllib.request | |
| 13 | -import urllib.error | |
| 14 | -from collections import defaultdict, deque | |
| 15 | -from pathlib import Path | |
| 16 | -from dotenv import load_dotenv | |
| 17 | - | |
| 18 | -# Load .env file | |
| 19 | -project_root = Path(__file__).parent.parent | |
| 20 | -load_dotenv(project_root / '.env') | |
| 21 | - | |
| 22 | -# Get API_BASE_URL from environment (not injected by default, so a stale .env cannot override the same-origin setup) | |
| 23 | -# window.API_BASE_URL is injected only when FRONTEND_INJECT_API_BASE_URL=1 is set explicitly. | |
| 24 | -API_BASE_URL = os.getenv('API_BASE_URL') or None | |
| 25 | -INJECT_API_BASE_URL = os.getenv('FRONTEND_INJECT_API_BASE_URL', '0') == '1' | |
| 26 | -# Backend proxy target for same-origin API forwarding | |
| 27 | -BACKEND_PROXY_URL = os.getenv('BACKEND_PROXY_URL', 'http://127.0.0.1:6002').rstrip('/') | |
| 28 | - | |
| 29 | -# Change to frontend directory | |
| 30 | -frontend_dir = os.path.join(os.path.dirname(__file__), '../frontend') | |
| 31 | -os.chdir(frontend_dir) | |
| 32 | - | |
| 33 | -# FRONTEND_PORT is the canonical config; keep PORT as a secondary fallback. | |
| 34 | -PORT = int(os.getenv('FRONTEND_PORT', os.getenv('PORT', 6003))) | |
| 35 | - | |
| 36 | -# Configure logging to suppress scanner noise | |
| 37 | -logging.basicConfig(level=logging.ERROR, format='%(asctime)s - %(levelname)s - %(message)s') | |
| 38 | - | |
| 39 | -class RateLimitingMixin: | |
| 40 | - """Mixin for rate limiting requests by IP address.""" | |
| 41 | - request_counts = defaultdict(deque) | |
| 42 | - rate_limit = 100 # requests per minute | |
| 43 | - window = 60 # seconds | |
| 44 | - | |
| 45 | - @classmethod | |
| 46 | - def is_rate_limited(cls, ip): | |
| 47 | - now = time.time() | |
| 48 | - | |
| 49 | - # Clean old requests | |
| 50 | - while cls.request_counts[ip] and cls.request_counts[ip][0] < now - cls.window: | |
| 51 | - cls.request_counts[ip].popleft() | |
| 52 | - | |
| 53 | - # Check rate limit | |
| 54 | - if len(cls.request_counts[ip]) > cls.rate_limit: | |
| 55 | - return True | |
| 56 | - | |
| 57 | - cls.request_counts[ip].append(now) | |
| 58 | - return False | |
| 59 | - | |
| 60 | -class MyHTTPRequestHandler(http.server.SimpleHTTPRequestHandler, RateLimitingMixin): | |
| 61 | - """Custom request handler with CORS support and robust error handling.""" | |
| 62 | - | |
| 63 | - def _is_proxy_path(self, path: str) -> bool: | |
| 64 | - """Return True for API paths that should be forwarded to backend service.""" | |
| 65 | - return path.startswith('/search/') or path.startswith('/admin/') or path.startswith('/indexer/') | |
| 66 | - | |
| 67 | - def _proxy_to_backend(self): | |
| 68 | - """Proxy current request to backend service on the GPU server.""" | |
| 69 | - target_url = f"{BACKEND_PROXY_URL}{self.path}" | |
| 70 | - method = self.command.upper() | |
| 71 | - | |
| 72 | - try: | |
| 73 | - content_length = int(self.headers.get('Content-Length', '0')) | |
| 74 | - except ValueError: | |
| 75 | - content_length = 0 | |
| 76 | - body = self.rfile.read(content_length) if content_length > 0 else None | |
| 4 | +from __future__ import annotations | |
| 77 | 5 | |
| 78 | - forward_headers = {} | |
| 79 | - for key, value in self.headers.items(): | |
| 80 | - lk = key.lower() | |
| 81 | - if lk in ('host', 'content-length', 'connection'): | |
| 82 | - continue | |
| 83 | - forward_headers[key] = value | |
| 84 | - | |
| 85 | - req = urllib.request.Request( | |
| 86 | - target_url, | |
| 87 | - data=body, | |
| 88 | - headers=forward_headers, | |
| 89 | - method=method, | |
| 90 | - ) | |
| 91 | - | |
| 92 | - try: | |
| 93 | - with urllib.request.urlopen(req, timeout=30) as resp: | |
| 94 | - resp_body = resp.read() | |
| 95 | - self.send_response(resp.getcode()) | |
| 96 | - for header, value in resp.getheaders(): | |
| 97 | - lh = header.lower() | |
| 98 | - if lh in ('transfer-encoding', 'connection', 'content-length'): | |
| 99 | - continue | |
| 100 | - self.send_header(header, value) | |
| 101 | - self.end_headers() | |
| 102 | - self.wfile.write(resp_body) | |
| 103 | - except urllib.error.HTTPError as e: | |
| 104 | - err_body = e.read() if hasattr(e, 'read') else b'' | |
| 105 | - self.send_response(e.code) | |
| 106 | - if e.headers: | |
| 107 | - for header, value in e.headers.items(): | |
| 108 | - lh = header.lower() | |
| 109 | - if lh in ('transfer-encoding', 'connection', 'content-length'): | |
| 110 | - continue | |
| 111 | - self.send_header(header, value) | |
| 112 | - self.end_headers() | |
| 113 | - if err_body: | |
| 114 | - self.wfile.write(err_body) | |
| 115 | - except Exception as e: | |
| 116 | - logging.error(f"Backend proxy error for {method} {self.path}: {e}") | |
| 117 | - self.send_response(502) | |
| 118 | - self.send_header('Content-Type', 'application/json; charset=utf-8') | |
| 119 | - self.end_headers() | |
| 120 | - self.wfile.write(b'{"error":"Bad Gateway: backend proxy failed"}') | |
| 121 | - | |
| 122 | - def do_GET(self): | |
| 123 | - """Handle GET requests with API config injection.""" | |
| 124 | - path = self.path.split('?')[0] | |
| 125 | - | |
| 126 | - # Proxy API paths to backend first | |
| 127 | - if self._is_proxy_path(path): | |
| 128 | - self._proxy_to_backend() | |
| 129 | - return | |
| 130 | - | |
| 131 | - # Route / to index.html | |
| 132 | - if path == '/' or path == '': | |
| 133 | - self.path = '/index.html' + (self.path.split('?', 1)[1] if '?' in self.path else '') | |
| 134 | - | |
| 135 | - # Inject API config for HTML files | |
| 136 | - if self.path.endswith('.html'): | |
| 137 | - self._serve_html_with_config() | |
| 138 | - else: | |
| 139 | - super().do_GET() | |
| 140 | - | |
| 141 | - def _serve_html_with_config(self): | |
| 142 | - """Serve HTML with optional API_BASE_URL injected.""" | |
| 143 | - try: | |
| 144 | - file_path = self.path.lstrip('/') | |
| 145 | - if not os.path.exists(file_path): | |
| 146 | - self.send_error(404) | |
| 147 | - return | |
| 148 | - | |
| 149 | - with open(file_path, 'r', encoding='utf-8') as f: | |
| 150 | - html = f.read() | |
| 151 | - | |
| 152 | - # Do not inject API_BASE_URL by default, so a legacy .env (e.g. http://xx:6002) cannot override same-origin calls. | |
| 153 | - # Inject only when FRONTEND_INJECT_API_BASE_URL=1 and API_BASE_URL is non-empty. | |
| 154 | - if INJECT_API_BASE_URL and API_BASE_URL: | |
| 155 | - config_script = f'<script>window.API_BASE_URL="{API_BASE_URL}";</script>\n ' | |
| 156 | - html = html.replace('<script src="/static/js/app.js', config_script + '<script src="/static/js/app.js', 1) | |
| 157 | - | |
| 158 | - self.send_response(200) | |
| 159 | - self.send_header('Content-Type', 'text/html; charset=utf-8') | |
| 160 | - self.end_headers() | |
| 161 | - self.wfile.write(html.encode('utf-8')) | |
| 162 | - except Exception as e: | |
| 163 | - logging.error(f"Error serving HTML: {e}") | |
| 164 | - self.send_error(500) | |
| 165 | - | |
| 166 | - def do_POST(self): | |
| 167 | - """Handle POST requests. Proxy API requests to backend.""" | |
| 168 | - path = self.path.split('?')[0] | |
| 169 | - if self._is_proxy_path(path): | |
| 170 | - self._proxy_to_backend() | |
| 171 | - return | |
| 172 | - self.send_error(405, "Method Not Allowed") | |
| 173 | - | |
| 174 | - def setup(self): | |
| 175 | - """Setup with error handling.""" | |
| 176 | - try: | |
| 177 | - super().setup() | |
| 178 | - except Exception: | |
| 179 | - pass # Silently handle setup errors from scanners | |
| 180 | - | |
| 181 | - def handle_one_request(self): | |
| 182 | - """Handle single request with error catching.""" | |
| 183 | - try: | |
| 184 | - # Check rate limiting | |
| 185 | - client_ip = self.client_address[0] | |
| 186 | - if self.is_rate_limited(client_ip): | |
| 187 | - logging.warning(f"Rate limiting IP: {client_ip}") | |
| 188 | - self.send_error(429, "Too Many Requests") | |
| 189 | - return | |
| 190 | - | |
| 191 | - super().handle_one_request() | |
| 192 | - except (ConnectionResetError, BrokenPipeError): | |
| 193 | - # Client disconnected prematurely - common with scanners | |
| 194 | - pass | |
| 195 | - except UnicodeDecodeError: | |
| 196 | - # Binary data received - not HTTP | |
| 197 | - pass | |
| 198 | - except Exception as e: | |
| 199 | - # Log unexpected errors but don't crash | |
| 200 | - logging.debug(f"Request handling error: {e}") | |
| 201 | - | |
| 202 | - def log_message(self, format, *args): | |
| 203 | - """Suppress logging for malformed requests from scanners.""" | |
| 204 | - message = format % args | |
| 205 | - # Filter out scanner noise | |
| 206 | - noise_patterns = [ | |
| 207 | - "code 400", | |
| 208 | - "Bad request", | |
| 209 | - "Bad request version", | |
| 210 | - "Bad HTTP/0.9 request type", | |
| 211 | - "Bad request syntax" | |
| 212 | - ] | |
| 213 | - if any(pattern in message for pattern in noise_patterns): | |
| 214 | - return | |
| 215 | - # Only log legitimate requests | |
| 216 | - if message and not message.startswith(" ") and len(message) > 10: | |
| 217 | - super().log_message(format, *args) | |
| 218 | - | |
| 219 | - def end_headers(self): | |
| 220 | - # Add CORS headers | |
| 221 | - self.send_header('Access-Control-Allow-Origin', '*') | |
| 222 | - self.send_header('Access-Control-Allow-Methods', 'GET, POST, OPTIONS') | |
| 223 | - self.send_header('Access-Control-Allow-Headers', 'Content-Type') | |
| 224 | - # Add security headers | |
| 225 | - self.send_header('X-Content-Type-Options', 'nosniff') | |
| 226 | - self.send_header('X-Frame-Options', 'DENY') | |
| 227 | - self.send_header('X-XSS-Protection', '1; mode=block') | |
| 228 | - super().end_headers() | |
| 229 | - | |
| 230 | - def do_OPTIONS(self): | |
| 231 | - """Handle OPTIONS requests.""" | |
| 232 | - try: | |
| 233 | - path = self.path.split('?')[0] | |
| 234 | - if self._is_proxy_path(path): | |
| 235 | - self.send_response(204) | |
| 236 | - self.end_headers() | |
| 237 | - return | |
| 238 | - self.send_response(200) | |
| 239 | - self.end_headers() | |
| 240 | - except Exception: | |
| 241 | - pass | |
| 242 | - | |
| 243 | -class ThreadedTCPServer(socketserver.ThreadingMixIn, socketserver.TCPServer): | |
| 244 | - """Threaded TCP server with better error handling.""" | |
| 245 | - allow_reuse_address = True | |
| 246 | - daemon_threads = True | |
| 6 | +import runpy | |
| 7 | +from pathlib import Path | |
| 247 | 8 | |
| 248 | -if __name__ == '__main__': | |
| 249 | - # Check if port is already in use | |
| 250 | - import socket | |
| 251 | - sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) | |
| 252 | - try: | |
| 253 | - sock.bind(("", PORT)) | |
| 254 | - sock.close() | |
| 255 | - except OSError: | |
| 256 | - print(f"ERROR: Port {PORT} is already in use.") | |
| 257 | - print(f"Please stop the existing server or use a different port.") | |
| 258 | - print(f"To stop existing server: kill $(lsof -t -i:{PORT})") | |
| 259 | - sys.exit(1) | |
| 260 | - | |
| 261 | - # Create threaded server for better concurrency | |
| 262 | - with ThreadedTCPServer(("", PORT), MyHTTPRequestHandler) as httpd: | |
| 263 | - print(f"Frontend server started at http://localhost:{PORT}") | |
| 264 | - print(f"Serving files from: {os.getcwd()}") | |
| 265 | - print("\nPress Ctrl+C to stop the server") | |
| 266 | 9 | |
| 267 | - try: | |
| 268 | - httpd.serve_forever() | |
| 269 | - except KeyboardInterrupt: | |
| 270 | - print("\nShutting down server...") | |
| 271 | - httpd.shutdown() | |
| 272 | - print("Server stopped") | |
| 273 | - sys.exit(0) | |
| 274 | - except Exception as e: | |
| 275 | - print(f"Server error: {e}") | |
| 276 | - sys.exit(1) | |
| 10 | +if __name__ == "__main__": | |
| 11 | + target = Path(__file__).resolve().parent / "frontend" / "frontend_server.py" | |
| 12 | + runpy.run_path(str(target), run_name="__main__") | ... | ... |
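The added entrypoint above is a thin shim: it keeps the old invocation working by locating the relocated server script and executing it via `runpy` with `run_name="__main__"`, so the target's `if __name__ == "__main__":` guard still fires. A minimal self-contained sketch of the same pattern, using a throwaway script in place of the real `frontend/frontend_server.py` (which is not reproduced in this diff):

```python
import runpy
import tempfile
from pathlib import Path

def run_as_main(script: Path) -> dict:
    """Execute `script` as if it were invoked directly: run_name="__main__"
    makes its `if __name__ == "__main__":` guard fire. Returns the
    resulting module globals so callers can inspect the outcome."""
    return runpy.run_path(str(script), run_name="__main__")

# Demonstration with a stand-in script (hypothetical, for illustration only).
with tempfile.TemporaryDirectory() as d:
    target = Path(d) / "frontend_server.py"
    target.write_text(
        "started = False\n"
        "if __name__ == '__main__':\n"
        "    started = True\n"
    )
    globals_after = run_as_main(target)
    assert globals_after["started"] is True
```

Compared with `subprocess`, this keeps the delegated script in the same interpreter process, which preserves signal handling (e.g. Ctrl+C) and the exit code without extra plumbing.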
scripts/check_data_source.py renamed to scripts/inspect/check_data_source.py
| ... | ... | @@ -14,8 +14,8 @@ import argparse |
| 14 | 14 | from pathlib import Path |
| 15 | 15 | from sqlalchemy import create_engine, text |
| 16 | 16 | |
| 17 | -# Add parent directory to path | |
| 18 | -sys.path.insert(0, str(Path(__file__).parent.parent)) | |
| 17 | +# Add repo root to path | |
| 18 | +sys.path.insert(0, str(Path(__file__).resolve().parents[2])) | |
| 19 | 19 | |
| 20 | 20 | from utils.db_connector import create_db_connection |
| 21 | 21 | |
| ... | ... | @@ -298,4 +298,3 @@ def main(): |
| 298 | 298 | |
| 299 | 299 | if __name__ == '__main__': |
| 300 | 300 | sys.exit(main()) |
| 301 | - | ... | ... |
scripts/check_es_data.py renamed to scripts/inspect/check_es_data.py
| ... | ... | @@ -8,7 +8,7 @@ import os |
| 8 | 8 | import argparse |
| 9 | 9 | from pathlib import Path |
| 10 | 10 | |
| 11 | -sys.path.insert(0, str(Path(__file__).parent.parent)) | |
| 11 | +sys.path.insert(0, str(Path(__file__).resolve().parents[2])) | |
| 12 | 12 | |
| 13 | 13 | from utils.es_client import ESClient |
| 14 | 14 | |
| ... | ... | @@ -265,4 +265,3 @@ def main(): |
| 265 | 265 | |
| 266 | 266 | if __name__ == '__main__': |
| 267 | 267 | sys.exit(main()) |
| 268 | - | ... | ... |
scripts/check_index_mapping.py renamed to scripts/inspect/check_index_mapping.py
| ... | ... | @@ -8,7 +8,7 @@ import sys |
| 8 | 8 | import json |
| 9 | 9 | from pathlib import Path |
| 10 | 10 | |
| 11 | -sys.path.insert(0, str(Path(__file__).parent.parent)) | |
| 11 | +sys.path.insert(0, str(Path(__file__).resolve().parents[2])) | |
| 12 | 12 | |
| 13 | 13 | from utils.es_client import get_es_client_from_env |
| 14 | 14 | from indexer.mapping_generator import get_tenant_index_name | ... | ... |
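The repeated `sys.path` change across the renamed scripts follows from `pathlib` parent indexing: after moving from `scripts/` into `scripts/inspect/`, each script sits one directory deeper, so the repo root is `parents[2]` instead of `parent.parent` (which equals `parents[1]`). A small sketch of the indexing, with an illustrative path:

```python
from pathlib import Path

# Hypothetical location of a renamed script, one level deeper than before.
p = Path("/repo/scripts/inspect/check_es_data.py")

# parents[0] is the containing directory, parents[1] its parent, and so on.
assert p.parents[0] == Path("/repo/scripts/inspect")
assert p.parents[1] == Path("/repo/scripts")
assert p.parents[2] == Path("/repo")  # repo root, as the diff intends
```

The `.resolve()` call added in the diff matters here: without it, a relative invocation like `python check_es_data.py` can yield a `__file__` with too few parent components for `parents[2]` to reach the repo root.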