Commit 2059d959e225b013b52307189f1792f74f38f0c2
1 parent: 2eb281bf
feat(eval): 多评估集统一方案落地,扩展至771条query并启动LLM标注
【方案落地】
- 配置层:在 config/config.yaml 中注册 core_queries(原53条)和 clothing_top771(771条)
核心改动:config/schema.py (line 410) 增加 EvaluationDataset 模型;
config/loader.py (line 304) 提供 get_dataset/list_datasets,兼容旧配置;
新增 scripts/evaluation/eval_framework/datasets.py 作为 dataset registry 辅助模块
- 存储与框架:所有 artifact 按 dataset_id 隔离,标注缓存跨数据集共享
核心改动:store.py (line 1) 增加 dataset_id 字段到 build_runs/batch_runs;
framework.py (line 1) build/batch_evaluate 接受 dataset_id 并固化 snapshot
- CLI 与调参:所有子命令增加 --dataset-id 参数
核心改动:cli.py (line 1)、tune_fusion.py (line 1) 及启动脚本
- Web 与前端:支持动态切换评估集,History 按 dataset 过滤
核心改动:web_app.py (line 1) 新增 /api/datasets,/api/history 支持 dataset_id;
static/index.html 和 eval_web.js (line 1) 增加下拉选择器
【验证与测试】
- 新增 tests/test_search_evaluation_datasets.py,pytest 通过 2 passed
- 编译检查通过(pyflakes/mypy 核心模块)
- eval-web 已按新模型重启并通过健康检查(后续因资源占用不稳定,不影响标注)
【LLM 标注运行状态】
- 目标 dataset:clothing_top771(771条query)
- 手动拉起 reranker(因 search.rerank.enabled=false),确认 /health 正常
- 执行 rebuild --dataset-id clothing_top771,当前已进入第1个 query "白色oversized T-shirt" 的批量标注阶段(llm_batch=24/40)
- 日志:logs/eval.log(主进度),logs/verbose/eval_verbose.log(详细 LLM I/O)
Showing 31 changed files with 3,596 additions and 118 deletions.
artifacts/search_evaluation/build_launches/clothing_top771_rebuild_20260417T090610Z.cmd
0 → 100644
| @@ -0,0 +1 @@ | @@ -0,0 +1 @@ | ||
| 1 | +./.venv/bin/python scripts/evaluation/build_annotation_set.py build --dataset-id clothing_top771 --tenant-id 163 --search-depth 500 --rerank-depth 10000 --reset-artifacts --force-refresh-rerank --force-refresh-labels --language en |
artifacts/search_evaluation/build_launches/clothing_top771_rebuild_20260417T090610Z.pid
0 → 100644
| @@ -0,0 +1 @@ | @@ -0,0 +1 @@ | ||
| 1 | +3792200 |
config/config.yaml
| @@ -48,6 +48,22 @@ product_enrich: | @@ -48,6 +48,22 @@ product_enrich: | ||
| 48 | search_evaluation: | 48 | search_evaluation: |
| 49 | artifact_root: artifacts/search_evaluation | 49 | artifact_root: artifacts/search_evaluation |
| 50 | queries_file: scripts/evaluation/queries/queries.txt | 50 | queries_file: scripts/evaluation/queries/queries.txt |
| 51 | + default_dataset_id: core_queries | ||
| 52 | + datasets: | ||
| 53 | + - dataset_id: core_queries | ||
| 54 | + display_name: Core Queries | ||
| 55 | + description: Legacy baseline evaluation set from queries.txt | ||
| 56 | + query_file: scripts/evaluation/queries/queries.txt | ||
| 57 | + tenant_id: '163' | ||
| 58 | + language: en | ||
| 59 | + enabled: true | ||
| 60 | + - dataset_id: clothing_top771 | ||
| 61 | + display_name: Clothing Filtered 771 | ||
| 62 | + description: 771 clothing / shoes / accessories queries filtered from top1k | ||
| 63 | + query_file: scripts/evaluation/queries/all_keywords.txt.top1w.shuf.top1k.clothing_filtered | ||
| 64 | + tenant_id: '163' | ||
| 65 | + language: en | ||
| 66 | + enabled: true | ||
| 51 | eval_log_dir: logs | 67 | eval_log_dir: logs |
| 52 | default_tenant_id: '163' | 68 | default_tenant_id: '163' |
| 53 | search_base_url: '' | 69 | search_base_url: '' |
| @@ -651,4 +667,4 @@ tenant_config: | @@ -651,4 +667,4 @@ tenant_config: | ||
| 651 | primary_language: en | 667 | primary_language: en |
| 652 | index_languages: | 668 | index_languages: |
| 653 | - en | 669 | - en |
| 654 | - - zh | ||
| 655 | \ No newline at end of file | 670 | \ No newline at end of file |
| 671 | + - zh |
config/loader.py
| @@ -47,6 +47,7 @@ from config.schema import ( | @@ -47,6 +47,7 @@ from config.schema import ( | ||
| 47 | RuntimeConfig, | 47 | RuntimeConfig, |
| 48 | SearchConfig, | 48 | SearchConfig, |
| 49 | SearchEvaluationConfig, | 49 | SearchEvaluationConfig, |
| 50 | + SearchEvaluationDatasetConfig, | ||
| 50 | SecretsConfig, | 51 | SecretsConfig, |
| 51 | ServicesConfig, | 52 | ServicesConfig, |
| 52 | SPUConfig, | 53 | SPUConfig, |
| @@ -350,11 +351,66 @@ class AppConfigLoader: | @@ -350,11 +351,66 @@ class AppConfigLoader: | ||
| 350 | else: | 351 | else: |
| 351 | search_base_url = str(raw_search_url).strip() | 352 | search_base_url = str(raw_search_url).strip() |
| 352 | 353 | ||
| 354 | + default_tenant_id = _str("default_tenant_id", "163") | ||
| 355 | + default_language = _str("default_language", "en") | ||
| 356 | + datasets_raw = se.get("datasets") | ||
| 357 | + datasets: List[SearchEvaluationDatasetConfig] = [] | ||
| 358 | + if isinstance(datasets_raw, list): | ||
| 359 | + for idx, item in enumerate(datasets_raw): | ||
| 360 | + if not isinstance(item, dict): | ||
| 361 | + raise ConfigurationError( | ||
| 362 | + f"search_evaluation.datasets[{idx}] must be a mapping, got {type(item).__name__}" | ||
| 363 | + ) | ||
| 364 | + dataset_id = str(item.get("dataset_id") or "").strip() | ||
| 365 | + if not dataset_id: | ||
| 366 | + raise ConfigurationError(f"search_evaluation.datasets[{idx}].dataset_id is required") | ||
| 367 | + display_name = str(item.get("display_name") or dataset_id).strip() or dataset_id | ||
| 368 | + description = str(item.get("description") or "").strip() | ||
| 369 | + query_file = _project_path(item.get("query_file"), default_queries) | ||
| 370 | + tenant_id = str(item.get("tenant_id") or default_tenant_id).strip() or default_tenant_id | ||
| 371 | + language = str(item.get("language") or default_language).strip() or default_language | ||
| 372 | + enabled = bool(item.get("enabled", True)) | ||
| 373 | + datasets.append( | ||
| 374 | + SearchEvaluationDatasetConfig( | ||
| 375 | + dataset_id=dataset_id, | ||
| 376 | + display_name=display_name, | ||
| 377 | + description=description, | ||
| 378 | + query_file=query_file, | ||
| 379 | + tenant_id=tenant_id, | ||
| 380 | + language=language, | ||
| 381 | + enabled=enabled, | ||
| 382 | + ) | ||
| 383 | + ) | ||
| 384 | + if not datasets: | ||
| 385 | + datasets = [ | ||
| 386 | + SearchEvaluationDatasetConfig( | ||
| 387 | + dataset_id="core_queries", | ||
| 388 | + display_name="Core Queries", | ||
| 389 | + description="Legacy evaluation query set", | ||
| 390 | + query_file=_project_path(se.get("queries_file"), default_queries), | ||
| 391 | + tenant_id=default_tenant_id, | ||
| 392 | + language=default_language, | ||
| 393 | + enabled=True, | ||
| 394 | + ) | ||
| 395 | + ] | ||
| 396 | + default_dataset_id = str(se.get("default_dataset_id") or "").strip() or datasets[0].dataset_id | ||
| 397 | + dataset_ids = {item.dataset_id for item in datasets} | ||
| 398 | + if default_dataset_id not in dataset_ids: | ||
| 399 | + raise ConfigurationError( | ||
| 400 | + f"search_evaluation.default_dataset_id={default_dataset_id!r} is not present in search_evaluation.datasets" | ||
| 401 | + ) | ||
| 402 | + legacy_queries_file = next( | ||
| 403 | + (item.query_file for item in datasets if item.dataset_id == default_dataset_id), | ||
| 404 | + datasets[0].query_file, | ||
| 405 | + ) | ||
| 406 | + | ||
| 353 | return SearchEvaluationConfig( | 407 | return SearchEvaluationConfig( |
| 354 | artifact_root=_project_path(se.get("artifact_root"), default_artifact), | 408 | artifact_root=_project_path(se.get("artifact_root"), default_artifact), |
| 355 | - queries_file=_project_path(se.get("queries_file"), default_queries), | 409 | + queries_file=legacy_queries_file, |
| 410 | + default_dataset_id=default_dataset_id, | ||
| 411 | + datasets=tuple(datasets), | ||
| 356 | eval_log_dir=_project_path(se.get("eval_log_dir"), default_log_dir), | 412 | eval_log_dir=_project_path(se.get("eval_log_dir"), default_log_dir), |
| 357 | - default_tenant_id=_str("default_tenant_id", "163"), | 413 | + default_tenant_id=default_tenant_id, |
| 358 | search_base_url=search_base_url, | 414 | search_base_url=search_base_url, |
| 359 | web_host=_str("web_host", "0.0.0.0"), | 415 | web_host=_str("web_host", "0.0.0.0"), |
| 360 | web_port=_int("web_port", 6010), | 416 | web_port=_int("web_port", 6010), |
| @@ -372,7 +428,7 @@ class AppConfigLoader: | @@ -372,7 +428,7 @@ class AppConfigLoader: | ||
| 372 | batch_top_k=_int("batch_top_k", 100), | 428 | batch_top_k=_int("batch_top_k", 100), |
| 373 | audit_top_k=_int("audit_top_k", 100), | 429 | audit_top_k=_int("audit_top_k", 100), |
| 374 | audit_limit_suspicious=_int("audit_limit_suspicious", 5), | 430 | audit_limit_suspicious=_int("audit_limit_suspicious", 5), |
| 375 | - default_language=_str("default_language", "en"), | 431 | + default_language=default_language, |
| 376 | search_recall_top_k=_int("search_recall_top_k", 200), | 432 | search_recall_top_k=_int("search_recall_top_k", 200), |
| 377 | rerank_high_threshold=_float("rerank_high_threshold", 0.5), | 433 | rerank_high_threshold=_float("rerank_high_threshold", 0.5), |
| 378 | rerank_high_skip_count=_int("rerank_high_skip_count", 1000), | 434 | rerank_high_skip_count=_int("rerank_high_skip_count", 1000), |
config/schema.py
| @@ -408,11 +408,26 @@ class AssetsConfig: | @@ -408,11 +408,26 @@ class AssetsConfig: | ||
| 408 | 408 | ||
| 409 | 409 | ||
| 410 | @dataclass(frozen=True) | 410 | @dataclass(frozen=True) |
| 411 | +class SearchEvaluationDatasetConfig: | ||
| 412 | + """Named query-set definition for the search evaluation framework.""" | ||
| 413 | + | ||
| 414 | + dataset_id: str | ||
| 415 | + display_name: str | ||
| 416 | + description: str | ||
| 417 | + query_file: Path | ||
| 418 | + tenant_id: str | ||
| 419 | + language: str | ||
| 420 | + enabled: bool = True | ||
| 421 | + | ||
| 422 | + | ||
| 423 | +@dataclass(frozen=True) | ||
| 411 | class SearchEvaluationConfig: | 424 | class SearchEvaluationConfig: |
| 412 | """Offline / web UI search evaluation (YAML: ``search_evaluation``).""" | 425 | """Offline / web UI search evaluation (YAML: ``search_evaluation``).""" |
| 413 | 426 | ||
| 414 | artifact_root: Path | 427 | artifact_root: Path |
| 415 | queries_file: Path | 428 | queries_file: Path |
| 429 | + default_dataset_id: str | ||
| 430 | + datasets: Tuple[SearchEvaluationDatasetConfig, ...] | ||
| 416 | eval_log_dir: Path | 431 | eval_log_dir: Path |
| 417 | default_tenant_id: str | 432 | default_tenant_id: str |
| 418 | search_base_url: str | 433 | search_base_url: str |
docs/issues/issue-2026-04-16-bayes寻参-TODO.md
| @@ -6,26 +6,366 @@ | @@ -6,26 +6,366 @@ | ||
| 6 | 6 | ||
| 7 | 7 | ||
| 8 | 8 | ||
| 9 | -一、扩展评估标注集 | 9 | +0、得到all_keywords.txt.top1w.shuf.top1k.clothing_filtered(done) |
| 10 | 10 | ||
| 11 | -参考当前的评估框架 | ||
| 12 | -@scripts/evaluation/README.md @scripts/evaluation/eval_framework/framework.py | ||
| 13 | -@start_eval.sh.sh | ||
| 14 | -当前,是基于54个评测样本(queries.txt),建立了自动化评估的系统,便于发现策略在这个评估集上的效果。 | 11 | +方法1(目前这么做的): |
| 12 | +用awk,读取not_clothing.txt作为set,对all_keywords.txt.top1w.shuf.top1k每一行,如果该行在set中,则过滤,得到过滤后的文件,生成文件:all_keywords.txt.top1w.shuf.top1k.clothing_filtered | ||
| 15 | 13 | ||
| 16 | -我需要扩大评估样本,将样本扩大到1k条,文件是scripts/evaluation/queries/all_keywords.txt.top1w.shuf.top1k | ||
| 17 | -但是这个文件还混杂了一些非“服饰鞋帽”类搜索词,请先做一遍清理。 | 14 | +方法2: |
| 15 | +scripts/evaluation/queries/all_keywords.txt.top1w.shuf.top1k | ||
| 16 | +这个文件还混杂了一些非“服饰鞋帽”类搜索词,请先做一遍清理。 | ||
| 18 | 用llm做剔除,每次输入50条,提示词是: | 17 |
| 19 | Please filter out the queries from the following list that do not belong to the clothing, shoes, and accessories category. Output the original list of queries, one query per line, without any additional content. | 18 | Please filter out the queries from the following list that do not belong to the clothing, shoes, and accessories category. Output the original list of queries, one query per line, without any additional content. |
| 20 | 19 | ||
| 21 | 然后将返回的,从原始query剔除。 | 20 |
| 22 | 生成文件:all_keywords.txt.top1w.shuf.top1k.clothing_filtered | 21 | 生成文件:all_keywords.txt.top1w.shuf.top1k.clothing_filtered |
| 23 | 22 | ||
| 24 | -然后以all_keywords.txt.top1w.shuf.top1k.clothing_filtered为query集合,走标注流程,从而新建一个标注集。 | 23 | + |
| 24 | + | ||
| 25 | +一、扩展评估标注集 | ||
| 26 | + | ||
| 27 | +参考当前的评估框架 | ||
| 28 | +@scripts/evaluation/README.md @scripts/evaluation/eval_framework/framework.py | ||
| 29 | +@start_eval.sh.sh | ||
| 30 | +当前,是基于54个评测样本(queries.txt),建立了自动化评估的系统,便于发现策略在这个评估集上的效果。 | ||
| 31 | + | ||
| 32 | +我需要扩大评估样本,使用all_keywords.txt.top1w.shuf.top1k.clothing_filtered(771条)为query集合,走标注流程,从而新建一个标注集。 | ||
| 25 | 那么以后eval-web服务,现在的Batch Evaluation按钮,应该支持多个评估集合,左侧的History,也有对应多个评估集合的评估结果,请你考虑如何支持、如何设计。请进行统一的设计,不要补丁式的支持。 | 33 | 那么以后eval-web服务,现在的Batch Evaluation按钮,应该支持多个评估集合,左侧的History,也有对应多个评估集合的评估结果,请你考虑如何支持、如何设计。请进行统一的设计,不要补丁式的支持。 |
| 26 | 34 | ||
| 35 | +统一设计方案(2026-04-17) | ||
| 36 | + | ||
| 37 | +先校正一下现状口径: | ||
| 38 | + | ||
| 39 | +- `scripts/evaluation/queries/queries.txt` 当前仓库里是 53 条非空 query,不是 54 条。 | ||
| 40 | +- `scripts/evaluation/queries/all_keywords.txt.top1w.shuf.top1k.clothing_filtered` 当前是 771 条。 | ||
| 41 | + | ||
| 42 | +当前实现的问题,不只是 UI 没有下拉框,而是“评估集”这个概念在系统里还不是一等公民: | ||
| 43 | + | ||
| 44 | +- 配置层只有一个全局 `search_evaluation.queries_file` | ||
| 45 | +- Web UI 左侧 Queries/History 默认只服务这一份 query 文件 | ||
| 46 | +- `batch_runs` / `build_runs` 历史记录没有 `dataset_id` | ||
| 47 | +- 产物目录是全局平铺的 `batch_reports/`、`query_builds/` | ||
| 48 | +- `start_eval.sh` / `start_eval_web.sh` / `tune_fusion.py` 都是通过 `queries_file` 隐式指定评估集 | ||
| 49 | +- `--reset-artifacts` 现在会清空整套 SQLite + query_builds,多评估集后这个语义会变得危险 | ||
| 50 | + | ||
| 51 | +所以这里要做的,不是“给 batch API 多传一个文件路径”,而是把“评估集”抽成贯穿配置、存储、API、UI、产物、调参脚本的一层统一模型。 | ||
| 52 | + | ||
| 53 | +设计目标 | ||
| 54 | + | ||
| 55 | +1. 一个 eval-web 服务同时支持多个评估集。 | ||
| 56 | +2. Batch Evaluation、History、调参任务都必须明确绑定某个评估集。 | ||
| 57 | +3. 历史结果必须可追溯到“当时到底用了哪一批 query”,不能因为 query 文件后续变更而失真。 | ||
| 58 | +4. 相同 `(tenant_id, query, spu_id)` 的标签尽量复用,不因为 query 同时出现在两个评估集里就重复标注。 | ||
| 59 | +5. 扩展到第三个、第四个评估集时,不需要再改表结构思路或前端交互模型。 | ||
| 60 | + | ||
| 61 | +核心抽象:区分“评估集”与“标签缓存” | ||
| 62 | + | ||
| 63 | +- 评估集(Evaluation Dataset):一组有稳定 `dataset_id` 的 query 集合,用来驱动 build、batch、history、调参。 | ||
| 64 | +- 标签缓存(Label Cache):对 `(tenant_id, query_text, spu_id)` 的相关性判断结果。 | ||
| 65 | + | ||
| 66 | +这两者不要混为一谈。 | ||
| 67 | + | ||
| 68 | +建议保留现有 `relevance_labels` / `rerank_scores` 的“按 query 共享缓存”设计,不按 dataset 拆表,原因: | ||
| 69 | + | ||
| 70 | +- 同一个 query 如果同时属于 `core_queries` 和 `clothing_top771`,其 `(query, spu_id)` 标签语义本质相同,应该复用。 | ||
| 71 | +- 这样新增大评估集时,只需要补齐新 query 的标签,不会对已有 query 重复做 LLM 标注。 | ||
| 72 | +- 真正需要 dataset 维度的是:运行历史、构建历史、覆盖率统计、产物归档、UI 选择上下文。 | ||
| 73 | + | ||
| 74 | +配置设计 | ||
| 75 | + | ||
| 76 | +把当前单一 `queries_file` 升级为“评估集注册表”。建议在 `config.yaml` 中变成: | ||
| 77 | + | ||
| 78 | +```yaml | ||
| 79 | +search_evaluation: | ||
| 80 | + artifact_root: artifacts/search_evaluation | ||
| 81 | + default_dataset_id: core_queries | ||
| 82 | + datasets: | ||
| 83 | + - dataset_id: core_queries | ||
| 84 | + display_name: Core Queries | ||
| 85 | + description: Legacy baseline query set from queries.txt | ||
| 86 | + query_file: scripts/evaluation/queries/queries.txt | ||
| 87 | + tenant_id: "163" | ||
| 88 | + language: en | ||
| 89 | + enabled: true | ||
| 90 | + - dataset_id: clothing_top771 | ||
| 91 | + display_name: Clothing Filtered 771 | ||
| 92 | + description: 771 filtered clothing/shoes/accessories queries | ||
| 93 | + query_file: scripts/evaluation/queries/all_keywords.txt.top1w.shuf.top1k.clothing_filtered | ||
| 94 | + tenant_id: "163" | ||
| 95 | + language: en | ||
| 96 | + enabled: true | ||
| 97 | + | ||
| 98 | + # 保留这些作为全局默认值;dataset 没显式覆盖时继承 | ||
| 99 | + batch_top_k: 100 | ||
| 100 | + audit_top_k: 100 | ||
| 101 | + build_search_depth: 1000 | ||
| 102 | + build_rerank_depth: 10000 | ||
| 103 | +``` | ||
| 104 | + | ||
| 105 | +建议点: | ||
| 106 | + | ||
| 107 | +- `dataset_id` 是稳定主键,前后端、SQLite、历史记录、调参脚本都只认它,不认文件路径。 | ||
| 108 | +- `query_file` 只是这个 dataset 当前版本的来源,不是外部协议的一部分。 | ||
| 109 | +- 继续保留全局默认参数;以后如果某个 dataset 需要特殊 top_k / language,再支持局部覆盖。 | ||
| 110 | +- 为兼容老脚本,可暂时保留 `queries_file`,但只作为 fallback,在 loader 里自动转换成一个隐式 dataset;新代码不再直接依赖它。 | ||
| 111 | + | ||
| 112 | +产物目录设计 | ||
| 113 | + | ||
| 114 | +当前所有 batch 报告都平铺在 `artifacts/search_evaluation/batch_reports/` 下,后面 dataset 一多会很乱。建议改成“共享缓存 + dataset 独立产物目录”: | ||
| 115 | + | ||
| 116 | +```text | ||
| 117 | +artifacts/search_evaluation/ | ||
| 118 | + search_eval.sqlite3 # 共享标签缓存/共享 rerank 缓存/运行索引 | ||
| 119 | + datasets/ | ||
| 120 | + core_queries/ | ||
| 121 | + batch_reports/ | ||
| 122 | + <batch_id>/ | ||
| 123 | + report.json | ||
| 124 | + report.md | ||
| 125 | + config_snapshot.json | ||
| 126 | + dataset_snapshot.json | ||
| 127 | + queries.txt | ||
| 128 | + query_builds/ | ||
| 129 | + <run_id>.json | ||
| 130 | + audits/ | ||
| 131 | + ... | ||
| 132 | + clothing_top771/ | ||
| 133 | + batch_reports/ | ||
| 134 | + <batch_id>/ | ||
| 135 | + ... | ||
| 136 | + query_builds/ | ||
| 137 | + <run_id>.json | ||
| 138 | + audits/ | ||
| 139 | + ... | ||
| 140 | +``` | ||
| 141 | + | ||
| 142 | +重点是每次 batch/build 都要固化 dataset snapshot: | ||
| 143 | + | ||
| 144 | +- `dataset_id` | ||
| 145 | +- `display_name` | ||
| 146 | +- `query_file` | ||
| 147 | +- `query_count` | ||
| 148 | +- `query_sha1` | ||
| 149 | +- 当次实际 queries 副本 `queries.txt` | ||
| 150 | + | ||
| 151 | +这样即使以后 `all_keywords...clothing_filtered` 文件被重新清洗、条数变化,历史 batch 仍然可复现“当时到底评了哪些 query”。 | ||
| 152 | + | ||
| 153 | +SQLite / 存储层设计 | ||
| 154 | + | ||
| 155 | +共享缓存表可以继续保留: | ||
| 156 | + | ||
| 157 | +- `relevance_labels(tenant_id, query_text, spu_id, ...)` | ||
| 158 | +- `rerank_scores(tenant_id, query_text, spu_id, ...)` | ||
| 159 | +- `query_profiles(tenant_id, query_text, prompt_version, ...)` | ||
| 160 | + | ||
| 161 | +需要升级的是运行历史表: | ||
| 162 | + | ||
| 163 | +1. `build_runs` 增加 | ||
| 164 | + - `dataset_id` | ||
| 165 | + - `dataset_display_name` | ||
| 166 | + - `dataset_query_file` | ||
| 167 | + - `dataset_query_count` | ||
| 168 | + - `dataset_query_sha1` | ||
| 169 | + | ||
| 170 | +2. `batch_runs` 增加 | ||
| 171 | + - `dataset_id` | ||
| 172 | + - `dataset_display_name` | ||
| 173 | + - `dataset_query_file` | ||
| 174 | + - `dataset_query_count` | ||
| 175 | + - `dataset_query_sha1` | ||
| 176 | + | ||
| 177 | +3. `list_batch_runs()` / `get_batch_run()` / `insert_batch_run()` 全部变成 dataset-aware | ||
| 178 | + | ||
| 179 | +4. 覆盖率统计接口按 dataset 聚合,而不是简单按全库 query 聚合 | ||
| 180 | + | ||
| 181 | + - 当前 `list_query_label_stats(tenant_id)` 是“全量 query_text 分组” | ||
| 182 | + - 以后应该是“给定 dataset_id 后,只统计该 dataset queries 的覆盖情况” | ||
| 183 | + | ||
| 184 | +这里建议不要额外把 query 全量写进 SQLite 做注册表主数据,query 主数据仍从 config + query_file 解析即可;SQLite 只负责记录 run 时的 snapshot 元数据。 | ||
| 185 | + | ||
| 186 | +API 设计 | ||
| 187 | + | ||
| 188 | +建议把 Web API 升级成以 dataset 为主轴,而不是默认只服务一个 `query_file`: | ||
| 189 | + | ||
| 190 | +1. `GET /api/datasets` | ||
| 191 | + | ||
| 192 | +返回所有可用评估集: | ||
| 193 | + | ||
| 194 | +- `dataset_id` | ||
| 195 | +- `display_name` | ||
| 196 | +- `description` | ||
| 197 | +- `query_count` | ||
| 198 | +- `query_file` | ||
| 199 | +- `tenant_id` | ||
| 200 | +- `language` | ||
| 201 | +- `coverage_summary` | ||
| 202 | + | ||
| 203 | +2. `GET /api/datasets/{dataset_id}/queries` | ||
| 204 | + | ||
| 205 | +返回该 dataset 的 query 列表,以及 dataset 元信息。 | ||
| 206 | + | ||
| 207 | +3. `POST /api/search-eval` | ||
| 208 | + | ||
| 209 | +请求体增加可选 `dataset_id`。 | ||
| 210 | + | ||
| 211 | +- 单 query 评估本身仍然可以支持任意 query 文本 | ||
| 212 | +- 但当页面处于某个 dataset 上下文时,返回里也带上该 dataset 信息,便于 UI 一致展示 | ||
| 213 | + | ||
| 214 | +4. `POST /api/batch-eval` | ||
| 215 | + | ||
| 216 | +请求体优先使用 `dataset_id`,不再默认依赖服务启动时绑定的唯一 `query_file`。 | ||
| 217 | + | ||
| 218 | +建议请求模型变成: | ||
| 219 | + | ||
| 220 | +```json | ||
| 221 | +{ | ||
| 222 | + "dataset_id": "clothing_top771", | ||
| 223 | + "top_k": 100, | ||
| 224 | + "auto_annotate": false, | ||
| 225 | + "language": "en", | ||
| 226 | + "force_refresh_labels": false | ||
| 227 | +} | ||
| 228 | +``` | ||
| 229 | + | ||
| 230 | +`queries` 字段可保留为高级/调试能力,但 UI 主路径和调参脚本主路径都应该走 `dataset_id`。 | ||
| 231 | + | ||
| 232 | +5. `GET /api/history?dataset_id=clothing_top771&limit=20` | ||
| 27 | 233 | ||
| 234 | +History 默认按当前 dataset 过滤;如有需要再支持 `all=true` 看全量。 | ||
| 28 | 235 | ||
| 236 | +6. `GET /api/history/{batch_id}/report` | ||
| 237 | + | ||
| 238 | +返回报告时补充 dataset 元信息,前端 report modal 里能看到这是哪个 dataset 的报告。 | ||
| 239 | + | ||
| 240 | +前端 / eval-web 交互设计 | ||
| 241 | + | ||
| 242 | +现在左侧栏写死了: | ||
| 243 | + | ||
| 244 | +- Queries 来自 `queries.txt` | ||
| 245 | +- History 没有 dataset 维度 | ||
| 246 | + | ||
| 247 | +建议改成三层结构: | ||
| 248 | + | ||
| 249 | +1. 左上增加 Dataset Selector | ||
| 250 | + | ||
| 251 | +- 下拉框或 tabs,显示 `Core Queries (53)`、`Clothing Filtered 771 (771)` | ||
| 252 | +- 当前选中的 dataset 决定左侧 query 列表和默认 history 过滤 | ||
| 253 | + | ||
| 254 | +2. Queries 区域绑定当前 dataset | ||
| 255 | + | ||
| 256 | +- 标题显示 dataset 名称 + query 数 | ||
| 257 | +- 副标题显示 query 文件路径 | ||
| 258 | +- 点击 query 触发单 query 评估 | ||
| 259 | + | ||
| 260 | +3. History 区域绑定当前 dataset | ||
| 261 | + | ||
| 262 | +- 默认只显示当前 dataset 的 batch history | ||
| 263 | +- 每个 item 显示 `dataset badge + batch_id + created_at + query_count + primary metrics` | ||
| 264 | +- 可选再加一个 “All Datasets” 开关,但默认视角一定要是“当前 dataset” | ||
| 265 | + | ||
| 266 | +4. 主区 Batch Evaluation 按钮绑定当前 dataset | ||
| 267 | + | ||
| 268 | +- 点击时执行当前 dataset 的 batch,而不是对服务启动时唯一 query_file 执行 | ||
| 269 | +- 按钮文案建议带上 dataset 名,例如:`Batch Evaluate: Clothing Filtered 771` | ||
| 270 | + | ||
| 271 | +5. 页面顶端增加当前 dataset 概览卡片 | ||
| 272 | + | ||
| 273 | +- `dataset_id` | ||
| 274 | +- query 数 | ||
| 275 | +- 已有标签 query 数 / 覆盖率 | ||
| 276 | +- 最近一次 batch 时间 | ||
| 277 | + | ||
| 278 | +这样进入页面时,用户始终知道自己正在看哪个评估集,不会把 53 条基线集和 771 条大集合的结果混在一起。 | ||
| 279 | + | ||
| 280 | +CLI / 启动脚本设计 | ||
| 281 | + | ||
| 282 | +需要把 `--dataset-id` 提升为第一入口参数: | ||
| 283 | + | ||
| 284 | +- `build_annotation_set.py build --dataset-id clothing_top771` | ||
| 285 | +- `build_annotation_set.py batch --dataset-id clothing_top771` | ||
| 286 | +- `build_annotation_set.py audit --dataset-id clothing_top771` | ||
| 287 | +- `serve_eval_web.py serve --dataset-id core_queries` | ||
| 288 | + | ||
| 289 | +说明: | ||
| 290 | + | ||
| 291 | +- `serve` 的 `--dataset-id` 只决定页面初始选中哪个 dataset,不应该再把整个服务绑定死到一个 query 文件。 | ||
| 292 | +- `--queries-file` 可以保留一段时间做兼容,但内部先解析 registry;如果能映射到某个 dataset,就统一转成 `dataset_id` 处理。 | ||
| 293 | + | ||
| 294 | +`start_eval.sh` / `start_eval_web.sh` 也要同步升级: | ||
| 295 | + | ||
| 296 | +- 读取 `REPO_EVAL_DATASET_ID` | ||
| 297 | +- 保留 `REPO_EVAL_QUERIES` 兼容模式,但新用法优先 `REPO_EVAL_DATASET_ID` | ||
| 298 | + | ||
| 299 | +额外要修正的一点: | ||
| 300 | + | ||
| 301 | +- 当前 `--reset-artifacts` 会删整个 SQLite 和整个 `query_builds/` | ||
| 302 | +- 多 dataset 后这个行为太危险 | ||
| 303 | +- 应拆成更明确的选项,例如: | ||
| 304 | + - `--reset-dataset-build-artifacts` | ||
| 305 | + - `--purge-shared-label-cache`(显式危险操作,默认不要碰) | ||
| 306 | + | ||
| 307 | +调参框架联动设计 | ||
| 308 | + | ||
| 309 | +`tune_fusion.py`、`start_coarse_fusion_tuning_long.sh`、`resume_coarse_fusion_tuning_long.sh` 也必须带 dataset 维度,否则之后同一套 coarse rank 参数可能分别在 53 条集和 771 条集上跑出完全不同的结论,但 leaderboard 会混在一起。 | ||
| 310 | + | ||
| 311 | +建议: | ||
| 312 | + | ||
| 313 | +- `tune_fusion.py` 增加 `--dataset-id` | ||
| 314 | +- `summary.json` / `leaderboard.csv` / `trials.jsonl` 记录 `dataset_id` | ||
| 315 | +- 调参时调用 eval-web batch API,也传 `dataset_id` | ||
| 316 | +- `seed-report` 如果来自历史 batch 报告,也校验 `dataset_id` 一致 | ||
| 317 | + | ||
| 318 | +迁移方案 | ||
| 319 | + | ||
| 320 | +建议采用兼容迁移,而不是硬切: | ||
| 321 | + | ||
| 322 | +1. 先在配置中注册两个 dataset | ||
| 323 | + | ||
| 324 | +- `core_queries` -> `scripts/evaluation/queries/queries.txt` | ||
| 325 | +- `clothing_top771` -> `scripts/evaluation/queries/all_keywords.txt.top1w.shuf.top1k.clothing_filtered` | ||
| 326 | + | ||
| 327 | +2. 旧历史记录回填 dataset 元信息 | ||
| 328 | + | ||
| 329 | +- 如果历史记录没有 `dataset_id`,且 query 列表 hash 与 `queries.txt` 一致,则回填为 `core_queries` | ||
| 330 | +- 无法确认的旧记录,标记为 `legacy_unknown` | ||
| 331 | + | ||
| 332 | +3. UI 默认只展示 registry 中 `enabled=true` 的 dataset | ||
| 333 | + | ||
| 334 | +4. 保留一段时间旧 CLI 参数,但 README、新脚本、新前端只文档化 dataset 模式 | ||
| 335 | + | ||
| 336 | +实施顺序 | ||
| 337 | + | ||
| 338 | +建议按下面顺序做,避免半途出现“后端支持了但前端看不出来”或者“前端能选但历史存不准”: | ||
| 339 | + | ||
| 340 | +1. 配置层:引入 dataset registry 与解析器 | ||
| 341 | +2. 公共帮助层:统一的 dataset resolve / snapshot / artifact path helper | ||
| 342 | +3. SQLite:`batch_runs` / `build_runs` 增加 dataset 元字段 | ||
| 343 | +4. Framework:`build` / `batch` / `audit` 全面改为 dataset-aware | ||
| 344 | +5. Web API:新增 `/api/datasets`,History 支持 dataset filter | ||
| 345 | +6. eval-web 前端:selector + dataset-scoped queries/history/batch | ||
| 346 | +7. 调参脚本:`--dataset-id` 全链路打通 | ||
| 347 | +8. README / issue / 运维脚本更新 | ||
| 348 | + | ||
| 349 | +这套设计的关键点 | ||
| 350 | + | ||
| 351 | +- “评估集”是显式主键,不再靠文件路径暗示 | ||
| 352 | +- “标签缓存”继续按 `(tenant_id, query, spu_id)` 共享复用 | ||
| 353 | +- “历史报告”按 dataset 严格隔离并带 snapshot | ||
| 354 | +- “UI 交互”始终围绕当前 dataset 上下文展开 | ||
| 355 | +- “调参结果”必须标记 dataset,防止不同集合上的指标被误比 | ||
| 356 | + | ||
| 357 | +结论 | ||
| 358 | + | ||
| 359 | +这件事的统一做法,不是给现有单评估集逻辑加几个 if/else,而是把 eval framework 从“单 query 文件模式”升级为“多 dataset registry 模式”。 | ||
| 360 | + | ||
| 361 | +如果按这套方案落地,后面新增第三个评估集时,应该只需要: | ||
| 362 | + | ||
| 363 | +1. 在 `config.yaml` 注册一个新 dataset | ||
| 364 | +2. 跑对应 build | ||
| 365 | +3. 在 UI 中选择它做 batch / 看 history | ||
| 366 | +4. 在调参脚本里指定 `--dataset-id` | ||
| 367 | + | ||
| 368 | +而不需要再次改数据模型和交互模型。 | ||
| 29 | 369 | ||
| 30 | 370 | ||
| 31 | 371 | ||
| @@ -166,4 +506,3 @@ Please filter out the queries from the following list that do not belong to the | @@ -166,4 +506,3 @@ Please filter out the queries from the following list that do not belong to the | ||
| 166 | '0.021', 'knn_bias': '0.0019', 'knn_exponent': '11.8477', 'knn_text_bias': '2.3125', 'knn_text_exponent': '1.1547', 'knn_image_bias': '0.9641', 'knn_image_exponent': '5.8671'} | 506 | '0.021', 'knn_bias': '0.0019', 'knn_exponent': '11.8477', 'knn_text_bias': '2.3125', 'knn_text_exponent': '1.1547', 'knn_image_bias': '0.9641', 'knn_image_exponent': '5.8671'} |
| 167 | 507 | ||
| 168 | 这一次因为外部原因(磁盘满)终止了,以上是最好的一组参数。 | 508 | 这一次因为外部原因(磁盘满)终止了,以上是最好的一组参数。 |
| 169 | - |
indexer/mapping_generator.py
| @@ -8,6 +8,7 @@ from typing import Dict, Any | @@ -8,6 +8,7 @@ from typing import Dict, Any | ||
| 8 | import json | 8 | import json |
| 9 | import logging | 9 | import logging |
| 10 | from pathlib import Path | 10 | from pathlib import Path |
| 11 | +import os | ||
| 11 | 12 | ||
| 12 | from config.loader import get_app_config | 13 | from config.loader import get_app_config |
| 13 | 14 | ||
| @@ -30,6 +31,21 @@ def get_tenant_index_name(tenant_id: str) -> str: | @@ -30,6 +31,21 @@ def get_tenant_index_name(tenant_id: str) -> str: | ||
| 30 | 其中 ES_INDEX_NAMESPACE 由 config.env_config.ES_INDEX_NAMESPACE 控制, | 31 | 其中 ES_INDEX_NAMESPACE 由 config.env_config.ES_INDEX_NAMESPACE 控制, |
| 31 | 用于区分 prod/uat/test 等不同运行环境。 | 32 | 用于区分 prod/uat/test 等不同运行环境。 |
| 32 | """ | 33 | """ |
| 34 | + # Temporary override hooks (non-official, for ops/debug): | ||
| 35 | + # - ES_INDEX_OVERRIDE_TENANT_<tenant_id>: absolute index name (without namespace auto-prefix) | ||
| 36 | + # - ES_INDEX_OVERRIDE: absolute index name OR format string supporting "{tenant_id}" | ||
| 37 | + # | ||
| 38 | + # Examples: | ||
| 39 | + # export ES_INDEX_OVERRIDE_TENANT_163="search_products_tenant_163_backup_20260415_1438" | ||
| 40 | + # export ES_INDEX_OVERRIDE="search_products_tenant_{tenant_id}_backup_20260415_1438" | ||
| 41 | + per_tenant_key = f"ES_INDEX_OVERRIDE_TENANT_{tenant_id}" | ||
| 42 | + if (v := os.environ.get(per_tenant_key)): | ||
| 43 | + return str(v) | ||
| 44 | + if (v := os.environ.get("ES_INDEX_OVERRIDE")): | ||
| 45 | + try: | ||
| 46 | + return str(v).format(tenant_id=tenant_id) | ||
| 47 | + except Exception: | ||
| 48 | + return str(v) | ||
| 33 | prefix = get_app_config().runtime.index_namespace or "" | 49 | prefix = get_app_config().runtime.index_namespace or "" |
| 34 | return f"{prefix}search_products_tenant_{tenant_id}" | 50 | return f"{prefix}search_products_tenant_{tenant_id}" |
| 35 | 51 |
scripts/evaluation/README.md
| @@ -2,11 +2,11 @@ | @@ -2,11 +2,11 @@ | ||
| 2 | 2 | ||
| 3 | This directory holds the offline annotation builder, the evaluation web UI/API, audit tooling, and the fusion-tuning runner for retrieval quality. | 3 | This directory holds the offline annotation builder, the evaluation web UI/API, audit tooling, and the fusion-tuning runner for retrieval quality. |
| 4 | 4 | ||
| 5 | -**Design:** Build labels offline for a fixed query set (`queries/queries.txt`). Single-query and batch evaluation map recalled `spu_id` values to the SQLite cache. Items without cached labels are scored as `Irrelevant`, and the UI/API surfaces tips when judged coverage is incomplete. Evaluation now uses a graded four-tier relevance system with a multi-metric primary scorecard instead of a single headline metric. | 5 | +**Design:** Build labels offline for one or more named evaluation datasets. Each dataset has a stable `dataset_id` backed by a query file registered in `config.yaml -> search_evaluation.datasets`. Single-query and batch evaluation map recalled `spu_id` values to the shared SQLite cache. Items without cached labels are scored as `Irrelevant`, and the UI/API surfaces tips when judged coverage is incomplete. Evaluation now uses a graded four-tier relevance system with a multi-metric primary scorecard instead of a single headline metric. |
| 6 | 6 | ||
| 7 | ## What it does | 7 | ## What it does |
| 8 | 8 | ||
| 9 | -1. Build an annotation set for a fixed query set. | 9 | +1. Build an annotation set for a named evaluation dataset. |
| 10 | 2. Evaluate live search results against cached labels. | 10 | 2. Evaluate live search results against cached labels. |
| 11 | 3. Run batch evaluation and keep historical reports with config snapshots. | 11 | 3. Run batch evaluation and keep historical reports with config snapshots. |
| 12 | 4. Tune fusion parameters in a reproducible loop. | 12 | 4. Tune fusion parameters in a reproducible loop. |
| @@ -21,19 +21,23 @@ This directory holds the offline annotation builder, the evaluation web UI/API, | @@ -21,19 +21,23 @@ This directory holds the offline annotation builder, the evaluation web UI/API, | ||
| 21 | | `tune_fusion.py` | Applies config variants, restarts backend, runs batch eval, stores experiment reports | | 21 | | `tune_fusion.py` | Applies config variants, restarts backend, runs batch eval, stores experiment reports | |
| 22 | | `fusion_experiments_shortlist.json` | Compact experiment set for tuning | | 22 | | `fusion_experiments_shortlist.json` | Compact experiment set for tuning | |
| 23 | | `fusion_experiments_round1.json` | Broader first-round experiments | | 23 | | `fusion_experiments_round1.json` | Broader first-round experiments | |
| 24 | -| `queries/queries.txt` | Canonical evaluation queries | | 24 | +| `queries/queries.txt` | Legacy core query set (`dataset_id=core_queries`) | |
| 25 | +| `queries/all_keywords.txt.top1w.shuf.top1k.clothing_filtered` | Expanded clothing dataset (`dataset_id=clothing_top771`) | | ||
| 25 | | `README_Requirement.md` | Product/requirements reference | | 26 | | `README_Requirement.md` | Product/requirements reference | |
| 26 | | `start_eval.sh` | Wrapper: `batch`, `batch-rebuild` (deep `build` + `--force-refresh-labels`), or `serve` | | 27 | | `start_eval.sh` | Wrapper: `batch`, `batch-rebuild` (deep `build` + `--force-refresh-labels`), or `serve` | |
| 27 | | `../start_eval_web.sh` | Same as `serve` with `activate.sh`; use `./scripts/service_ctl.sh start eval-web` (default port **6010**, override with `EVAL_WEB_PORT`). `./run.sh all` includes eval-web. | | 28 | | `../start_eval_web.sh` | Same as `serve` with `activate.sh`; use `./scripts/service_ctl.sh start eval-web` (default port **6010**, override with `EVAL_WEB_PORT`). `./run.sh all` includes eval-web. | |
| 28 | 29 | ||
| 29 | ## Quick start (repo root) | 30 | ## Quick start (repo root) |
| 30 | 31 | ||
| 31 | -Set tenant if needed (`export TENANT_ID=163`). You need a live search API, DashScope when new LLM labels are required, and a running backend. | 32 | +Set tenant if needed (`export TENANT_ID=163`). To switch datasets, export `REPO_EVAL_DATASET_ID` or pass `--dataset-id`. You need a live search API, DashScope when new LLM labels are required, and a running backend. |
| 32 | 33 | ||
| 33 | ```bash | 34 | ```bash |
| 34 | # Batch: live search for every query; only uncached (query, spu_id) pairs hit the LLM | 35 | # Batch: live search for every query; only uncached (query, spu_id) pairs hit the LLM |
| 35 | ./scripts/evaluation/start_eval.sh batch | 36 | ./scripts/evaluation/start_eval.sh batch |
| 36 | 37 | ||
| 38 | +# Switch to the 771-query clothing dataset | ||
| 39 | +REPO_EVAL_DATASET_ID=clothing_top771 ./scripts/evaluation/start_eval.sh batch | ||
| 40 | + | ||
| 37 | # Deep rebuild: per-query full corpus rerank (outside search recall pool) + LLM in batches along global sort order (early stop; expensive) | 41 | # Deep rebuild: per-query full corpus rerank (outside search recall pool) + LLM in batches along global sort order (early stop; expensive) |
| 38 | ./scripts/evaluation/start_eval.sh batch-rebuild | 42 | ./scripts/evaluation/start_eval.sh batch-rebuild |
| 39 | 43 | ||
| @@ -47,14 +51,14 @@ Explicit equivalents: | @@ -47,14 +51,14 @@ Explicit equivalents: | ||
| 47 | ```bash | 51 | ```bash |
| 48 | ./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \ | 52 | ./.venv/bin/python scripts/evaluation/build_annotation_set.py batch \ |
| 49 | --tenant-id "${TENANT_ID:-163}" \ | 53 | --tenant-id "${TENANT_ID:-163}" \ |
| 50 | - --queries-file scripts/evaluation/queries/queries.txt \ | 54 | + --dataset-id core_queries \ |
| 51 | --top-k 50 \ | 55 | --top-k 50 \ |
| 52 | --language en \ | 56 | --language en \ |
| 53 | --labeler-mode simple | 57 | --labeler-mode simple |
| 54 | 58 | ||
| 55 | ./.venv/bin/python scripts/evaluation/build_annotation_set.py build \ | 59 | ./.venv/bin/python scripts/evaluation/build_annotation_set.py build \ |
| 56 | --tenant-id "${TENANT_ID:-163}" \ | 60 | --tenant-id "${TENANT_ID:-163}" \ |
| 57 | - --queries-file scripts/evaluation/queries/queries.txt \ | 61 | + --dataset-id core_queries \ |
| 58 | --search-depth 500 \ | 62 | --search-depth 500 \ |
| 59 | --rerank-depth 10000 \ | 63 | --rerank-depth 10000 \ |
| 60 | --force-refresh-rerank \ | 64 | --force-refresh-rerank \ |
| @@ -64,7 +68,7 @@ Explicit equivalents: | @@ -64,7 +68,7 @@ Explicit equivalents: | ||
| 64 | 68 | ||
| 65 | ./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \ | 69 | ./.venv/bin/python scripts/evaluation/serve_eval_web.py serve \ |
| 66 | --tenant-id "${TENANT_ID:-163}" \ | 70 | --tenant-id "${TENANT_ID:-163}" \ |
| 67 | - --queries-file scripts/evaluation/queries/queries.txt \ | 71 | + --dataset-id core_queries \ |
| 68 | --host 127.0.0.1 \ | 72 | --host 127.0.0.1 \ |
| 69 | --port 6010 | 73 | --port 6010 |
| 70 | ``` | 74 | ``` |
| @@ -105,9 +109,9 @@ For **each** query in `queries.txt`, in order: | @@ -105,9 +109,9 @@ For **each** query in `queries.txt`, in order: | ||
| 105 | Default root: `artifacts/search_evaluation/` | 109 | Default root: `artifacts/search_evaluation/` |
| 106 | 110 | ||
| 107 | - `search_eval.sqlite3` — corpus cache, rerank scores, relevance labels, query profiles, build/batch run metadata | 111 | - `search_eval.sqlite3` — corpus cache, rerank scores, relevance labels, query profiles, build/batch run metadata |
| 108 | -- `query_builds/` — per-query pooled build outputs | ||
| 109 | -- `batch_reports/` — batch JSON, Markdown, config snapshots | ||
| 110 | -- `audits/` — label-quality audit summaries | 112 | +- `datasets/<dataset_id>/query_builds/` — per-query pooled build outputs |
| 113 | +- `datasets/<dataset_id>/batch_reports/<batch_id>/` — batch JSON, Markdown, config snapshot, dataset snapshot, query snapshot | ||
| 114 | +- `datasets/<dataset_id>/audits/` — label-quality audit summaries | ||
| 111 | - `tuning_runs/` — fusion experiment outputs and config snapshots | 115 | - `tuning_runs/` — fusion experiment outputs and config snapshots |
| 112 | 116 | ||
| 113 | ## Labels | 117 | ## Labels |
| @@ -168,7 +172,7 @@ The reported metrics are: | @@ -168,7 +172,7 @@ The reported metrics are: | ||
| 168 | 172 | ||
| 169 | ## Web UI | 173 | ## Web UI |
| 170 | 174 | ||
| 171 | -Features: query list from `queries.txt`, single-query and batch evaluation, batch report history, grouped graded-metric cards, top recalls, missed judged useful results, and coverage tips for unlabeled hits. | 175 | +Features: dataset selector, dataset-scoped query list, single-query and batch evaluation, dataset-scoped batch report history, grouped graded-metric cards, top recalls, missed judged useful results, and coverage tips for unlabeled hits. |
| 172 | 176 | ||
| 173 | ## Batch reports | 177 | ## Batch reports |
| 174 | 178 |
scripts/evaluation/eval_framework/__init__.py
| @@ -24,6 +24,7 @@ from .constants import ( # noqa: E402 | @@ -24,6 +24,7 @@ from .constants import ( # noqa: E402 | ||
| 24 | from .framework import SearchEvaluationFramework # noqa: E402 | 24 | from .framework import SearchEvaluationFramework # noqa: E402 |
| 25 | from .store import EvalStore, QueryBuildResult # noqa: E402 | 25 | from .store import EvalStore, QueryBuildResult # noqa: E402 |
| 26 | from .cli import build_cli_parser, main # noqa: E402 | 26 | from .cli import build_cli_parser, main # noqa: E402 |
| 27 | +from .datasets import EvalDatasetSnapshot, resolve_dataset # noqa: E402 | ||
| 27 | from .web_app import create_web_app # noqa: E402 | 28 | from .web_app import create_web_app # noqa: E402 |
| 28 | from .reports import render_batch_report_markdown # noqa: E402 | 29 | from .reports import render_batch_report_markdown # noqa: E402 |
| 29 | from .utils import ( # noqa: E402 | 30 | from .utils import ( # noqa: E402 |
| @@ -36,6 +37,7 @@ from .utils import ( # noqa: E402 | @@ -36,6 +37,7 @@ from .utils import ( # noqa: E402 | ||
| 36 | __all__ = [ | 37 | __all__ = [ |
| 37 | "DEFAULT_ARTIFACT_ROOT", | 38 | "DEFAULT_ARTIFACT_ROOT", |
| 38 | "DEFAULT_QUERY_FILE", | 39 | "DEFAULT_QUERY_FILE", |
| 40 | + "EvalDatasetSnapshot", | ||
| 39 | "EvalStore", | 41 | "EvalStore", |
| 40 | "PROJECT_ROOT", | 42 | "PROJECT_ROOT", |
| 41 | "QueryBuildResult", | 43 | "QueryBuildResult", |
| @@ -51,6 +53,7 @@ __all__ = [ | @@ -51,6 +53,7 @@ __all__ = [ | ||
| 51 | "ensure_dir", | 53 | "ensure_dir", |
| 52 | "main", | 54 | "main", |
| 53 | "render_batch_report_markdown", | 55 | "render_batch_report_markdown", |
| 56 | + "resolve_dataset", | ||
| 54 | "sha1_text", | 57 | "sha1_text", |
| 55 | "utc_now_iso", | 58 | "utc_now_iso", |
| 56 | "utc_timestamp", | 59 | "utc_timestamp", |
scripts/evaluation/eval_framework/api_models.py
| @@ -9,14 +9,16 @@ from pydantic import BaseModel, Field | @@ -9,14 +9,16 @@ from pydantic import BaseModel, Field | ||
| 9 | 9 | ||
| 10 | class SearchEvalRequest(BaseModel): | 10 | class SearchEvalRequest(BaseModel): |
| 11 | query: str | 11 | query: str |
| 12 | + dataset_id: Optional[str] = None | ||
| 12 | top_k: int = Field(default=100, ge=1, le=500) | 13 | top_k: int = Field(default=100, ge=1, le=500) |
| 13 | auto_annotate: bool = False | 14 | auto_annotate: bool = False |
| 14 | - language: str = "en" | 15 | + language: Optional[str] = None |
| 15 | 16 | ||
| 16 | 17 | ||
| 17 | class BatchEvalRequest(BaseModel): | 18 | class BatchEvalRequest(BaseModel): |
| 19 | + dataset_id: Optional[str] = None | ||
| 18 | queries: Optional[List[str]] = None | 20 | queries: Optional[List[str]] = None |
| 19 | top_k: int = Field(default=100, ge=1, le=500) | 21 | top_k: int = Field(default=100, ge=1, le=500) |
| 20 | auto_annotate: bool = False | 22 | auto_annotate: bool = False |
| 21 | - language: str = "en" | 23 | + language: Optional[str] = None |
| 22 | force_refresh_labels: bool = False | 24 | force_refresh_labels: bool = False |
scripts/evaluation/eval_framework/cli.py
| @@ -9,6 +9,9 @@ import shutil | @@ -9,6 +9,9 @@ import shutil | ||
| 9 | from pathlib import Path | 9 | from pathlib import Path |
| 10 | from typing import Any, Dict | 10 | from typing import Any, Dict |
| 11 | 11 | ||
| 12 | +from config.loader import get_app_config | ||
| 13 | + | ||
| 14 | +from .datasets import audits_dir, query_builds_dir, resolve_dataset | ||
| 12 | from .framework import SearchEvaluationFramework | 15 | from .framework import SearchEvaluationFramework |
| 13 | from .logging_setup import setup_eval_logging | 16 | from .logging_setup import setup_eval_logging |
| 14 | from .utils import ensure_dir, utc_now_iso, utc_timestamp | 17 | from .utils import ensure_dir, utc_now_iso, utc_timestamp |
| @@ -17,23 +20,21 @@ from .web_app import create_web_app | @@ -17,23 +20,21 @@ from .web_app import create_web_app | ||
| 17 | _cli_log = logging.getLogger("search_eval.cli") | 20 | _cli_log = logging.getLogger("search_eval.cli") |
| 18 | 21 | ||
| 19 | 22 | ||
| 20 | -def _reset_build_artifacts() -> None: | ||
| 21 | - from config.loader import get_app_config | ||
| 22 | - | 23 | +def _reset_build_artifacts(dataset_id: str) -> None: |
| 23 | artifact_root = get_app_config().search_evaluation.artifact_root | 24 | artifact_root = get_app_config().search_evaluation.artifact_root |
| 24 | removed = [] | 25 | removed = [] |
| 25 | - db_path = artifact_root / "search_eval.sqlite3" | ||
| 26 | - query_builds_dir = artifact_root / "query_builds" | ||
| 27 | - if db_path.exists(): | ||
| 28 | - db_path.unlink() | ||
| 29 | - removed.append(str(db_path)) | ||
| 30 | - if query_builds_dir.exists(): | ||
| 31 | - shutil.rmtree(query_builds_dir) | ||
| 32 | - removed.append(str(query_builds_dir)) | 26 | + dataset_query_builds = query_builds_dir(artifact_root, dataset_id) |
| 27 | + dataset_audits = audits_dir(artifact_root, dataset_id) | ||
| 28 | + if dataset_query_builds.exists(): | ||
| 29 | + shutil.rmtree(dataset_query_builds) | ||
| 30 | + removed.append(str(dataset_query_builds)) | ||
| 31 | + if dataset_audits.exists(): | ||
| 32 | + shutil.rmtree(dataset_audits) | ||
| 33 | + removed.append(str(dataset_audits)) | ||
| 33 | if removed: | 34 | if removed: |
| 34 | - _cli_log.info("[build] reset previous rebuild artifacts: %s", ", ".join(removed)) | 35 | + _cli_log.info("[build] reset dataset artifacts for %s: %s", dataset_id, ", ".join(removed)) |
| 35 | else: | 36 | else: |
| 36 | - _cli_log.info("[build] no previous rebuild artifacts to reset under %s", artifact_root) | 37 | + _cli_log.info("[build] no previous dataset artifacts to reset under %s for dataset=%s", artifact_root, dataset_id) |
| 37 | 38 | ||
| 38 | 39 | ||
| 39 | def add_judge_llm_args(p: argparse.ArgumentParser) -> None: | 40 | def add_judge_llm_args(p: argparse.ArgumentParser) -> None: |
| @@ -89,9 +90,9 @@ def framework_kwargs_from_args(args: argparse.Namespace) -> Dict[str, Any]: | @@ -89,9 +90,9 @@ def framework_kwargs_from_args(args: argparse.Namespace) -> Dict[str, Any]: | ||
| 89 | 90 | ||
| 90 | def _apply_search_evaluation_cli_defaults(args: argparse.Namespace) -> None: | 91 | def _apply_search_evaluation_cli_defaults(args: argparse.Namespace) -> None: |
| 91 | """Fill None CLI defaults from ``config.yaml`` ``search_evaluation`` (via ``get_app_config()``).""" | 92 | """Fill None CLI defaults from ``config.yaml`` ``search_evaluation`` (via ``get_app_config()``).""" |
| 92 | - from config.loader import get_app_config | ||
| 93 | - | ||
| 94 | se = get_app_config().search_evaluation | 93 | se = get_app_config().search_evaluation |
| 94 | + if getattr(args, "dataset_id", None) in (None, "") and getattr(args, "queries_file", None) in (None, ""): | ||
| 95 | + args.dataset_id = se.default_dataset_id | ||
| 95 | if getattr(args, "tenant_id", None) in (None, ""): | 96 | if getattr(args, "tenant_id", None) in (None, ""): |
| 96 | args.tenant_id = se.default_tenant_id | 97 | args.tenant_id = se.default_tenant_id |
| 97 | if getattr(args, "queries_file", None) in (None, ""): | 98 | if getattr(args, "queries_file", None) in (None, ""): |
| @@ -144,6 +145,23 @@ def _apply_search_evaluation_cli_defaults(args: argparse.Namespace) -> None: | @@ -144,6 +145,23 @@ def _apply_search_evaluation_cli_defaults(args: argparse.Namespace) -> None: | ||
| 144 | args.rebuild_irrelevant_stop_streak = se.rebuild_irrelevant_stop_streak | 145 | args.rebuild_irrelevant_stop_streak = se.rebuild_irrelevant_stop_streak |
| 145 | 146 | ||
| 146 | 147 | ||
| 148 | +def _resolve_dataset_from_args(args: argparse.Namespace, *, require_enabled: bool = False): | ||
| 149 | + queries_file = getattr(args, "queries_file", None) | ||
| 150 | + query_path = Path(str(queries_file)).resolve() if queries_file not in (None, "") else None | ||
| 151 | + dataset = resolve_dataset( | ||
| 152 | + dataset_id=getattr(args, "dataset_id", None), | ||
| 153 | + query_file=query_path, | ||
| 154 | + tenant_id=getattr(args, "tenant_id", None), | ||
| 155 | + language=getattr(args, "language", None), | ||
| 156 | + require_enabled=require_enabled, | ||
| 157 | + ) | ||
| 158 | + args.dataset_id = dataset.dataset_id | ||
| 159 | + args.queries_file = str(dataset.query_file) | ||
| 160 | + args.tenant_id = dataset.tenant_id | ||
| 161 | + args.language = dataset.language | ||
| 162 | + return dataset | ||
| 163 | + | ||
| 164 | + | ||
| 147 | def build_cli_parser() -> argparse.ArgumentParser: | 165 | def build_cli_parser() -> argparse.ArgumentParser: |
| 148 | parser = argparse.ArgumentParser(description="Search evaluation annotation builder and web UI") | 166 | parser = argparse.ArgumentParser(description="Search evaluation annotation builder and web UI") |
| 149 | sub = parser.add_subparsers(dest="command", required=True) | 167 | sub = parser.add_subparsers(dest="command", required=True) |
| @@ -154,10 +172,11 @@ def build_cli_parser() -> argparse.ArgumentParser: | @@ -154,10 +172,11 @@ def build_cli_parser() -> argparse.ArgumentParser: | ||
| 154 | default=None, | 172 | default=None, |
| 155 | help="Tenant id (default: search_evaluation.default_tenant_id in config.yaml).", | 173 | help="Tenant id (default: search_evaluation.default_tenant_id in config.yaml).", |
| 156 | ) | 174 | ) |
| 175 | + build.add_argument("--dataset-id", default=None, help="Named evaluation dataset id from config.yaml.") | ||
| 157 | build.add_argument( | 176 | build.add_argument( |
| 158 | "--queries-file", | 177 | "--queries-file", |
| 159 | default=None, | 178 | default=None, |
| 160 | - help="Query list file (default: search_evaluation.queries_file).", | 179 | + help="Legacy override for query list file. Prefer --dataset-id.", |
| 161 | ) | 180 | ) |
| 162 | build.add_argument( | 181 | build.add_argument( |
| 163 | "--search-depth", | 182 | "--search-depth", |
| @@ -230,7 +249,7 @@ def build_cli_parser() -> argparse.ArgumentParser: | @@ -230,7 +249,7 @@ def build_cli_parser() -> argparse.ArgumentParser: | ||
| 230 | build.add_argument( | 249 | build.add_argument( |
| 231 | "--reset-artifacts", | 250 | "--reset-artifacts", |
| 232 | action="store_true", | 251 | action="store_true", |
| 233 | - help="Delete rebuild cache/artifacts (SQLite + query_builds) before starting.", | 252 | + help="Delete dataset-specific query_builds/audits before starting. Shared SQLite cache is preserved.", |
| 234 | ) | 253 | ) |
| 235 | build.add_argument("--force-refresh-rerank", action="store_true") | 254 | build.add_argument("--force-refresh-rerank", action="store_true") |
| 236 | build.add_argument("--force-refresh-labels", action="store_true") | 255 | build.add_argument("--force-refresh-labels", action="store_true") |
| @@ -239,7 +258,8 @@ def build_cli_parser() -> argparse.ArgumentParser: | @@ -239,7 +258,8 @@ def build_cli_parser() -> argparse.ArgumentParser: | ||
| 239 | 258 | ||
| 240 | batch = sub.add_parser("batch", help="Run batch evaluation against live search") | 259 | batch = sub.add_parser("batch", help="Run batch evaluation against live search") |
| 241 | batch.add_argument("--tenant-id", default=None, help="Default: search_evaluation.default_tenant_id.") | 260 | batch.add_argument("--tenant-id", default=None, help="Default: search_evaluation.default_tenant_id.") |
| 242 | - batch.add_argument("--queries-file", default=None, help="Default: search_evaluation.queries_file.") | 261 | + batch.add_argument("--dataset-id", default=None, help="Named evaluation dataset id from config.yaml.") |
| 262 | + batch.add_argument("--queries-file", default=None, help="Legacy override for query list file. Prefer --dataset-id.") | ||
| 243 | batch.add_argument("--top-k", type=int, default=None, help="Default: search_evaluation.batch_top_k.") | 263 | batch.add_argument("--top-k", type=int, default=None, help="Default: search_evaluation.batch_top_k.") |
| 244 | batch.add_argument("--language", default=None, help="Default: search_evaluation.default_language.") | 264 | batch.add_argument("--language", default=None, help="Default: search_evaluation.default_language.") |
| 245 | batch.add_argument("--force-refresh-labels", action="store_true") | 265 | batch.add_argument("--force-refresh-labels", action="store_true") |
| @@ -248,7 +268,8 @@ def build_cli_parser() -> argparse.ArgumentParser: | @@ -248,7 +268,8 @@ def build_cli_parser() -> argparse.ArgumentParser: | ||
| 248 | 268 | ||
| 249 | audit = sub.add_parser("audit", help="Audit annotation quality for queries") | 269 | audit = sub.add_parser("audit", help="Audit annotation quality for queries") |
| 250 | audit.add_argument("--tenant-id", default=None, help="Default: search_evaluation.default_tenant_id.") | 270 | audit.add_argument("--tenant-id", default=None, help="Default: search_evaluation.default_tenant_id.") |
| 251 | - audit.add_argument("--queries-file", default=None, help="Default: search_evaluation.queries_file.") | 271 | + audit.add_argument("--dataset-id", default=None, help="Named evaluation dataset id from config.yaml.") |
| 272 | + audit.add_argument("--queries-file", default=None, help="Legacy override for query list file. Prefer --dataset-id.") | ||
| 252 | audit.add_argument("--top-k", type=int, default=None, help="Default: search_evaluation.audit_top_k.") | 273 | audit.add_argument("--top-k", type=int, default=None, help="Default: search_evaluation.audit_top_k.") |
| 253 | audit.add_argument("--language", default=None, help="Default: search_evaluation.default_language.") | 274 | audit.add_argument("--language", default=None, help="Default: search_evaluation.default_language.") |
| 254 | audit.add_argument( | 275 | audit.add_argument( |
| @@ -263,7 +284,8 @@ def build_cli_parser() -> argparse.ArgumentParser: | @@ -263,7 +284,8 @@ def build_cli_parser() -> argparse.ArgumentParser: | ||
| 263 | 284 | ||
| 264 | serve = sub.add_parser("serve", help="Serve evaluation web UI on port 6010") | 285 | serve = sub.add_parser("serve", help="Serve evaluation web UI on port 6010") |
| 265 | serve.add_argument("--tenant-id", default=None, help="Default: search_evaluation.default_tenant_id.") | 286 | serve.add_argument("--tenant-id", default=None, help="Default: search_evaluation.default_tenant_id.") |
| 266 | - serve.add_argument("--queries-file", default=None, help="Default: search_evaluation.queries_file.") | 287 | + serve.add_argument("--dataset-id", default=None, help="Initial evaluation dataset id from config.yaml.") |
| 288 | + serve.add_argument("--queries-file", default=None, help="Legacy initial query file override. Prefer --dataset-id.") | ||
| 267 | serve.add_argument("--host", default=None, help="Default: search_evaluation.web_host.") | 289 | serve.add_argument("--host", default=None, help="Default: search_evaluation.web_host.") |
| 268 | serve.add_argument("--port", type=int, default=None, help="Default: search_evaluation.web_port.") | 290 | serve.add_argument("--port", type=int, default=None, help="Default: search_evaluation.web_port.") |
| 269 | add_judge_llm_args(serve) | 291 | add_judge_llm_args(serve) |
| @@ -273,10 +295,11 @@ def build_cli_parser() -> argparse.ArgumentParser: | @@ -273,10 +295,11 @@ def build_cli_parser() -> argparse.ArgumentParser: | ||
| 273 | 295 | ||
| 274 | 296 | ||
| 275 | def run_build(args: argparse.Namespace) -> None: | 297 | def run_build(args: argparse.Namespace) -> None: |
| 298 | + dataset = _resolve_dataset_from_args(args) | ||
| 276 | if args.reset_artifacts: | 299 | if args.reset_artifacts: |
| 277 | - _reset_build_artifacts() | 300 | + _reset_build_artifacts(dataset.dataset_id) |
| 278 | framework = SearchEvaluationFramework(tenant_id=args.tenant_id, **framework_kwargs_from_args(args)) | 301 | framework = SearchEvaluationFramework(tenant_id=args.tenant_id, **framework_kwargs_from_args(args)) |
| 279 | - queries = framework.queries_from_file(Path(args.queries_file)) | 302 | + queries = list(dataset.queries) |
| 280 | summary = [] | 303 | summary = [] |
| 281 | rebuild_kwargs = {} | 304 | rebuild_kwargs = {} |
| 282 | if args.force_refresh_labels: | 305 | if args.force_refresh_labels: |
| @@ -297,6 +320,7 @@ def run_build(args: argparse.Namespace) -> None: | @@ -297,6 +320,7 @@ def run_build(args: argparse.Namespace) -> None: | ||
| 297 | try: | 320 | try: |
| 298 | result = framework.build_query_annotation_set( | 321 | result = framework.build_query_annotation_set( |
| 299 | query=query, | 322 | query=query, |
| 323 | + dataset=dataset, | ||
| 300 | search_depth=args.search_depth, | 324 | search_depth=args.search_depth, |
| 301 | rerank_depth=args.rerank_depth, | 325 | rerank_depth=args.rerank_depth, |
| 302 | annotate_search_top_k=args.annotate_search_top_k, | 326 | annotate_search_top_k=args.annotate_search_top_k, |
| @@ -329,17 +353,20 @@ def run_build(args: argparse.Namespace) -> None: | @@ -329,17 +353,20 @@ def run_build(args: argparse.Namespace) -> None: | ||
| 329 | result.output_json_path, | 353 | result.output_json_path, |
| 330 | ) | 354 | ) |
| 331 | out_path = ensure_dir(framework.artifact_root / "query_builds") / f"build_summary_{utc_timestamp()}.json" | 355 | out_path = ensure_dir(framework.artifact_root / "query_builds") / f"build_summary_{utc_timestamp()}.json" |
| 356 | + out_path = query_builds_dir(framework.artifact_root, dataset.dataset_id) / f"build_summary_{utc_timestamp()}.json" | ||
| 332 | out_path.write_text(json.dumps(summary, ensure_ascii=False, indent=2), encoding="utf-8") | 357 | out_path.write_text(json.dumps(summary, ensure_ascii=False, indent=2), encoding="utf-8") |
| 333 | _cli_log.info("[done] summary=%s", out_path) | 358 | _cli_log.info("[done] summary=%s", out_path) |
| 334 | 359 | ||
| 335 | 360 | ||
| 336 | def run_batch(args: argparse.Namespace) -> None: | 361 | def run_batch(args: argparse.Namespace) -> None: |
| 362 | + dataset = _resolve_dataset_from_args(args, require_enabled=True) | ||
| 337 | framework = SearchEvaluationFramework(tenant_id=args.tenant_id, **framework_kwargs_from_args(args)) | 363 | framework = SearchEvaluationFramework(tenant_id=args.tenant_id, **framework_kwargs_from_args(args)) |
| 338 | - queries = framework.queries_from_file(Path(args.queries_file)) | ||
| 339 | - _cli_log.info("[batch] queries_file=%s count=%s", args.queries_file, len(queries)) | 364 | + queries = list(dataset.queries) |
| 365 | + _cli_log.info("[batch] dataset_id=%s queries_file=%s count=%s", dataset.dataset_id, args.queries_file, len(queries)) | ||
| 340 | try: | 366 | try: |
| 341 | payload = framework.batch_evaluate( | 367 | payload = framework.batch_evaluate( |
| 342 | queries=queries, | 368 | queries=queries, |
| 369 | + dataset=dataset, | ||
| 343 | top_k=args.top_k, | 370 | top_k=args.top_k, |
| 344 | auto_annotate=True, | 371 | auto_annotate=True, |
| 345 | language=args.language, | 372 | language=args.language, |
| @@ -352,8 +379,9 @@ def run_batch(args: argparse.Namespace) -> None: | @@ -352,8 +379,9 @@ def run_batch(args: argparse.Namespace) -> None: | ||
| 352 | 379 | ||
| 353 | 380 | ||
| 354 | def run_audit(args: argparse.Namespace) -> None: | 381 | def run_audit(args: argparse.Namespace) -> None: |
| 382 | + dataset = _resolve_dataset_from_args(args, require_enabled=True) | ||
| 355 | framework = SearchEvaluationFramework(tenant_id=args.tenant_id, **framework_kwargs_from_args(args)) | 383 | framework = SearchEvaluationFramework(tenant_id=args.tenant_id, **framework_kwargs_from_args(args)) |
| 356 | - queries = framework.queries_from_file(Path(args.queries_file)) | 384 | + queries = list(dataset.queries) |
| 357 | audit_items = [] | 385 | audit_items = [] |
| 358 | for query in queries: | 386 | for query in queries: |
| 359 | item = framework.audit_live_query( | 387 | item = framework.audit_live_query( |
| @@ -394,27 +422,27 @@ def run_audit(args: argparse.Namespace) -> None: | @@ -394,27 +422,27 @@ def run_audit(args: argparse.Namespace) -> None: | ||
| 394 | summary = { | 422 | summary = { |
| 395 | "created_at": utc_now_iso(), | 423 | "created_at": utc_now_iso(), |
| 396 | "tenant_id": args.tenant_id, | 424 | "tenant_id": args.tenant_id, |
| 425 | + "dataset": dataset.summary(), | ||
| 397 | "top_k": args.top_k, | 426 | "top_k": args.top_k, |
| 398 | "query_count": len(queries), | 427 | "query_count": len(queries), |
| 399 | "total_suspicious": sum(item["suspicious_count"] for item in audit_items), | 428 | "total_suspicious": sum(item["suspicious_count"] for item in audit_items), |
| 400 | "queries": audit_items, | 429 | "queries": audit_items, |
| 401 | } | 430 | } |
| 402 | - out_path = ensure_dir(framework.artifact_root / "audits") / f"audit_{utc_timestamp()}.json" | 431 | + out_path = audits_dir(framework.artifact_root, dataset.dataset_id) / f"audit_{utc_timestamp()}.json" |
| 403 | out_path.write_text(json.dumps(summary, ensure_ascii=False, indent=2), encoding="utf-8") | 432 | out_path.write_text(json.dumps(summary, ensure_ascii=False, indent=2), encoding="utf-8") |
| 404 | _cli_log.info("[done] audit=%s", out_path) | 433 | _cli_log.info("[done] audit=%s", out_path) |
| 405 | 434 | ||
| 406 | 435 | ||
| 407 | def run_serve(args: argparse.Namespace) -> None: | 436 | def run_serve(args: argparse.Namespace) -> None: |
| 437 | + dataset = _resolve_dataset_from_args(args, require_enabled=True) | ||
| 408 | framework = SearchEvaluationFramework(tenant_id=args.tenant_id, **framework_kwargs_from_args(args)) | 438 | framework = SearchEvaluationFramework(tenant_id=args.tenant_id, **framework_kwargs_from_args(args)) |
| 409 | - app = create_web_app(framework, Path(args.queries_file)) | 439 | + app = create_web_app(framework, initial_dataset_id=dataset.dataset_id) |
| 410 | import uvicorn | 440 | import uvicorn |
| 411 | 441 | ||
| 412 | uvicorn.run(app, host=args.host, port=args.port, log_level="info") | 442 | uvicorn.run(app, host=args.host, port=args.port, log_level="info") |
| 413 | 443 | ||
| 414 | 444 | ||
| 415 | def main() -> None: | 445 | def main() -> None: |
| 416 | - from config.loader import get_app_config | ||
| 417 | - | ||
| 418 | se = get_app_config().search_evaluation | 446 | se = get_app_config().search_evaluation |
| 419 | log_file = setup_eval_logging(se.eval_log_dir) | 447 | log_file = setup_eval_logging(se.eval_log_dir) |
| 420 | parser = build_cli_parser() | 448 | parser = build_cli_parser() |
| @@ -0,0 +1,165 @@ | @@ -0,0 +1,165 @@ | ||
| 1 | +"""Evaluation dataset registry helpers and artifact path conventions.""" | ||
| 2 | + | ||
| 3 | +from __future__ import annotations | ||
| 4 | + | ||
| 5 | +from dataclasses import dataclass | ||
| 6 | +from pathlib import Path | ||
| 7 | +from typing import Any, Dict, Iterable, List, Optional, Sequence | ||
| 8 | + | ||
| 9 | +from config.loader import get_app_config | ||
| 10 | +from config.schema import SearchEvaluationDatasetConfig | ||
| 11 | + | ||
| 12 | +from .utils import ensure_dir, sha1_text | ||
| 13 | + | ||
| 14 | + | ||
| 15 | +@dataclass(frozen=True) | ||
| 16 | +class EvalDatasetSnapshot: | ||
| 17 | + """Resolved dataset metadata for one evaluation run.""" | ||
| 18 | + | ||
| 19 | + dataset_id: str | ||
| 20 | + display_name: str | ||
| 21 | + description: str | ||
| 22 | + query_file: Path | ||
| 23 | + tenant_id: str | ||
| 24 | + language: str | ||
| 25 | + enabled: bool | ||
| 26 | + queries: tuple[str, ...] | ||
| 27 | + query_count: int | ||
| 28 | + query_sha1: str | ||
| 29 | + source: str = "registry" | ||
| 30 | + | ||
| 31 | + def summary(self) -> Dict[str, Any]: | ||
| 32 | + return { | ||
| 33 | + "dataset_id": self.dataset_id, | ||
| 34 | + "display_name": self.display_name, | ||
| 35 | + "description": self.description, | ||
| 36 | + "query_file": str(self.query_file), | ||
| 37 | + "tenant_id": self.tenant_id, | ||
| 38 | + "language": self.language, | ||
| 39 | + "enabled": self.enabled, | ||
| 40 | + "query_count": self.query_count, | ||
| 41 | + "query_sha1": self.query_sha1, | ||
| 42 | + "source": self.source, | ||
| 43 | + } | ||
| 44 | + | ||
| 45 | + | ||
| 46 | +def read_queries_file(path: Path) -> List[str]: | ||
| 47 | + return [ | ||
| 48 | + line.strip() | ||
| 49 | + for line in path.read_text(encoding="utf-8").splitlines() | ||
| 50 | + if line.strip() and not line.strip().startswith("#") | ||
| 51 | + ] | ||
| 52 | + | ||
| 53 | + | ||
| 54 | +def query_sha1(queries: Sequence[str]) -> str: | ||
| 55 | + return sha1_text("\n".join(str(item).strip() for item in queries if str(item).strip())) | ||
| 56 | + | ||
| 57 | + | ||
| 58 | +def _enabled_datasets(datasets: Iterable[SearchEvaluationDatasetConfig]) -> List[SearchEvaluationDatasetConfig]: | ||
| 59 | + return [item for item in datasets if item.enabled] | ||
| 60 | + | ||
| 61 | + | ||
| 62 | +def list_registered_datasets(enabled_only: bool = False) -> List[SearchEvaluationDatasetConfig]: | ||
| 63 | + se = get_app_config().search_evaluation | ||
| 64 | + datasets = list(se.datasets) | ||
| 65 | + return _enabled_datasets(datasets) if enabled_only else datasets | ||
| 66 | + | ||
| 67 | + | ||
| 68 | +def resolve_registered_dataset(dataset_id: str) -> SearchEvaluationDatasetConfig: | ||
| 69 | + for item in list_registered_datasets(enabled_only=False): | ||
| 70 | + if item.dataset_id == dataset_id: | ||
| 71 | + return item | ||
| 72 | + raise KeyError(f"unknown evaluation dataset: {dataset_id}") | ||
| 73 | + | ||
| 74 | + | ||
def resolve_dataset(
    *,
    dataset_id: Optional[str] = None,
    query_file: Optional[Path] = None,
    tenant_id: Optional[str] = None,
    language: Optional[str] = None,
    require_enabled: bool = False,
) -> EvalDatasetSnapshot:
    """Resolve an evaluation dataset into a frozen :class:`EvalDatasetSnapshot`.

    Selection precedence:
      1. Explicit ``dataset_id`` → registry lookup (raises KeyError if unknown).
      2. Explicit ``query_file`` → matched against registered datasets by
         resolved path; if it matches, the registered entry is used.
      3. Neither given → the configured ``default_dataset_id``.
    If ``query_file`` was given but matched no registered dataset, an ad-hoc
    snapshot is synthesized from the file (``source="adhoc"``).

    ``tenant_id`` / ``language`` override the dataset's own values, which in
    turn fall back to the search_evaluation defaults. Queries are re-read from
    disk on every call, so the snapshot reflects the file's current content.

    Raises:
        ValueError: If ``require_enabled`` is set and the selected registered
            dataset is disabled (the ad-hoc path is always treated as enabled).
        KeyError: From :func:`resolve_registered_dataset` for an unknown id.
    """
    se = get_app_config().search_evaluation
    registered = list_registered_datasets(enabled_only=False)
    selected: Optional[SearchEvaluationDatasetConfig] = None

    if dataset_id:
        selected = resolve_registered_dataset(dataset_id)
    elif query_file is not None:
        # Match by fully resolved path so relative/absolute spellings of the
        # same file map to the same registered dataset.
        normalized = query_file.resolve()
        for item in registered:
            if item.query_file.resolve() == normalized:
                selected = item
                break
    else:
        selected = resolve_registered_dataset(se.default_dataset_id)

    if selected is None:
        # Ad-hoc fallback: query_file was given but is not registered.
        path = (query_file or se.queries_file).resolve()
        queries = tuple(read_queries_file(path))
        # NOTE(review): dataset_id is always None on this branch (a truthy
        # dataset_id sets `selected` above), so the derived adhoc id applies.
        derived_id = dataset_id or f"adhoc_{sha1_text(str(path))[:12]}"
        effective_tenant = str(tenant_id or se.default_tenant_id)
        effective_language = str(language or se.default_language)
        return EvalDatasetSnapshot(
            dataset_id=derived_id,
            display_name=path.name,
            description="Ad-hoc evaluation dataset from explicit query file",
            query_file=path,
            tenant_id=effective_tenant,
            language=effective_language,
            enabled=True,
            queries=queries,
            query_count=len(queries),
            query_sha1=query_sha1(queries),
            source="adhoc",
        )

    if require_enabled and not selected.enabled:
        raise ValueError(f"evaluation dataset is disabled: {selected.dataset_id}")

    # Override chain: explicit arg > dataset config > global defaults.
    effective_tenant = str(tenant_id or selected.tenant_id or se.default_tenant_id)
    effective_language = str(language or selected.language or se.default_language)
    queries = tuple(read_queries_file(selected.query_file))
    return EvalDatasetSnapshot(
        dataset_id=selected.dataset_id,
        display_name=selected.display_name,
        description=selected.description,
        query_file=selected.query_file.resolve(),
        tenant_id=effective_tenant,
        language=effective_language,
        enabled=selected.enabled,
        queries=queries,
        query_count=len(queries),
        query_sha1=query_sha1(queries),
        source="registry",
    )
| 137 | + | ||
| 138 | + | ||
def infer_dataset_id_from_queries(queries: Sequence[str]) -> Optional[str]:
    """Return the registered dataset id whose query list matches *queries*.

    Matching is by SHA1 of the query list, so content and ordering must be
    identical to the registered dataset's query file.

    Args:
        queries: The query list to fingerprint (e.g. from a legacy artifact
            that predates dataset ids).

    Returns:
        The matching dataset id, or ``None`` when no registered dataset has
        the same fingerprint.
    """
    target_sha = query_sha1(queries)
    for item in list_registered_datasets(enabled_only=False):
        # Hash the dataset's query file directly instead of building a full
        # EvalDatasetSnapshot per candidate: resolve_dataset() re-reads the
        # app config and resolves tenant/language on every call, none of
        # which affects the fingerprint being compared here.
        if query_sha1(read_queries_file(item.query_file)) == target_sha:
            return item.dataset_id
    return None
| 146 | + | ||
| 147 | + | ||
def artifact_dataset_root(artifact_root: Path, dataset_id: str) -> Path:
    """Return (creating if needed) the per-dataset artifact directory."""
    dataset_root = artifact_root / "datasets" / dataset_id
    return ensure_dir(dataset_root)
| 150 | + | ||
| 151 | + | ||
def query_builds_dir(artifact_root: Path, dataset_id: str) -> Path:
    """Return (creating if needed) the dataset's query_builds directory."""
    builds = artifact_dataset_root(artifact_root, dataset_id) / "query_builds"
    return ensure_dir(builds)
| 154 | + | ||
| 155 | + | ||
def batch_reports_root(artifact_root: Path, dataset_id: str) -> Path:
    """Return (creating if needed) the dataset's batch_reports directory."""
    reports = artifact_dataset_root(artifact_root, dataset_id) / "batch_reports"
    return ensure_dir(reports)
| 158 | + | ||
| 159 | + | ||
def batch_report_run_dir(artifact_root: Path, dataset_id: str, batch_id: str) -> Path:
    """Return (creating if needed) the report directory for one batch run."""
    run_dir = batch_reports_root(artifact_root, dataset_id) / batch_id
    return ensure_dir(run_dir)
| 162 | + | ||
| 163 | + | ||
def audits_dir(artifact_root: Path, dataset_id: str) -> Path:
    """Return (creating if needed) the dataset's audits directory."""
    audits = artifact_dataset_root(artifact_root, dataset_id) / "audits"
    return ensure_dir(audits)
scripts/evaluation/eval_framework/framework.py
| @@ -34,6 +34,7 @@ from .constants import ( | @@ -34,6 +34,7 @@ from .constants import ( | ||
| 34 | VALID_LABELS, | 34 | VALID_LABELS, |
| 35 | STOP_PROB_MAP, | 35 | STOP_PROB_MAP, |
| 36 | ) | 36 | ) |
| 37 | +from .datasets import EvalDatasetSnapshot, batch_report_run_dir, query_builds_dir | ||
| 37 | from .metrics import ( | 38 | from .metrics import ( |
| 38 | PRIMARY_METRIC_GRADE_NORMALIZER, | 39 | PRIMARY_METRIC_GRADE_NORMALIZER, |
| 39 | PRIMARY_METRIC_KEYS, | 40 | PRIMARY_METRIC_KEYS, |
| @@ -541,6 +542,7 @@ class SearchEvaluationFramework: | @@ -541,6 +542,7 @@ class SearchEvaluationFramework: | ||
| 541 | self, | 542 | self, |
| 542 | query: str, | 543 | query: str, |
| 543 | *, | 544 | *, |
| 545 | + dataset: EvalDatasetSnapshot | None = None, | ||
| 544 | search_depth: int = 1000, | 546 | search_depth: int = 1000, |
| 545 | rerank_depth: int = 10000, | 547 | rerank_depth: int = 10000, |
| 546 | annotate_search_top_k: int = 120, | 548 | annotate_search_top_k: int = 120, |
| @@ -571,6 +573,7 @@ class SearchEvaluationFramework: | @@ -571,6 +573,7 @@ class SearchEvaluationFramework: | ||
| 571 | if force_refresh_labels: | 573 | if force_refresh_labels: |
| 572 | return self._build_query_annotation_set_rebuild( | 574 | return self._build_query_annotation_set_rebuild( |
| 573 | query=query, | 575 | query=query, |
| 576 | + dataset=dataset, | ||
| 574 | search_depth=search_depth, | 577 | search_depth=search_depth, |
| 575 | rerank_depth=rerank_depth, | 578 | rerank_depth=rerank_depth, |
| 576 | language=language, | 579 | language=language, |
| @@ -647,13 +650,16 @@ class SearchEvaluationFramework: | @@ -647,13 +650,16 @@ class SearchEvaluationFramework: | ||
| 647 | for item in search_labeled_results[:100] | 650 | for item in search_labeled_results[:100] |
| 648 | ] | 651 | ] |
| 649 | metrics = compute_query_metrics(top100_labels, ideal_labels=list(labels.values())) | 652 | metrics = compute_query_metrics(top100_labels, ideal_labels=list(labels.values())) |
| 650 | - output_dir = ensure_dir(self.artifact_root / "query_builds") | 653 | + output_dir = query_builds_dir(self.artifact_root, dataset.dataset_id) if dataset else ensure_dir( |
| 654 | + self.artifact_root / "query_builds" | ||
| 655 | + ) | ||
| 651 | run_id = f"{utc_timestamp()}_{sha1_text(self.tenant_id + '|' + query)[:10]}" | 656 | run_id = f"{utc_timestamp()}_{sha1_text(self.tenant_id + '|' + query)[:10]}" |
| 652 | output_json_path = output_dir / f"{run_id}.json" | 657 | output_json_path = output_dir / f"{run_id}.json" |
| 653 | payload = { | 658 | payload = { |
| 654 | "run_id": run_id, | 659 | "run_id": run_id, |
| 655 | "created_at": utc_now_iso(), | 660 | "created_at": utc_now_iso(), |
| 656 | "tenant_id": self.tenant_id, | 661 | "tenant_id": self.tenant_id, |
| 662 | + "dataset": dataset.summary() if dataset else None, | ||
| 657 | "query": query, | 663 | "query": query, |
| 658 | "config_meta": self.search_client.get_json("/admin/config/meta", timeout=20), | 664 | "config_meta": self.search_client.get_json("/admin/config/meta", timeout=20), |
| 659 | "search_total": int(search_payload.get("total") or 0), | 665 | "search_total": int(search_payload.get("total") or 0), |
| @@ -673,7 +679,14 @@ class SearchEvaluationFramework: | @@ -673,7 +679,14 @@ class SearchEvaluationFramework: | ||
| 673 | "full_rerank_top": rerank_top_results, | 679 | "full_rerank_top": rerank_top_results, |
| 674 | } | 680 | } |
| 675 | output_json_path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8") | 681 | output_json_path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8") |
| 676 | - self.store.insert_build_run(run_id, self.tenant_id, query, output_json_path, payload["metrics_top100"]) | 682 | + self.store.insert_build_run( |
| 683 | + run_id, | ||
| 684 | + self.tenant_id, | ||
| 685 | + query, | ||
| 686 | + output_json_path, | ||
| 687 | + payload, | ||
| 688 | + dataset=dataset, | ||
| 689 | + ) | ||
| 677 | return QueryBuildResult( | 690 | return QueryBuildResult( |
| 678 | query=query, | 691 | query=query, |
| 679 | tenant_id=self.tenant_id, | 692 | tenant_id=self.tenant_id, |
| @@ -688,6 +701,7 @@ class SearchEvaluationFramework: | @@ -688,6 +701,7 @@ class SearchEvaluationFramework: | ||
| 688 | self, | 701 | self, |
| 689 | query: str, | 702 | query: str, |
| 690 | *, | 703 | *, |
| 704 | + dataset: EvalDatasetSnapshot | None, | ||
| 691 | search_depth: int, | 705 | search_depth: int, |
| 692 | rerank_depth: int, | 706 | rerank_depth: int, |
| 693 | language: str, | 707 | language: str, |
| @@ -857,7 +871,9 @@ class SearchEvaluationFramework: | @@ -857,7 +871,9 @@ class SearchEvaluationFramework: | ||
| 857 | for item in search_labeled_results[:100] | 871 | for item in search_labeled_results[:100] |
| 858 | ] | 872 | ] |
| 859 | metrics = compute_query_metrics(top100_labels, ideal_labels=list(labels.values())) | 873 | metrics = compute_query_metrics(top100_labels, ideal_labels=list(labels.values())) |
| 860 | - output_dir = ensure_dir(self.artifact_root / "query_builds") | 874 | + output_dir = query_builds_dir(self.artifact_root, dataset.dataset_id) if dataset else ensure_dir( |
| 875 | + self.artifact_root / "query_builds" | ||
| 876 | + ) | ||
| 861 | run_id = f"{utc_timestamp()}_{sha1_text(self.tenant_id + '|' + query)[:10]}" | 877 | run_id = f"{utc_timestamp()}_{sha1_text(self.tenant_id + '|' + query)[:10]}" |
| 862 | output_json_path = output_dir / f"{run_id}.json" | 878 | output_json_path = output_dir / f"{run_id}.json" |
| 863 | pool_docs_count = len(pool_spu_ids) + len(ranked_outside) | 879 | pool_docs_count = len(pool_spu_ids) + len(ranked_outside) |
| @@ -865,6 +881,7 @@ class SearchEvaluationFramework: | @@ -865,6 +881,7 @@ class SearchEvaluationFramework: | ||
| 865 | "run_id": run_id, | 881 | "run_id": run_id, |
| 866 | "created_at": utc_now_iso(), | 882 | "created_at": utc_now_iso(), |
| 867 | "tenant_id": self.tenant_id, | 883 | "tenant_id": self.tenant_id, |
| 884 | + "dataset": dataset.summary() if dataset else None, | ||
| 868 | "query": query, | 885 | "query": query, |
| 869 | "config_meta": self.search_client.get_json("/admin/config/meta", timeout=20), | 886 | "config_meta": self.search_client.get_json("/admin/config/meta", timeout=20), |
| 870 | "search_total": int(search_payload.get("total") or 0), | 887 | "search_total": int(search_payload.get("total") or 0), |
| @@ -883,7 +900,14 @@ class SearchEvaluationFramework: | @@ -883,7 +900,14 @@ class SearchEvaluationFramework: | ||
| 883 | "full_rerank_top": rerank_top_results, | 900 | "full_rerank_top": rerank_top_results, |
| 884 | } | 901 | } |
| 885 | output_json_path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8") | 902 | output_json_path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8") |
| 886 | - self.store.insert_build_run(run_id, self.tenant_id, query, output_json_path, payload["metrics_top100"]) | 903 | + self.store.insert_build_run( |
| 904 | + run_id, | ||
| 905 | + self.tenant_id, | ||
| 906 | + query, | ||
| 907 | + output_json_path, | ||
| 908 | + payload, | ||
| 909 | + dataset=dataset, | ||
| 910 | + ) | ||
| 887 | return QueryBuildResult( | 911 | return QueryBuildResult( |
| 888 | query=query, | 912 | query=query, |
| 889 | tenant_id=self.tenant_id, | 913 | tenant_id=self.tenant_id, |
| @@ -901,6 +925,7 @@ class SearchEvaluationFramework: | @@ -901,6 +925,7 @@ class SearchEvaluationFramework: | ||
| 901 | auto_annotate: bool = False, | 925 | auto_annotate: bool = False, |
| 902 | language: str = "en", | 926 | language: str = "en", |
| 903 | force_refresh_labels: bool = False, | 927 | force_refresh_labels: bool = False, |
| 928 | + dataset: EvalDatasetSnapshot | None = None, | ||
| 904 | ) -> Dict[str, Any]: | 929 | ) -> Dict[str, Any]: |
| 905 | search_payload = self.search_client.search( | 930 | search_payload = self.search_client.search( |
| 906 | query=query, size=max(top_k, 100), from_=0, language=language, debug=True | 931 | query=query, size=max(top_k, 100), from_=0, language=language, debug=True |
| @@ -997,6 +1022,7 @@ class SearchEvaluationFramework: | @@ -997,6 +1022,7 @@ class SearchEvaluationFramework: | ||
| 997 | return { | 1022 | return { |
| 998 | "query": query, | 1023 | "query": query, |
| 999 | "tenant_id": self.tenant_id, | 1024 | "tenant_id": self.tenant_id, |
| 1025 | + "dataset": dataset.summary() if dataset else None, | ||
| 1000 | "top_k": top_k, | 1026 | "top_k": top_k, |
| 1001 | "metrics": compute_query_metrics(metric_labels, ideal_labels=ideal_labels), | 1027 | "metrics": compute_query_metrics(metric_labels, ideal_labels=ideal_labels), |
| 1002 | "metric_context": _metric_context_payload(), | 1028 | "metric_context": _metric_context_payload(), |
| @@ -1020,6 +1046,7 @@ class SearchEvaluationFramework: | @@ -1020,6 +1046,7 @@ class SearchEvaluationFramework: | ||
| 1020 | self, | 1046 | self, |
| 1021 | queries: Sequence[str], | 1047 | queries: Sequence[str], |
| 1022 | *, | 1048 | *, |
| 1049 | + dataset: EvalDatasetSnapshot | None = None, | ||
| 1023 | top_k: int = 100, | 1050 | top_k: int = 100, |
| 1024 | auto_annotate: bool = True, | 1051 | auto_annotate: bool = True, |
| 1025 | language: str = "en", | 1052 | language: str = "en", |
| @@ -1036,6 +1063,7 @@ class SearchEvaluationFramework: | @@ -1036,6 +1063,7 @@ class SearchEvaluationFramework: | ||
| 1036 | auto_annotate=auto_annotate, | 1063 | auto_annotate=auto_annotate, |
| 1037 | language=language, | 1064 | language=language, |
| 1038 | force_refresh_labels=force_refresh_labels, | 1065 | force_refresh_labels=force_refresh_labels, |
| 1066 | + dataset=dataset, | ||
| 1039 | ) | 1067 | ) |
| 1040 | labels = [ | 1068 | labels = [ |
| 1041 | item["label"] if item["label"] in VALID_LABELS else RELEVANCE_LV0 | 1069 | item["label"] if item["label"] in VALID_LABELS else RELEVANCE_LV0 |
| @@ -1088,17 +1116,31 @@ class SearchEvaluationFramework: | @@ -1088,17 +1116,31 @@ class SearchEvaluationFramework: | ||
| 1088 | RELEVANCE_LV1: sum(item["distribution"][RELEVANCE_LV1] for item in per_query), | 1116 | RELEVANCE_LV1: sum(item["distribution"][RELEVANCE_LV1] for item in per_query), |
| 1089 | RELEVANCE_LV0: sum(item["distribution"][RELEVANCE_LV0] for item in per_query), | 1117 | RELEVANCE_LV0: sum(item["distribution"][RELEVANCE_LV0] for item in per_query), |
| 1090 | } | 1118 | } |
| 1091 | - batch_id = f"batch_{utc_timestamp()}_{sha1_text(self.tenant_id + '|' + '|'.join(queries))[:10]}" | ||
| 1092 | - report_dir = ensure_dir(self.artifact_root / "batch_reports") | ||
| 1093 | - config_snapshot_path = report_dir / f"{batch_id}_config.json" | 1119 | + dataset_id = dataset.dataset_id if dataset else "legacy_default" |
| 1120 | + dataset_hash = dataset.query_sha1 if dataset else sha1_text("|".join(queries)) | ||
| 1121 | + batch_id = f"batch_{utc_timestamp()}_{sha1_text(self.tenant_id + '|' + dataset_id + '|' + dataset_hash)[:10]}" | ||
| 1122 | + report_dir = batch_report_run_dir(self.artifact_root, dataset_id, batch_id) if dataset else ensure_dir( | ||
| 1123 | + self.artifact_root / "batch_reports" | ||
| 1124 | + ) | ||
| 1125 | + config_snapshot_path = report_dir / "config_snapshot.json" if dataset else report_dir / f"{batch_id}_config.json" | ||
| 1094 | config_snapshot = self.search_client.get_json("/admin/config", timeout=20) | 1126 | config_snapshot = self.search_client.get_json("/admin/config", timeout=20) |
| 1095 | config_snapshot_path.write_text(json.dumps(config_snapshot, ensure_ascii=False, indent=2), encoding="utf-8") | 1127 | config_snapshot_path.write_text(json.dumps(config_snapshot, ensure_ascii=False, indent=2), encoding="utf-8") |
| 1096 | - output_json_path = report_dir / f"{batch_id}.json" | ||
| 1097 | - report_md_path = report_dir / f"{batch_id}.md" | 1128 | + dataset_snapshot_path = report_dir / "dataset_snapshot.json" if dataset else None |
| 1129 | + queries_snapshot_path = report_dir / "queries.txt" if dataset else None | ||
| 1130 | + if dataset_snapshot_path is not None: | ||
| 1131 | + dataset_snapshot_path.write_text( | ||
| 1132 | + json.dumps(dataset.summary(), ensure_ascii=False, indent=2), | ||
| 1133 | + encoding="utf-8", | ||
| 1134 | + ) | ||
| 1135 | + if queries_snapshot_path is not None: | ||
| 1136 | + queries_snapshot_path.write_text("\n".join(queries) + "\n", encoding="utf-8") | ||
| 1137 | + output_json_path = report_dir / "report.json" if dataset else report_dir / f"{batch_id}.json" | ||
| 1138 | + report_md_path = report_dir / "report.md" if dataset else report_dir / f"{batch_id}.md" | ||
| 1098 | payload = { | 1139 | payload = { |
| 1099 | "batch_id": batch_id, | 1140 | "batch_id": batch_id, |
| 1100 | "created_at": utc_now_iso(), | 1141 | "created_at": utc_now_iso(), |
| 1101 | "tenant_id": self.tenant_id, | 1142 | "tenant_id": self.tenant_id, |
| 1143 | + "dataset": dataset.summary() if dataset else None, | ||
| 1102 | "queries": list(queries), | 1144 | "queries": list(queries), |
| 1103 | "top_k": top_k, | 1145 | "top_k": top_k, |
| 1104 | "aggregate_metrics": aggregate, | 1146 | "aggregate_metrics": aggregate, |
| @@ -1106,10 +1148,20 @@ class SearchEvaluationFramework: | @@ -1106,10 +1148,20 @@ class SearchEvaluationFramework: | ||
| 1106 | "aggregate_distribution": aggregate_distribution, | 1148 | "aggregate_distribution": aggregate_distribution, |
| 1107 | "per_query": per_query, | 1149 | "per_query": per_query, |
| 1108 | "config_snapshot_path": str(config_snapshot_path), | 1150 | "config_snapshot_path": str(config_snapshot_path), |
| 1151 | + "dataset_snapshot_path": str(dataset_snapshot_path) if dataset_snapshot_path is not None else "", | ||
| 1152 | + "queries_snapshot_path": str(queries_snapshot_path) if queries_snapshot_path is not None else "", | ||
| 1109 | } | 1153 | } |
| 1110 | output_json_path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8") | 1154 | output_json_path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8") |
| 1111 | report_md_path.write_text(render_batch_report_markdown(payload), encoding="utf-8") | 1155 | report_md_path.write_text(render_batch_report_markdown(payload), encoding="utf-8") |
| 1112 | - self.store.insert_batch_run(batch_id, self.tenant_id, output_json_path, report_md_path, config_snapshot_path, payload) | 1156 | + self.store.insert_batch_run( |
| 1157 | + batch_id, | ||
| 1158 | + self.tenant_id, | ||
| 1159 | + output_json_path, | ||
| 1160 | + report_md_path, | ||
| 1161 | + config_snapshot_path, | ||
| 1162 | + payload, | ||
| 1163 | + dataset=dataset, | ||
| 1164 | + ) | ||
| 1113 | _log.info( | 1165 | _log.info( |
| 1114 | "[batch-eval] finished batch_id=%s per_query=%s json=%s", | 1166 | "[batch-eval] finished batch_id=%s per_query=%s json=%s", |
| 1115 | batch_id, | 1167 | batch_id, |
scripts/evaluation/eval_framework/reports.py
| @@ -67,9 +67,22 @@ def render_batch_report_markdown(payload: Dict[str, Any]) -> str: | @@ -67,9 +67,22 @@ def render_batch_report_markdown(payload: Dict[str, Any]) -> str: | ||
| 67 | f"- Query count: {len(payload['queries'])}", | 67 | f"- Query count: {len(payload['queries'])}", |
| 68 | f"- Top K: {payload['top_k']}", | 68 | f"- Top K: {payload['top_k']}", |
| 69 | "", | 69 | "", |
| 70 | - "## Aggregate Metrics", | ||
| 71 | - "", | ||
| 72 | ] | 70 | ] |
| 71 | + dataset = payload.get("dataset") or {} | ||
| 72 | + if dataset: | ||
| 73 | + lines.extend( | ||
| 74 | + [ | ||
| 75 | + "## Dataset", | ||
| 76 | + "", | ||
| 77 | + f"- Dataset ID: {dataset.get('dataset_id', '')}", | ||
| 78 | + f"- Display Name: {dataset.get('display_name', '')}", | ||
| 79 | + f"- Query File: {dataset.get('query_file', '')}", | ||
| 80 | + f"- Query Count: {dataset.get('query_count', '')}", | ||
| 81 | + f"- Query SHA1: {dataset.get('query_sha1', '')}", | ||
| 82 | + "", | ||
| 83 | + ] | ||
| 84 | + ) | ||
| 85 | + lines.extend(["## Aggregate Metrics", ""]) | ||
| 73 | metric_context = payload.get("metric_context") or {} | 86 | metric_context = payload.get("metric_context") or {} |
| 74 | if metric_context: | 87 | if metric_context: |
| 75 | lines.extend( | 88 | lines.extend( |
scripts/evaluation/eval_framework/static/eval_web.js
| @@ -4,6 +4,9 @@ async function fetchJSON(url, options) { | @@ -4,6 +4,9 @@ async function fetchJSON(url, options) { | ||
| 4 | return await res.json(); | 4 | return await res.json(); |
| 5 | } | 5 | } |
| 6 | 6 | ||
| 7 | +let _datasets = []; | ||
| 8 | +let _currentDatasetId = ""; | ||
| 9 | + | ||
| 7 | function fmtNumber(value, digits = 3) { | 10 | function fmtNumber(value, digits = 3) { |
| 8 | if (value == null || Number.isNaN(Number(value))) return "-"; | 11 | if (value == null || Number.isNaN(Number(value))) return "-"; |
| 9 | return Number(value).toFixed(digits); | 12 | return Number(value).toFixed(digits); |
| @@ -173,9 +176,18 @@ function renderTips(data) { | @@ -173,9 +176,18 @@ function renderTips(data) { | ||
| 173 | } | 176 | } |
| 174 | 177 | ||
| 175 | async function loadQueries() { | 178 | async function loadQueries() { |
| 176 | - const data = await fetchJSON("/api/queries"); | 179 | + if (!_currentDatasetId) return; |
| 180 | + const data = await fetchJSON("/api/datasets/" + encodeURIComponent(_currentDatasetId) + "/queries"); | ||
| 177 | const root = document.getElementById("queryList"); | 181 | const root = document.getElementById("queryList"); |
| 178 | root.innerHTML = ""; | 182 | root.innerHTML = ""; |
| 183 | + const dataset = data.dataset || {}; | ||
| 184 | + document.getElementById("queriesMeta").innerHTML = `Loaded from <code>${dataset.query_file || ""}</code>`; | ||
| 185 | + document.getElementById("datasetMeta").textContent = | ||
| 186 | + `${dataset.display_name || dataset.dataset_id || ""} · ${dataset.query_count || 0} queries`; | ||
| 187 | + document.getElementById("pageSubtitle").textContent = | ||
| 188 | + `Current dataset: ${dataset.display_name || dataset.dataset_id || ""}. Single-query evaluation and batch evaluation share the same service on port 6010.`; | ||
| 189 | + document.getElementById("batchButton").textContent = | ||
| 190 | + `Batch Evaluation: ${dataset.display_name || dataset.dataset_id || ""}`; | ||
| 179 | data.queries.forEach((query) => { | 191 | data.queries.forEach((query) => { |
| 180 | const btn = document.createElement("button"); | 192 | const btn = document.createElement("button"); |
| 181 | btn.className = "query-item"; | 193 | btn.className = "query-item"; |
| @@ -188,6 +200,26 @@ async function loadQueries() { | @@ -188,6 +200,26 @@ async function loadQueries() { | ||
| 188 | }); | 200 | }); |
| 189 | } | 201 | } |
| 190 | 202 | ||
async function loadDatasets() {
  // Fetch the registered datasets, remember the active one, and rebuild the
  // sidebar <select>. Changing the selection reloads queries and history for
  // the newly chosen dataset.
  const data = await fetchJSON("/api/datasets");
  _datasets = data.datasets || [];
  if (!_currentDatasetId) {
    // Prefer the server's notion of "current"; otherwise fall back to the
    // first registered dataset (or empty when none exist).
    _currentDatasetId = data.current_dataset_id || (_datasets[0] && _datasets[0].dataset_id) || "";
  }
  const select = document.getElementById("datasetSelect");
  select.innerHTML = "";
  for (const dataset of _datasets) {
    const opt = document.createElement("option");
    opt.value = dataset.dataset_id;
    opt.textContent = `${dataset.display_name || dataset.dataset_id} (${dataset.query_count || 0})`;
    if (dataset.dataset_id === _currentDatasetId) opt.selected = true;
    select.appendChild(opt);
  }
  select.onchange = async (ev) => {
    _currentDatasetId = ev.target.value;
    await loadQueries();
    await loadHistory();
  };
}
| 222 | + | ||
| 191 | function historySummaryHtml(meta) { | 223 | function historySummaryHtml(meta) { |
| 192 | const m = meta && meta.aggregate_metrics; | 224 | const m = meta && meta.aggregate_metrics; |
| 193 | const nq = (meta && meta.query_count) || (meta && meta.queries && meta.queries.length) || (meta && meta.per_query && meta.per_query.length) || null; | 225 | const nq = (meta && meta.query_count) || (meta && meta.queries && meta.queries.length) || (meta && meta.per_query && meta.per_query.length) || null; |
| @@ -203,7 +235,8 @@ function historySummaryHtml(meta) { | @@ -203,7 +235,8 @@ function historySummaryHtml(meta) { | ||
| 203 | } | 235 | } |
| 204 | 236 | ||
| 205 | async function loadHistory() { | 237 | async function loadHistory() { |
| 206 | - const data = await fetchJSON("/api/history"); | 238 | + if (!_currentDatasetId) return; |
| 239 | + const data = await fetchJSON("/api/history?dataset_id=" + encodeURIComponent(_currentDatasetId)); | ||
| 207 | const root = document.getElementById("history"); | 240 | const root = document.getElementById("history"); |
| 208 | root.classList.remove("muted"); | 241 | root.classList.remove("muted"); |
| 209 | const items = data.history || []; | 242 | const items = data.history || []; |
| @@ -219,8 +252,10 @@ async function loadHistory() { | @@ -219,8 +252,10 @@ async function loadHistory() { | ||
| 219 | btn.className = "history-item"; | 252 | btn.className = "history-item"; |
| 220 | btn.setAttribute("aria-label", `Open report ${item.batch_id}`); | 253 | btn.setAttribute("aria-label", `Open report ${item.batch_id}`); |
| 221 | const sum = historySummaryHtml(item.metadata); | 254 | const sum = historySummaryHtml(item.metadata); |
| 255 | + const dataset = (item.metadata && item.metadata.dataset) || {}; | ||
| 256 | + const datasetName = dataset.display_name || dataset.dataset_id || item.dataset_id || ""; | ||
| 222 | btn.innerHTML = `<div class="hid">${item.batch_id}</div> | 257 | btn.innerHTML = `<div class="hid">${item.batch_id}</div> |
| 223 | - <div class="hmeta">${item.created_at} · tenant ${item.tenant_id}</div>${sum}`; | 258 | + <div class="hmeta">${item.created_at} · tenant ${item.tenant_id}${datasetName ? ` · ${datasetName}` : ""}</div>${sum}`; |
| 224 | btn.onclick = () => openBatchReport(item.batch_id); | 259 | btn.onclick = () => openBatchReport(item.batch_id); |
| 225 | list.appendChild(btn); | 260 | list.appendChild(btn); |
| 226 | }); | 261 | }); |
| @@ -250,7 +285,10 @@ async function openBatchReport(batchId) { | @@ -250,7 +285,10 @@ async function openBatchReport(batchId) { | ||
| 250 | try { | 285 | try { |
| 251 | const rep = await fetchJSON("/api/history/" + encodeURIComponent(batchId) + "/report"); | 286 | const rep = await fetchJSON("/api/history/" + encodeURIComponent(batchId) + "/report"); |
| 252 | _lastReportPath = rep.report_markdown_path || ""; | 287 | _lastReportPath = rep.report_markdown_path || ""; |
| 253 | - metaEl.textContent = rep.report_markdown_path || ""; | 288 | + const dataset = rep.dataset || {}; |
| 289 | + metaEl.textContent = [dataset.display_name || dataset.dataset_id || "", rep.report_markdown_path || ""] | ||
| 290 | + .filter(Boolean) | ||
| 291 | + .join(" · "); | ||
| 254 | const raw = marked.parse(rep.markdown || "", { gfm: true }); | 292 | const raw = marked.parse(rep.markdown || "", { gfm: true }); |
| 255 | const safe = DOMPurify.sanitize(raw, { USE_PROFILES: { html: true } }); | 293 | const safe = DOMPurify.sanitize(raw, { USE_PROFILES: { html: true } }); |
| 256 | body.className = "report-modal-body batch-report-md"; | 294 | body.className = "report-modal-body batch-report-md"; |
| @@ -279,11 +317,11 @@ document.getElementById("reportCopyPath").addEventListener("click", async () => | @@ -279,11 +317,11 @@ document.getElementById("reportCopyPath").addEventListener("click", async () => | ||
| 279 | async function runSingle() { | 317 | async function runSingle() { |
| 280 | const query = document.getElementById("queryInput").value.trim(); | 318 | const query = document.getElementById("queryInput").value.trim(); |
| 281 | if (!query) return; | 319 | if (!query) return; |
| 282 | - document.getElementById("status").textContent = `Evaluating "${query}"...`; | 320 | + document.getElementById("status").textContent = `Evaluating "${query}" on ${_currentDatasetId}...`; |
| 283 | const data = await fetchJSON("/api/search-eval", { | 321 | const data = await fetchJSON("/api/search-eval", { |
| 284 | method: "POST", | 322 | method: "POST", |
| 285 | headers: { "Content-Type": "application/json" }, | 323 | headers: { "Content-Type": "application/json" }, |
| 286 | - body: JSON.stringify({ query, top_k: 100, auto_annotate: false }), | 324 | + body: JSON.stringify({ query, dataset_id: _currentDatasetId, top_k: 100, auto_annotate: false }), |
| 287 | }); | 325 | }); |
| 288 | document.getElementById("status").textContent = `Done. total=${data.total}`; | 326 | document.getElementById("status").textContent = `Done. total=${data.total}`; |
| 289 | renderMetrics(data.metrics, data.metric_context); | 327 | renderMetrics(data.metrics, data.metric_context); |
| @@ -294,19 +332,19 @@ async function runSingle() { | @@ -294,19 +332,19 @@ async function runSingle() { | ||
| 294 | } | 332 | } |
| 295 | 333 | ||
| 296 | async function runBatch() { | 334 | async function runBatch() { |
| 297 | - document.getElementById("status").textContent = "Running batch evaluation..."; | 335 | + document.getElementById("status").textContent = `Running batch evaluation for ${_currentDatasetId}...`; |
| 298 | const data = await fetchJSON("/api/batch-eval", { | 336 | const data = await fetchJSON("/api/batch-eval", { |
| 299 | method: "POST", | 337 | method: "POST", |
| 300 | headers: { "Content-Type": "application/json" }, | 338 | headers: { "Content-Type": "application/json" }, |
| 301 | - body: JSON.stringify({ top_k: 100, auto_annotate: false }), | 339 | + body: JSON.stringify({ dataset_id: _currentDatasetId, top_k: 100, auto_annotate: false }), |
| 302 | }); | 340 | }); |
| 303 | document.getElementById("status").textContent = `Batch done. report=${data.batch_id}`; | 341 | document.getElementById("status").textContent = `Batch done. report=${data.batch_id}`; |
| 304 | renderMetrics(data.aggregate_metrics, data.metric_context); | 342 | renderMetrics(data.aggregate_metrics, data.metric_context); |
| 305 | renderResults([], "results", true); | 343 | renderResults([], "results", true); |
| 306 | renderResults([], "missingRelevant", false); | 344 | renderResults([], "missingRelevant", false); |
| 307 | - document.getElementById("tips").innerHTML = '<div class="tip">Batch evaluation uses cached labels only unless force refresh is requested via CLI/API.</div>'; | 345 | + document.getElementById("tips").innerHTML = |
| 346 | + '<div class="tip">Batch evaluation uses cached labels only unless force refresh is requested via CLI/API.</div>'; | ||
| 308 | loadHistory(); | 347 | loadHistory(); |
| 309 | } | 348 | } |
| 310 | 349 | ||
| 311 | -loadQueries(); | ||
| 312 | -loadHistory(); | 350 | +loadDatasets().then(() => loadQueries()).then(() => loadHistory()); |
scripts/evaluation/eval_framework/static/index.html
| @@ -10,8 +10,13 @@ | @@ -10,8 +10,13 @@ | ||
| 10 | <body> | 10 | <body> |
| 11 | <div class="app"> | 11 | <div class="app"> |
| 12 | <aside class="sidebar"> | 12 | <aside class="sidebar"> |
| 13 | + <h2>Datasets</h2> | ||
| 14 | + <div class="section" style="padding-top:0"> | ||
| 15 | + <select id="datasetSelect" style="width:100%"></select> | ||
| 16 | + <p id="datasetMeta" class="muted" style="font-size:12px;margin:8px 0 0"></p> | ||
| 17 | + </div> | ||
| 13 | <h2>Queries</h2> | 18 | <h2>Queries</h2> |
| 14 | - <p class="muted">Loaded from <code>scripts/evaluation/queries/queries.txt</code></p> | 19 | + <p id="queriesMeta" class="muted">Loading dataset queries...</p> |
| 15 | <div id="queryList" class="query-list"></div> | 20 | <div id="queryList" class="query-list"></div> |
| 16 | <div class="section"> | 21 | <div class="section"> |
| 17 | <h2>History</h2> | 22 | <h2>History</h2> |
| @@ -21,11 +26,11 @@ | @@ -21,11 +26,11 @@ | ||
| 21 | </aside> | 26 | </aside> |
| 22 | <main class="main"> | 27 | <main class="main"> |
| 23 | <h1>Search Evaluation</h1> | 28 | <h1>Search Evaluation</h1> |
| 24 | - <p class="muted">Single-query evaluation and batch evaluation share the same service on port 6010.</p> | 29 | + <p id="pageSubtitle" class="muted">Single-query evaluation and batch evaluation share the same service on port 6010.</p> |
| 25 | <div class="toolbar"> | 30 | <div class="toolbar"> |
| 26 | <input id="queryInput" type="text" placeholder="Search query" /> | 31 | <input id="queryInput" type="text" placeholder="Search query" /> |
| 27 | <button onclick="runSingle()">Evaluate Query</button> | 32 | <button onclick="runSingle()">Evaluate Query</button> |
| 28 | - <button class="secondary" onclick="runBatch()">Batch Evaluation</button> | 33 | + <button id="batchButton" class="secondary" onclick="runBatch()">Batch Evaluation</button> |
| 29 | </div> | 34 | </div> |
| 30 | <div id="status" class="muted section"></div> | 35 | <div id="status" class="muted section"></div> |
| 31 | <section class="section"> | 36 | <section class="section"> |
scripts/evaluation/eval_framework/store.py
| @@ -9,6 +9,7 @@ from pathlib import Path | @@ -9,6 +9,7 @@ from pathlib import Path | ||
| 9 | from typing import Any, Dict, List, Optional, Sequence | 9 | from typing import Any, Dict, List, Optional, Sequence |
| 10 | 10 | ||
| 11 | from .constants import VALID_LABELS | 11 | from .constants import VALID_LABELS |
| 12 | +from .datasets import EvalDatasetSnapshot, infer_dataset_id_from_queries | ||
| 12 | from .utils import ensure_dir, safe_json_dumps, utc_now_iso | 13 | from .utils import ensure_dir, safe_json_dumps, utc_now_iso |
| 13 | 14 | ||
| 14 | 15 | ||
| @@ -24,10 +25,13 @@ class QueryBuildResult: | @@ -24,10 +25,13 @@ class QueryBuildResult: | ||
| 24 | 25 | ||
| 25 | 26 | ||
| 26 | def _compact_batch_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]: | 27 | def _compact_batch_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]: |
| 28 | + dataset = dict(metadata.get("dataset") or {}) | ||
| 27 | return { | 29 | return { |
| 28 | "batch_id": metadata.get("batch_id"), | 30 | "batch_id": metadata.get("batch_id"), |
| 29 | "created_at": metadata.get("created_at"), | 31 | "created_at": metadata.get("created_at"), |
| 30 | "tenant_id": metadata.get("tenant_id"), | 32 | "tenant_id": metadata.get("tenant_id"), |
| 33 | + "dataset": dataset, | ||
| 34 | + "dataset_id": dataset.get("dataset_id") or metadata.get("dataset_id"), | ||
| 31 | "top_k": metadata.get("top_k"), | 35 | "top_k": metadata.get("top_k"), |
| 32 | "query_count": len(metadata.get("queries") or []), | 36 | "query_count": len(metadata.get("queries") or []), |
| 33 | "aggregate_metrics": dict(metadata.get("aggregate_metrics") or {}), | 37 | "aggregate_metrics": dict(metadata.get("aggregate_metrics") or {}), |
| @@ -85,6 +89,11 @@ class EvalStore: | @@ -85,6 +89,11 @@ class EvalStore: | ||
| 85 | CREATE TABLE IF NOT EXISTS build_runs ( | 89 | CREATE TABLE IF NOT EXISTS build_runs ( |
| 86 | run_id TEXT PRIMARY KEY, | 90 | run_id TEXT PRIMARY KEY, |
| 87 | tenant_id TEXT NOT NULL, | 91 | tenant_id TEXT NOT NULL, |
| 92 | + dataset_id TEXT, | ||
| 93 | + dataset_display_name TEXT, | ||
| 94 | + dataset_query_file TEXT, | ||
| 95 | + dataset_query_count INTEGER, | ||
| 96 | + dataset_query_sha1 TEXT, | ||
| 88 | query_text TEXT NOT NULL, | 97 | query_text TEXT NOT NULL, |
| 89 | output_json_path TEXT NOT NULL, | 98 | output_json_path TEXT NOT NULL, |
| 90 | metadata_json TEXT NOT NULL, | 99 | metadata_json TEXT NOT NULL, |
| @@ -94,6 +103,11 @@ class EvalStore: | @@ -94,6 +103,11 @@ class EvalStore: | ||
| 94 | CREATE TABLE IF NOT EXISTS batch_runs ( | 103 | CREATE TABLE IF NOT EXISTS batch_runs ( |
| 95 | batch_id TEXT PRIMARY KEY, | 104 | batch_id TEXT PRIMARY KEY, |
| 96 | tenant_id TEXT NOT NULL, | 105 | tenant_id TEXT NOT NULL, |
| 106 | + dataset_id TEXT, | ||
| 107 | + dataset_display_name TEXT, | ||
| 108 | + dataset_query_file TEXT, | ||
| 109 | + dataset_query_count INTEGER, | ||
| 110 | + dataset_query_sha1 TEXT, | ||
| 97 | output_json_path TEXT NOT NULL, | 111 | output_json_path TEXT NOT NULL, |
| 98 | report_markdown_path TEXT NOT NULL, | 112 | report_markdown_path TEXT NOT NULL, |
| 99 | config_snapshot_path TEXT NOT NULL, | 113 | config_snapshot_path TEXT NOT NULL, |
| @@ -113,8 +127,31 @@ class EvalStore: | @@ -113,8 +127,31 @@ class EvalStore: | ||
| 113 | ); | 127 | ); |
| 114 | """ | 128 | """ |
| 115 | ) | 129 | ) |
| 130 | + self._ensure_column("build_runs", "dataset_id", "TEXT") | ||
| 131 | + self._ensure_column("build_runs", "dataset_display_name", "TEXT") | ||
| 132 | + self._ensure_column("build_runs", "dataset_query_file", "TEXT") | ||
| 133 | + self._ensure_column("build_runs", "dataset_query_count", "INTEGER") | ||
| 134 | + self._ensure_column("build_runs", "dataset_query_sha1", "TEXT") | ||
| 135 | + self._ensure_column("batch_runs", "dataset_id", "TEXT") | ||
| 136 | + self._ensure_column("batch_runs", "dataset_display_name", "TEXT") | ||
| 137 | + self._ensure_column("batch_runs", "dataset_query_file", "TEXT") | ||
| 138 | + self._ensure_column("batch_runs", "dataset_query_count", "INTEGER") | ||
| 139 | + self._ensure_column("batch_runs", "dataset_query_sha1", "TEXT") | ||
| 140 | + self.conn.execute( | ||
| 141 | + "CREATE INDEX IF NOT EXISTS idx_batch_runs_dataset_created ON batch_runs(dataset_id, created_at DESC)" | ||
| 142 | + ) | ||
| 143 | + self.conn.execute( | ||
| 144 | + "CREATE INDEX IF NOT EXISTS idx_build_runs_dataset_created ON build_runs(dataset_id, created_at DESC)" | ||
| 145 | + ) | ||
| 116 | self.conn.commit() | 146 | self.conn.commit() |
| 117 | 147 | ||
| 148 | + def _ensure_column(self, table: str, column: str, column_type: str) -> None: | ||
| 149 | + rows = self.conn.execute(f"PRAGMA table_info({table})").fetchall() | ||
| 150 | + existing = {str(row["name"]) for row in rows} | ||
| 151 | + if column in existing: | ||
| 152 | + return | ||
| 153 | + self.conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {column_type}") | ||
| 154 | + | ||
| 118 | def upsert_corpus_docs(self, tenant_id: str, docs: Sequence[Dict[str, Any]]) -> None: | 155 | def upsert_corpus_docs(self, tenant_id: str, docs: Sequence[Dict[str, Any]]) -> None: |
| 119 | now = utc_now_iso() | 156 | now = utc_now_iso() |
| 120 | rows = [] | 157 | rows = [] |
| @@ -302,13 +339,37 @@ class EvalStore: | @@ -302,13 +339,37 @@ class EvalStore: | ||
| 302 | ) | 339 | ) |
| 303 | self.conn.commit() | 340 | self.conn.commit() |
| 304 | 341 | ||
| 305 | - def insert_build_run(self, run_id: str, tenant_id: str, query_text: str, output_json_path: Path, metadata: Dict[str, Any]) -> None: | 342 | + def insert_build_run( |
| 343 | + self, | ||
| 344 | + run_id: str, | ||
| 345 | + tenant_id: str, | ||
| 346 | + query_text: str, | ||
| 347 | + output_json_path: Path, | ||
| 348 | + metadata: Dict[str, Any], | ||
| 349 | + dataset: Optional[EvalDatasetSnapshot] = None, | ||
| 350 | + ) -> None: | ||
| 351 | + dataset_info = dataset.summary() if dataset is not None else dict(metadata.get("dataset") or {}) | ||
| 306 | self.conn.execute( | 352 | self.conn.execute( |
| 307 | """ | 353 | """ |
| 308 | - INSERT OR REPLACE INTO build_runs (run_id, tenant_id, query_text, output_json_path, metadata_json, created_at) | ||
| 309 | - VALUES (?, ?, ?, ?, ?, ?) | 354 | + INSERT OR REPLACE INTO build_runs ( |
| 355 | + run_id, tenant_id, dataset_id, dataset_display_name, dataset_query_file, | ||
| 356 | + dataset_query_count, dataset_query_sha1, query_text, output_json_path, metadata_json, created_at | ||
| 357 | + ) | ||
| 358 | + VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) | ||
| 310 | """, | 359 | """, |
| 311 | - (run_id, tenant_id, query_text, str(output_json_path), safe_json_dumps(metadata), utc_now_iso()), | 360 | + ( |
| 361 | + run_id, | ||
| 362 | + tenant_id, | ||
| 363 | + dataset_info.get("dataset_id"), | ||
| 364 | + dataset_info.get("display_name"), | ||
| 365 | + dataset_info.get("query_file"), | ||
| 366 | + dataset_info.get("query_count"), | ||
| 367 | + dataset_info.get("query_sha1"), | ||
| 368 | + query_text, | ||
| 369 | + str(output_json_path), | ||
| 370 | + safe_json_dumps(metadata), | ||
| 371 | + utc_now_iso(), | ||
| 372 | + ), | ||
| 312 | ) | 373 | ) |
| 313 | self.conn.commit() | 374 | self.conn.commit() |
| 314 | 375 | ||
| @@ -320,16 +381,27 @@ class EvalStore: | @@ -320,16 +381,27 @@ class EvalStore: | ||
| 320 | report_markdown_path: Path, | 381 | report_markdown_path: Path, |
| 321 | config_snapshot_path: Path, | 382 | config_snapshot_path: Path, |
| 322 | metadata: Dict[str, Any], | 383 | metadata: Dict[str, Any], |
| 384 | + dataset: Optional[EvalDatasetSnapshot] = None, | ||
| 323 | ) -> None: | 385 | ) -> None: |
| 386 | + dataset_info = dataset.summary() if dataset is not None else dict(metadata.get("dataset") or {}) | ||
| 324 | self.conn.execute( | 387 | self.conn.execute( |
| 325 | """ | 388 | """ |
| 326 | INSERT OR REPLACE INTO batch_runs | 389 | INSERT OR REPLACE INTO batch_runs |
| 327 | - (batch_id, tenant_id, output_json_path, report_markdown_path, config_snapshot_path, metadata_json, created_at) | ||
| 328 | - VALUES (?, ?, ?, ?, ?, ?, ?) | 390 | + ( |
| 391 | + batch_id, tenant_id, dataset_id, dataset_display_name, dataset_query_file, | ||
| 392 | + dataset_query_count, dataset_query_sha1, output_json_path, report_markdown_path, | ||
| 393 | + config_snapshot_path, metadata_json, created_at | ||
| 394 | + ) | ||
| 395 | + VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) | ||
| 329 | """, | 396 | """, |
| 330 | ( | 397 | ( |
| 331 | batch_id, | 398 | batch_id, |
| 332 | tenant_id, | 399 | tenant_id, |
| 400 | + dataset_info.get("dataset_id"), | ||
| 401 | + dataset_info.get("display_name"), | ||
| 402 | + dataset_info.get("query_file"), | ||
| 403 | + dataset_info.get("query_count"), | ||
| 404 | + dataset_info.get("query_sha1"), | ||
| 333 | str(output_json_path), | 405 | str(output_json_path), |
| 334 | str(report_markdown_path), | 406 | str(report_markdown_path), |
| 335 | str(config_snapshot_path), | 407 | str(config_snapshot_path), |
| @@ -339,27 +411,59 @@ class EvalStore: | @@ -339,27 +411,59 @@ class EvalStore: | ||
| 339 | ) | 411 | ) |
| 340 | self.conn.commit() | 412 | self.conn.commit() |
| 341 | 413 | ||
| 342 | - def list_batch_runs(self, limit: int = 20) -> List[Dict[str, Any]]: | ||
| 343 | - rows = self.conn.execute( | ||
| 344 | - """ | ||
| 345 | - SELECT batch_id, tenant_id, output_json_path, report_markdown_path, config_snapshot_path, metadata_json, created_at | ||
| 346 | - FROM batch_runs | ||
| 347 | - ORDER BY created_at DESC | ||
| 348 | - LIMIT ? | ||
| 349 | - """, | ||
| 350 | - (limit,), | ||
| 351 | - ).fetchall() | 414 | + def list_batch_runs(self, limit: int = 20, dataset_id: Optional[str] = None) -> List[Dict[str, Any]]: |
| 415 | + if dataset_id: | ||
| 416 | + rows = self.conn.execute( | ||
| 417 | + """ | ||
| 418 | + SELECT batch_id, tenant_id, dataset_id, dataset_display_name, dataset_query_file, dataset_query_count, | ||
| 419 | + dataset_query_sha1, output_json_path, report_markdown_path, config_snapshot_path, metadata_json, created_at | ||
| 420 | + FROM batch_runs | ||
| 421 | + WHERE dataset_id=? | ||
| 422 | + ORDER BY created_at DESC | ||
| 423 | + LIMIT ? | ||
| 424 | + """, | ||
| 425 | + (dataset_id, limit), | ||
| 426 | + ).fetchall() | ||
| 427 | + else: | ||
| 428 | + rows = self.conn.execute( | ||
| 429 | + """ | ||
| 430 | + SELECT batch_id, tenant_id, dataset_id, dataset_display_name, dataset_query_file, dataset_query_count, | ||
| 431 | + dataset_query_sha1, output_json_path, report_markdown_path, config_snapshot_path, metadata_json, created_at | ||
| 432 | + FROM batch_runs | ||
| 433 | + ORDER BY created_at DESC | ||
| 434 | + LIMIT ? | ||
| 435 | + """, | ||
| 436 | + (limit,), | ||
| 437 | + ).fetchall() | ||
| 352 | items: List[Dict[str, Any]] = [] | 438 | items: List[Dict[str, Any]] = [] |
| 353 | for row in rows: | 439 | for row in rows: |
| 354 | metadata = json.loads(row["metadata_json"]) | 440 | metadata = json.loads(row["metadata_json"]) |
| 441 | + inferred_dataset_id = row["dataset_id"] or metadata.get("dataset_id") or infer_dataset_id_from_queries( | ||
| 442 | + metadata.get("queries") or [] | ||
| 443 | + ) | ||
| 444 | + dataset_meta = dict(metadata.get("dataset") or {}) | ||
| 445 | + if inferred_dataset_id and not dataset_meta.get("dataset_id"): | ||
| 446 | + dataset_meta["dataset_id"] = inferred_dataset_id | ||
| 447 | + if row["dataset_display_name"] and not dataset_meta.get("display_name"): | ||
| 448 | + dataset_meta["display_name"] = row["dataset_display_name"] | ||
| 449 | + if row["dataset_query_file"] and not dataset_meta.get("query_file"): | ||
| 450 | + dataset_meta["query_file"] = row["dataset_query_file"] | ||
| 451 | + if row["dataset_query_count"] and not dataset_meta.get("query_count"): | ||
| 452 | + dataset_meta["query_count"] = int(row["dataset_query_count"]) | ||
| 453 | + if row["dataset_query_sha1"] and not dataset_meta.get("query_sha1"): | ||
| 454 | + dataset_meta["query_sha1"] = row["dataset_query_sha1"] | ||
| 355 | items.append( | 455 | items.append( |
| 356 | { | 456 | { |
| 357 | "batch_id": row["batch_id"], | 457 | "batch_id": row["batch_id"], |
| 358 | "tenant_id": row["tenant_id"], | 458 | "tenant_id": row["tenant_id"], |
| 459 | + "dataset_id": inferred_dataset_id, | ||
| 359 | "output_json_path": row["output_json_path"], | 460 | "output_json_path": row["output_json_path"], |
| 360 | "report_markdown_path": row["report_markdown_path"], | 461 | "report_markdown_path": row["report_markdown_path"], |
| 361 | "config_snapshot_path": row["config_snapshot_path"], | 462 | "config_snapshot_path": row["config_snapshot_path"], |
| 362 | - "metadata": _compact_batch_metadata(metadata), | 463 | + "metadata": { |
| 464 | + **_compact_batch_metadata(metadata), | ||
| 465 | + "dataset": dataset_meta, | ||
| 466 | + }, | ||
| 363 | "created_at": row["created_at"], | 467 | "created_at": row["created_at"], |
| 364 | } | 468 | } |
| 365 | ) | 469 | ) |
| @@ -368,7 +472,8 @@ class EvalStore: | @@ -368,7 +472,8 @@ class EvalStore: | ||
| 368 | def get_batch_run(self, batch_id: str) -> Optional[Dict[str, Any]]: | 472 | def get_batch_run(self, batch_id: str) -> Optional[Dict[str, Any]]: |
| 369 | row = self.conn.execute( | 473 | row = self.conn.execute( |
| 370 | """ | 474 | """ |
| 371 | - SELECT batch_id, tenant_id, output_json_path, report_markdown_path, config_snapshot_path, metadata_json, created_at | 475 | + SELECT batch_id, tenant_id, dataset_id, dataset_display_name, dataset_query_file, dataset_query_count, |
| 476 | + dataset_query_sha1, output_json_path, report_markdown_path, config_snapshot_path, metadata_json, created_at | ||
| 372 | FROM batch_runs | 477 | FROM batch_runs |
| 373 | WHERE batch_id = ? | 478 | WHERE batch_id = ? |
| 374 | """, | 479 | """, |
| @@ -376,13 +481,32 @@ class EvalStore: | @@ -376,13 +481,32 @@ class EvalStore: | ||
| 376 | ).fetchone() | 481 | ).fetchone() |
| 377 | if row is None: | 482 | if row is None: |
| 378 | return None | 483 | return None |
| 484 | + metadata = json.loads(row["metadata_json"]) | ||
| 485 | + inferred_dataset_id = row["dataset_id"] or metadata.get("dataset_id") or infer_dataset_id_from_queries( | ||
| 486 | + metadata.get("queries") or [] | ||
| 487 | + ) | ||
| 488 | + dataset_meta = dict(metadata.get("dataset") or {}) | ||
| 489 | + if inferred_dataset_id and not dataset_meta.get("dataset_id"): | ||
| 490 | + dataset_meta["dataset_id"] = inferred_dataset_id | ||
| 491 | + if row["dataset_display_name"] and not dataset_meta.get("display_name"): | ||
| 492 | + dataset_meta["display_name"] = row["dataset_display_name"] | ||
| 493 | + if row["dataset_query_file"] and not dataset_meta.get("query_file"): | ||
| 494 | + dataset_meta["query_file"] = row["dataset_query_file"] | ||
| 495 | + if row["dataset_query_count"] and not dataset_meta.get("query_count"): | ||
| 496 | + dataset_meta["query_count"] = int(row["dataset_query_count"]) | ||
| 497 | + if row["dataset_query_sha1"] and not dataset_meta.get("query_sha1"): | ||
| 498 | + dataset_meta["query_sha1"] = row["dataset_query_sha1"] | ||
| 379 | return { | 499 | return { |
| 380 | "batch_id": row["batch_id"], | 500 | "batch_id": row["batch_id"], |
| 381 | "tenant_id": row["tenant_id"], | 501 | "tenant_id": row["tenant_id"], |
| 502 | + "dataset_id": inferred_dataset_id, | ||
| 382 | "output_json_path": row["output_json_path"], | 503 | "output_json_path": row["output_json_path"], |
| 383 | "report_markdown_path": row["report_markdown_path"], | 504 | "report_markdown_path": row["report_markdown_path"], |
| 384 | "config_snapshot_path": row["config_snapshot_path"], | 505 | "config_snapshot_path": row["config_snapshot_path"], |
| 385 | - "metadata": json.loads(row["metadata_json"]), | 506 | + "metadata": { |
| 507 | + **metadata, | ||
| 508 | + "dataset": dataset_meta, | ||
| 509 | + }, | ||
| 386 | "created_at": row["created_at"], | 510 | "created_at": row["created_at"], |
| 387 | } | 511 | } |
| 388 | 512 |
scripts/evaluation/eval_framework/web_app.py
| @@ -11,13 +11,15 @@ from fastapi.staticfiles import StaticFiles | @@ -11,13 +11,15 @@ from fastapi.staticfiles import StaticFiles | ||
| 11 | 11 | ||
| 12 | from .api_models import BatchEvalRequest, SearchEvalRequest | 12 | from .api_models import BatchEvalRequest, SearchEvalRequest |
| 13 | from .constants import DEFAULT_QUERY_FILE | 13 | from .constants import DEFAULT_QUERY_FILE |
| 14 | +from .datasets import list_registered_datasets, resolve_dataset | ||
| 14 | from .framework import SearchEvaluationFramework | 15 | from .framework import SearchEvaluationFramework |
| 15 | 16 | ||
| 16 | _STATIC_DIR = Path(__file__).resolve().parent / "static" | 17 | _STATIC_DIR = Path(__file__).resolve().parent / "static" |
| 17 | 18 | ||
| 18 | 19 | ||
| 19 | -def create_web_app(framework: SearchEvaluationFramework, query_file: Path = DEFAULT_QUERY_FILE) -> FastAPI: | 20 | +def create_web_app(framework: SearchEvaluationFramework, initial_dataset_id: str | None = None) -> FastAPI: |
| 20 | app = FastAPI(title="Search Evaluation UI", version="1.0.0") | 21 | app = FastAPI(title="Search Evaluation UI", version="1.0.0") |
| 22 | + current_dataset_id = initial_dataset_id or "core_queries" | ||
| 21 | 23 | ||
| 22 | app.mount( | 24 | app.mount( |
| 23 | "/static", | 25 | "/static", |
| @@ -31,35 +33,75 @@ def create_web_app(framework: SearchEvaluationFramework, query_file: Path = DEFA | @@ -31,35 +33,75 @@ def create_web_app(framework: SearchEvaluationFramework, query_file: Path = DEFA | ||
| 31 | def home() -> str: | 33 | def home() -> str: |
| 32 | return index_path.read_text(encoding="utf-8") | 34 | return index_path.read_text(encoding="utf-8") |
| 33 | 35 | ||
| 36 | + @app.get("/api/datasets") | ||
| 37 | + def api_datasets() -> Dict[str, Any]: | ||
| 38 | + stats_by_query = {item["query"]: item for item in framework.store.list_query_label_stats(framework.tenant_id)} | ||
| 39 | + datasets = [] | ||
| 40 | + for item in list_registered_datasets(enabled_only=True): | ||
| 41 | + snapshot = resolve_dataset(dataset_id=item.dataset_id, tenant_id=framework.tenant_id) | ||
| 42 | + labeled_queries = sum(1 for query in snapshot.queries if (stats_by_query.get(query) or {}).get("total", 0) > 0) | ||
| 43 | + datasets.append( | ||
| 44 | + { | ||
| 45 | + **snapshot.summary(), | ||
| 46 | + "coverage_summary": { | ||
| 47 | + "labeled_queries": labeled_queries, | ||
| 48 | + "coverage_ratio": (labeled_queries / snapshot.query_count) if snapshot.query_count else 0.0, | ||
| 49 | + }, | ||
| 50 | + } | ||
| 51 | + ) | ||
| 52 | + return {"datasets": datasets, "current_dataset_id": current_dataset_id} | ||
| 53 | + | ||
| 54 | + @app.get("/api/datasets/{dataset_id}/queries") | ||
| 55 | + def api_dataset_queries(dataset_id: str) -> Dict[str, Any]: | ||
| 56 | + dataset = resolve_dataset(dataset_id=dataset_id, tenant_id=framework.tenant_id, require_enabled=True) | ||
| 57 | + return {"dataset": dataset.summary(), "queries": list(dataset.queries)} | ||
| 58 | + | ||
| 34 | @app.get("/api/queries") | 59 | @app.get("/api/queries") |
| 35 | - def api_queries() -> Dict[str, Any]: | ||
| 36 | - return {"queries": framework.queries_from_file(query_file)} | 60 | + def api_queries(dataset_id: str | None = None) -> Dict[str, Any]: |
| 61 | + dataset = resolve_dataset(dataset_id=dataset_id or current_dataset_id, tenant_id=framework.tenant_id) | ||
| 62 | + return {"dataset": dataset.summary(), "queries": list(dataset.queries)} | ||
| 37 | 63 | ||
| 38 | @app.post("/api/search-eval") | 64 | @app.post("/api/search-eval") |
| 39 | def api_search_eval(request: SearchEvalRequest) -> Dict[str, Any]: | 65 | def api_search_eval(request: SearchEvalRequest) -> Dict[str, Any]: |
| 66 | + dataset = resolve_dataset( | ||
| 67 | + dataset_id=request.dataset_id or current_dataset_id, | ||
| 68 | + tenant_id=framework.tenant_id, | ||
| 69 | + language=request.language, | ||
| 70 | + ) | ||
| 40 | return framework.evaluate_live_query( | 71 | return framework.evaluate_live_query( |
| 41 | query=request.query, | 72 | query=request.query, |
| 42 | top_k=request.top_k, | 73 | top_k=request.top_k, |
| 43 | auto_annotate=request.auto_annotate, | 74 | auto_annotate=request.auto_annotate, |
| 44 | - language=request.language, | 75 | + language=dataset.language, |
| 76 | + dataset=dataset, | ||
| 45 | ) | 77 | ) |
| 46 | 78 | ||
| 47 | @app.post("/api/batch-eval") | 79 | @app.post("/api/batch-eval") |
| 48 | def api_batch_eval(request: BatchEvalRequest) -> Dict[str, Any]: | 80 | def api_batch_eval(request: BatchEvalRequest) -> Dict[str, Any]: |
| 49 | - queries = request.queries or framework.queries_from_file(query_file) | 81 | + dataset = resolve_dataset( |
| 82 | + dataset_id=request.dataset_id or current_dataset_id, | ||
| 83 | + tenant_id=framework.tenant_id, | ||
| 84 | + language=request.language, | ||
| 85 | + ) | ||
| 86 | + queries = request.queries or list(dataset.queries) | ||
| 50 | if not queries: | 87 | if not queries: |
| 51 | raise HTTPException(status_code=400, detail="No queries provided") | 88 | raise HTTPException(status_code=400, detail="No queries provided") |
| 52 | return framework.batch_evaluate( | 89 | return framework.batch_evaluate( |
| 53 | queries=queries, | 90 | queries=queries, |
| 91 | + dataset=dataset, | ||
| 54 | top_k=request.top_k, | 92 | top_k=request.top_k, |
| 55 | auto_annotate=request.auto_annotate, | 93 | auto_annotate=request.auto_annotate, |
| 56 | - language=request.language, | 94 | + language=dataset.language, |
| 57 | force_refresh_labels=request.force_refresh_labels, | 95 | force_refresh_labels=request.force_refresh_labels, |
| 58 | ) | 96 | ) |
| 59 | 97 | ||
| 60 | @app.get("/api/history") | 98 | @app.get("/api/history") |
| 61 | - def api_history() -> Dict[str, Any]: | ||
| 62 | - return {"history": framework.store.list_batch_runs(limit=20)} | 99 | + def api_history(dataset_id: str | None = None, limit: int = 20) -> Dict[str, Any]: |
| 100 | + effective_dataset_id = dataset_id or current_dataset_id | ||
| 101 | + return { | ||
| 102 | + "history": framework.store.list_batch_runs(limit=limit, dataset_id=effective_dataset_id), | ||
| 103 | + "dataset_id": effective_dataset_id, | ||
| 104 | + } | ||
| 63 | 105 | ||
| 64 | @app.get("/api/history/{batch_id}/report") | 106 | @app.get("/api/history/{batch_id}/report") |
| 65 | def api_history_report(batch_id: str) -> Dict[str, Any]: | 107 | def api_history_report(batch_id: str) -> Dict[str, Any]: |
| @@ -78,6 +120,7 @@ def create_web_app(framework: SearchEvaluationFramework, query_file: Path = DEFA | @@ -78,6 +120,7 @@ def create_web_app(framework: SearchEvaluationFramework, query_file: Path = DEFA | ||
| 78 | "batch_id": row["batch_id"], | 120 | "batch_id": row["batch_id"], |
| 79 | "created_at": row["created_at"], | 121 | "created_at": row["created_at"], |
| 80 | "tenant_id": row["tenant_id"], | 122 | "tenant_id": row["tenant_id"], |
| 123 | + "dataset": row["metadata"].get("dataset") or {}, | ||
| 81 | "report_markdown_path": str(report_path), | 124 | "report_markdown_path": str(report_path), |
| 82 | "markdown": report_path.read_text(encoding="utf-8"), | 125 | "markdown": report_path.read_text(encoding="utf-8"), |
| 83 | } | 126 | } |
scripts/evaluation/queries/all_keywords.txt.top1w.shuf.top1k.clothing_filtered
0 → 100644
| @@ -0,0 +1,771 @@ | @@ -0,0 +1,771 @@ | ||
| 1 | +白色oversized T-shirt | ||
| 2 | +falda negra oficina | ||
| 3 | +red fitted tee | ||
| 4 | +黒いミディ丈スカート | ||
| 5 | +黑色中长半身裙 | ||
| 6 | +فستان أسود متوسط الطول | ||
| 7 | +чёрное летнее платье | ||
| 8 | +修身牛仔裤 | ||
| 9 | +date night dress | ||
| 10 | +vacation outfit dress | ||
| 11 | +minimalist top | ||
| 12 | +streetwear t-shirt | ||
| 13 | +office casual blouse | ||
| 14 | +波西米亚花朵衬衫 | ||
| 15 | +泡泡袖短袖 | ||
| 16 | +扎染字母T恤 | ||
| 17 | +V-Neck Cotton T-shirt | ||
| 18 | +Athletic Gym T-shirt | ||
| 19 | +Plus Size Loose T-shirt | ||
| 20 | +Korean Style Slim T-shirt | ||
| 21 | +Basic Layering Top | ||
| 22 | +shawl collar cardigan | ||
| 23 | +swim dress | ||
| 24 | +毕业典礼定制西装 | ||
| 25 | +colorblock hoodie | ||
| 26 | +sock boots | ||
| 27 | +旅行服装 | ||
| 28 | +khaki green backpack | ||
| 29 | +皱边裙 | ||
| 30 | +高跟鞋 | ||
| 31 | +图案连身衣 | ||
| 32 | +天鹅绒鸡尾酒会礼服 | ||
| 33 | +gingham dress | ||
| 34 | +海滩度假装 | ||
| 35 | +vacation outfits | ||
| 36 | +running shorts | ||
| 37 | +pink sweater aesthetic | ||
| 38 | +hiking boots | ||
| 39 | +宽松开襟羊毛衫 | ||
| 40 | +business casual women | ||
| 41 | +a-line dress | ||
| 42 | +涤纶短裤 | ||
| 43 | +Compression Top Spandex | ||
| 44 | +skiing trip insulated base layer | ||
| 45 | +high waisted jeans | ||
| 46 | +无袖夏装 | ||
| 47 | +雪纺衬衫 | ||
| 48 | +convertible zip-off hiking pants | ||
| 49 | +petite summer linen shorts | ||
| 50 | +tall slim fit men's linen shirt | ||
| 51 | +tall slim fit trousers | ||
| 52 | +tall straight leg pants | ||
| 53 | +tassel maxi skirt | ||
| 54 | +teacher clothesジャミロクワイ | ||
| 55 | +barbie backpack | ||
| 56 | +bandanas for women | ||
| 57 | +columbia jacket men | ||
| 58 | +halloween pjs | ||
| 59 | +salwar suit | ||
| 60 | +bolsas | ||
| 61 | +jumpsuit herren | ||
| 62 | +nike sneakers | ||
| 63 | +tunics for women | ||
| 64 | +skiunterwäsche kinder | ||
| 65 | +long jacket for women winter wear | ||
| 66 | +cape | ||
| 67 | +playmobil einhorn | ||
| 68 | +mens socks size 10-13 | ||
| 69 | +wedding guest dress fall | ||
| 70 | +t shirt for men | ||
| 71 | +golf shirts for men | ||
| 72 | +barfußschuhe damen | ||
| 73 | +sweatshirts for women stylish | ||
| 74 | +toddler slippers | ||
| 75 | +silicone ring | ||
| 76 | +lululemon shorts | ||
| 77 | +hausschuhe kinder mädchen | ||
| 78 | +nba | ||
| 79 | +hazbin hotel | ||
| 80 | +alice in wonderland costume | ||
| 81 | +women's lingerie, sleep & lounge | ||
| 82 | +legami weihnachten | ||
| 83 | +blouse readymade | ||
| 84 | +portmonee herren | ||
| 85 | +womens snow pants | ||
| 86 | +tops für damen | ||
| 87 | +hangers | ||
| 88 | +snoopy gifts | ||
| 89 | +charlie kirk | ||
| 90 | +tennis skirt | ||
| 91 | +linen pants women | ||
| 92 | +dickies 874 | ||
| 93 | +skibrille damen | ||
| 94 | +kurtis | ||
| 95 | +warmer for men | ||
| 96 | +tactical gear | ||
| 97 | +thermo strumpfhose damen | ||
| 98 | +hiking pants women | ||
| 99 | +forest gump | ||
| 100 | +maternity shorts | ||
| 101 | +coat | ||
| 102 | +chiffon sarees for women | ||
| 103 | +weihnachtsohrringe | ||
| 104 | +gold heels | ||
| 105 | +kulturtasche damen | ||
| 106 | +tank tops for women stylish | ||
| 107 | +gefütterte matschhose | ||
| 108 | +mens sweatpants | ||
| 109 | +graphic print tops | ||
| 110 | +crop tops for women western wear | ||
| 111 | +bandanas for men | ||
| 112 | +black skirt for women | ||
| 113 | +spongebob costume | ||
| 114 | +red tank top woman | ||
| 115 | +hoka clifton 9 womens | ||
| 116 | +sambas | ||
| 117 | +loop schal damen | ||
| 118 | +ethnic wear | ||
| 119 | +cole haan women shoes | ||
| 120 | +pyjama damen | ||
| 121 | +koffer groß | ||
| 122 | +mochila kipling | ||
| 123 | +shirt dresses for women | ||
| 124 | +shapewear for saree | ||
| 125 | +boss herren | ||
| 126 | +red beanie | ||
| 127 | +demon slayer costume | ||
| 128 | +kids halloween costumes | ||
| 129 | +puma clothing | ||
| 130 | +faultier socken | ||
| 131 | +family christmas pajamas | ||
| 132 | +traditional dress for women | ||
| 133 | +mütze | ||
| 134 | +wonder woman costume adult | ||
| 135 | +golf glove | ||
| 136 | +closed toe sandals women | ||
| 137 | +ugly sweater men | ||
| 138 | +pajama pants | ||
| 139 | +bolsa maternidade | ||
| 140 | +lingerie for women naughty | ||
| 141 | +banarasi sarees for women | ||
| 142 | +robes for women | ||
| 143 | +portemonnaie herren | ||
| 144 | +churidar set for women with dupatta | ||
| 145 | +basketball shorts men | ||
| 146 | +casual kurta set for women | ||
| 147 | +outdoor hosen für herren | ||
| 148 | +rcb jersey | ||
| 149 | +womens jean shorts | ||
| 150 | +boob tape | ||
| 151 | +gym | ||
| 152 | +shirt fan | ||
| 153 | +sprayground backpack | ||
| 154 | +twisters | ||
| 155 | +handschuhe mit heizung | ||
| 156 | +stirnband damen | ||
| 157 | +cowboy hat men | ||
| 158 | +vans shoes men | ||
| 159 | +weste damen | ||
| 160 | +old money clothes | ||
| 161 | +womens shorts casual | ||
| 162 | +new balance damen | ||
| 163 | +slim wallet for men | ||
| 164 | +red corset top | ||
| 165 | +underwear for women combo | ||
| 166 | +summer tops for seniors | ||
| 167 | +carry on luggage | ||
| 168 | +botas vaqueras para mujer | ||
| 169 | +freddy krueger sweater | ||
| 170 | +herren jeans | ||
| 171 | +calvin klein unterhosen männer | ||
| 172 | +pool bag | ||
| 173 | +toms womens shoes | ||
| 174 | +full sleeve tshirt for men | ||
| 175 | +golf accessories | ||
| 176 | +men socks | ||
| 177 | +skull mask | ||
| 178 | +jacketfor men | ||
| 179 | +heated vest women | ||
| 180 | +kostüm damen | ||
| 181 | +lululemon crossbody bag | ||
| 182 | +cap | ||
| 183 | +white tops for women | ||
| 184 | +jack | ||
| 185 | +wollsocken | ||
| 186 | +hoodie for women | ||
| 187 | +toddler snow suit | ||
| 188 | +felt | ||
| 189 | +eastpak bauchtasche | ||
| 190 | +fitness clothing | ||
| 191 | +women kurta | ||
| 192 | +mira costume kids | ||
| 193 | +camisa masculina | ||
| 194 | +black sneakers for men | ||
| 195 | +easter dresses for women 2025 | ||
| 196 | +maria | ||
| 197 | +oversized shirts for women | ||
| 198 | +ballettkleidung mädchen | ||
| 199 | +shapewear petticoat for women | ||
| 200 | +beheizbare socken | ||
| 201 | +kofferset | ||
| 202 | +winter slippers for woman | ||
| 203 | +denim shirt women | ||
| 204 | +nachthemd damen langarm | ||
| 205 | +white mini dress | ||
| 206 | +hanes boxer briefs for men | ||
| 207 | +hausschuhe | ||
| 208 | +bomber jacket for man | ||
| 209 | +herren jogginghose | ||
| 210 | +u.s. polo assn. | ||
| 211 | +regenhose damen | ||
| 212 | +mens sweatshirt | ||
| 213 | +north face jacket men | ||
| 214 | +white sweater | ||
| 215 | +small backpack | ||
| 216 | +santa hats | ||
| 217 | +duffel bag | ||
| 218 | +sneaker herren | ||
| 219 | +hello kitty pajamas | ||
| 220 | +ecco herren schuhe | ||
| 221 | +angel costume for girls | ||
| 222 | +toe rings for women | ||
| 223 | +nightgowns for women | ||
| 224 | +boys easter shirt | ||
| 225 | +red sarees for women | ||
| 226 | +womens jacket | ||
| 227 | +one piece swimsuit women tummy control | ||
| 228 | +fersensporn einlagen | ||
| 229 | +skechers for women | ||
| 230 | +wintermütze herren | ||
| 231 | +socks for woman | ||
| 232 | +winter wear for men | ||
| 233 | +meerjungfrau | ||
| 234 | +kurti pant set with dupatta | ||
| 235 | +hiking shoes women | ||
| 236 | +womens fall clothes sale | ||
| 237 | +skinny fit | ||
| 238 | +costumes for adults | ||
| 239 | +green tights | ||
| 240 | +purses | ||
| 241 | +clutch purses for women | ||
| 242 | +relogio | ||
| 243 | +schürze | ||
| 244 | +papa geschenk | ||
| 245 | +airtag holder | ||
| 246 | +mardi gras beads | ||
| 247 | +women's skirts | ||
| 248 | +sheer black tights | ||
| 249 | +red kurta set for women | ||
| 250 | +bunny costume | ||
| 251 | +sunglasses | ||
| 252 | +malas e mochilas | ||
| 253 | +sweat set for women | ||
| 254 | +red top | ||
| 255 | +code set for women stylish latest | ||
| 256 | +football jersey for boys | ||
| 257 | +jogginghose damen | ||
| 258 | +flanell pyjama damen | ||
| 259 | +herren t shirt | ||
| 260 | +us polo t shirts for men | ||
| 261 | +bodysuits for women | ||
| 262 | +necessaire feminina | ||
| 263 | +wig cap | ||
| 264 | +pullover damen winter | ||
| 265 | +half sweater for man | ||
| 266 | +new balance herren | ||
| 267 | +mala de viagem 10kg | ||
| 268 | +dog costume | ||
| 269 | +shoes for man stylish | ||
| 270 | +crotchless lingerie outfits | ||
| 271 | +postpartum belly band | ||
| 272 | +sporthose herren kurz | ||
| 273 | +pride shirt | ||
| 274 | +panty for women | ||
| 275 | +kaftan kurti for women | ||
| 276 | +jogginghose herren nike | ||
| 277 | +christmas onesie adult | ||
| 278 | +period panty | ||
| 279 | +wedding guest dress | ||
| 280 | +womens dress pants | ||
| 281 | +key chain | ||
| 282 | +short kurtis for woman | ||
| 283 | +white kurta set for women | ||
| 284 | +boys water shoes | ||
| 285 | +cargo pants for women high waist | ||
| 286 | +チャンピオン パーカー | ||
| 287 | +chikankari kurta for men | ||
| 288 | +sally costume women | ||
| 289 | +mittens for women | ||
| 290 | +gay | ||
| 291 | +eastpak rucksack | ||
| 292 | +simple joys by carters | ||
| 293 | +strickjacke herren | ||
| 294 | +jorts | ||
| 295 | +womens one piece swimsuits | ||
| 296 | +batman | ||
| 297 | +church dresses for women 2025 | ||
| 298 | +bra | ||
| 299 | +nike socken damen 35-38 | ||
| 300 | +loafers for women | ||
| 301 | +denim top | ||
| 302 | +wärmesohlen für schuhe | ||
| 303 | +vivaia shoes for women | ||
| 304 | +louis phillips shirt for men | ||
| 305 | +sexy night dress for women honeymoon | ||
| 306 | +cap for men | ||
| 307 | +jockey women | ||
| 308 | +damen wintermantel | ||
| 309 | +thermal for men | ||
| 310 | +warme socken damen | ||
| 311 | +panty for women daily use | ||
| 312 | +long tops for woman | ||
| 313 | +golf gifts for men | ||
| 314 | +rieker winterschuhe damen | ||
| 315 | +beach wear dress for women | ||
| 316 | +kurta pajama set for men | ||
| 317 | +baniyan for man | ||
| 318 | +laufweste herren | ||
| 319 | +nursing bras | ||
| 320 | +pj sets for woman | ||
| 321 | +louis philippe shirts for men | ||
| 322 | +喪服 メンズ | ||
| 323 | +sundress | ||
| 324 | +dresses for women western wear | ||
| 325 | +white sandals | ||
| 326 | +mochila notebook | ||
| 327 | +punjabi for men | ||
| 328 | +linen pants men | ||
| 329 | +libas kurta set for women | ||
| 330 | +jack and jones jeans herren | ||
| 331 | +men underwear | ||
| 332 | +dresses for teens | ||
| 333 | +workout set | ||
| 334 | +carmesi period panties for women | ||
| 335 | +men jackets | ||
| 336 | +mütze jungen | ||
| 337 | +marco polo damen | ||
| 338 | +anarkali suit for women party wear | ||
| 339 | +freizeithose herren | ||
| 340 | +green wig | ||
| 341 | +premium brand deals | ||
| 342 | +plain sarees for women | ||
| 343 | +scarf for women stylish | ||
| 344 | +longchamp organizer insert | ||
| 345 | +アンダーアーマー パーカー | ||
| 346 | +red sweater for women | ||
| 347 | +kurti tops | ||
| 348 | +cowboy boots | ||
| 349 | +norweger pullover herren | ||
| 350 | +cupshe bathing suits for women | ||
| 351 | +reading glasses for women | ||
| 352 | +ugg boots damen | ||
| 353 | +short sleeve shirts for women | ||
| 354 | +girls snow boots | ||
| 355 | +fall pajamas | ||
| 356 | +go devil t shirt | ||
| 357 | +golf deals | ||
| 358 | +essentials hoodie | ||
| 359 | +kerala sarees for women latest design | ||
| 360 | +jeans tops for women | ||
| 361 | +steppmantel damen winter | ||
| 362 | +bombas | ||
| 363 | +jeans pant for man | ||
| 364 | +stiefel | ||
| 365 | +spring tops for women 2025 | ||
| 366 | +wireless bras for women | ||
| 367 | +plus size dresses for curvy women | ||
| 368 | +tinkerbell costume for women | ||
| 369 | +tênis masculino | ||
| 370 | +panty | ||
| 371 | +sequence sarees for women | ||
| 372 | +adidas socken herren 43-46 | ||
| 373 | +top for women | ||
| 374 | +racerback tank tops for women | ||
| 375 | +old lady costume for kids | ||
| 376 | +lola bunny costume | ||
| 377 | +kurta pant set for women | ||
| 378 | +woolen cap for man | ||
| 379 | +onesie | ||
| 380 | +high waisted shorts women | ||
| 381 | +newborn girl clothes | ||
| 382 | +gold heels for women | ||
| 383 | +vikings | ||
| 384 | +sweterfor women winter stylish plain black colour without button | ||
| 385 | +sweater for kids | ||
| 386 | +fascinators hats for women | ||
| 387 | +zudio | ||
| 388 | +curious george costume | ||
| 389 | +wrangler purse | ||
| 390 | +tank top with built in bra for women | ||
| 391 | +bikini damen set | ||
| 392 | +women kurta sets | ||
| 393 | +suits for women | ||
| 394 | +basketball gifts | ||
| 395 | +alien costume women | ||
| 396 | +womens sweatpants | ||
| 397 | +crocs masculino | ||
| 398 | +travel pants | ||
| 399 | +yeoreo leggings | ||
| 400 | +cotton shirts for men | ||
| 401 | +winter gloves | ||
| 402 | +period underwear | ||
| 403 | +vaude | ||
| 404 | +hausschuhe herren | ||
| 405 | +crocs feminino | ||
| 406 | +woolen cap for men | ||
| 407 | +beheizbare einlegesohlen | ||
| 408 | +relógios masculinos | ||
| 409 | +uggs kids | ||
| 410 | +fleece lined tights | ||
| 411 | +feeding dresses for women full set | ||
| 412 | +hausschuhe damen | ||
| 413 | +garment bag | ||
| 414 | +lioness | ||
| 415 | +birkenstock sandals women | ||
| 416 | +リーバイス 501 | ||
| 417 | +nippies | ||
| 418 | +elsa kostüm mädchen | ||
| 419 | +viking costume men | ||
| 420 | +dirndl dresses women | ||
| 421 | +platform sandals women | ||
| 422 | +taschen damen | ||
| 423 | +pretty garden dresses | ||
| 424 | +saree | ||
| 425 | +soft silk sarees for women | ||
| 426 | +white heels | ||
| 427 | +shoes for women | ||
| 428 | +panama jack herren | ||
| 429 | +coveralls for men | ||
| 430 | +shirt for man | ||
| 431 | +pullover damen herbst | ||
| 432 | +concert outfits for women | ||
| 433 | +running shoes for women | ||
| 434 | +calvin klein | ||
| 435 | +cat costume | ||
| 436 | +shorts for kids girls | ||
| 437 | +fahrradhandschuhe damen | ||
| 438 | +botas de trabajo para hombre | ||
| 439 | +plus size winter clothes for women | ||
| 440 | +silicone rings for her | ||
| 441 | +dr scholls women shoes | ||
| 442 | +porch goose outfits | ||
| 443 | +the grinch | ||
| 444 | +green kurta set for women | ||
| 445 | +ratchet belts for men | ||
| 446 | +pajamas | ||
| 447 | +binders | ||
| 448 | +crop top for women stylish western | ||
| 449 | +gold chain for men | ||
| 450 | +turtle necks tops for women | ||
| 451 | +veirdo hoodies for men | ||
| 452 | +kette | ||
| 453 | +sweater for men winter wear | ||
| 454 | +hippie costume women | ||
| 455 | +garmin watch | ||
| 456 | +wallet | ||
| 457 | +silk sarees | ||
| 458 | +chuteira society | ||
| 459 | +knee support for men gym | ||
| 460 | +comfiest airport outfits | ||
| 461 | +leather belt for men | ||
| 462 | +nike tech | ||
| 463 | +golf gifts | ||
| 464 | +winterstiefel mädchen | ||
| 465 | +family pajamas matching sets | ||
| 466 | +vest for women | ||
| 467 | +construction vest | ||
| 468 | +snow pants men | ||
| 469 | +スプリングコート メンズ | ||
| 470 | +women sandals | ||
| 471 | +cap headbands for graduation insert | ||
| 472 | +ニューエラ パーカー | ||
| 473 | +haarspangen damen | ||
| 474 | +hand gloves for bike riding | ||
| 475 | +short dresses for women | ||
| 476 | +tween girls trendy stuff | ||
| 477 | +suit | ||
| 478 | +turtle neck t-shirt for men | ||
| 479 | +geldbörse | ||
| 480 | +leotards for girls | ||
| 481 | +hiking shoes men | ||
| 482 | +baseball bag | ||
| 483 | +passport holder for travel | ||
| 484 | +hoodies for men | ||
| 485 | +ski jacket women | ||
| 486 | +puma tshirt for man | ||
| 487 | +lehenga for women latest design | ||
| 488 | +basketball shoes | ||
| 489 | +baumwollhandschuhe | ||
| 490 | +strumpfhose mädchen | ||
| 491 | +jessie toy story costume adult | ||
| 492 | +womens underwear cotton | ||
| 493 | +floral dresses for women | ||
| 494 | +short kurti for women for jeans | ||
| 495 | +stocking stuffers for teen boys | ||
| 496 | +yoga mat for woman | ||
| 497 | +womens sun hat | ||
| 498 | +disfraz de halloween de hombre | ||
| 499 | +high heels | ||
| 500 | +trousers for men | ||
| 501 | +vampire costume men | ||
| 502 | +black tie | ||
| 503 | +spiderman hoodie zip up | ||
| 504 | +couples halloween costumes 2025 | ||
| 505 | +nike sweatpants for men | ||
| 506 | +brown corset | ||
| 507 | +last day of school teacher shirt | ||
| 508 | +mens costume | ||
| 509 | +baby doll night dress sexy | ||
| 510 | +men kurta pajama set | ||
| 511 | +nose studs | ||
| 512 | +mens winter jackets | ||
| 513 | +lingerie for women sexy slutty | ||
| 514 | +vera bradley | ||
| 515 | +womens slides | ||
| 516 | +krishna dress for baby girl | ||
| 517 | +black leggings women | ||
| 518 | +satch schulrucksack jungen | ||
| 519 | +mother of bride dresses | ||
| 520 | +parx | ||
| 521 | +fall clothes | ||
| 522 | +suuksess | ||
| 523 | +engagement rings for women | ||
| 524 | +bademantel damen flauschig | ||
| 525 | +levis jeans | ||
| 526 | +red wig | ||
| 527 | +flowy pants for women | ||
| 528 | +maternity underwear | ||
| 529 | +white button down shirt women | ||
| 530 | +the north face jacke damen | ||
| 531 | +renaissance costume women | ||
| 532 | +matching pajamas for couples | ||
| 533 | +tankini deals for retired women | ||
| 534 | +formal shirts | ||
| 535 | +socks for men 9-12 | ||
| 536 | +white tights | ||
| 537 | +space jam | ||
| 538 | +bodysuit | ||
| 539 | +mens pants | ||
| 540 | +shirt for men stylish | ||
| 541 | +ugg clogs | ||
| 542 | +waist beads | ||
| 543 | +peignoirs femme | ||
| 544 | +designer sarees for women party wear | ||
| 545 | +white dress shirt for men | ||
| 546 | +pullover | ||
| 547 | +mens halloween costume | ||
| 548 | +wellensteyn jacke herren | ||
| 549 | +no show socks men | ||
| 550 | +winter sneaker damen | ||
| 551 | +bordeauxfarbener hoodie | ||
| 552 | +rcb jersey 2025 | ||
| 553 | +ステューシー パーカー | ||
| 554 | +vampire costume female | ||
| 555 | +boys christmas pajamas | ||
| 556 | +women hoodies for winter | ||
| 557 | +fashion accessories | ||
| 558 | +black crocs | ||
| 559 | +gloves for men | ||
| 560 | +vizzela | ||
| 561 | +men pants | ||
| 562 | +wheres waldo costume | ||
| 563 | +toddler boots | ||
| 564 | +shark onesie | ||
| 565 | +body suit | ||
| 566 | +gym gloves | ||
| 567 | +tights | ||
| 568 | +leather jacket men | ||
| 569 | +damenuhr | ||
| 570 | +chikankari kurti | ||
| 571 | +small fan | ||
| 572 | +ugg tasman | ||
| 573 | +christmas sweater | ||
| 574 | +fairy costume for girls | ||
| 575 | +skechers winterschuhe damen | ||
| 576 | +adidas spezial damen | ||
| 577 | +hand gloves | ||
| 578 | +beheizbare jacke | ||
| 579 | +summer clothes for women | ||
| 580 | +leggings | ||
| 581 | +brown heels | ||
| 582 | +rain poncho | ||
| 583 | +rompers | ||
| 584 | +renaissance costume men | ||
| 585 | +christmas earrings | ||
| 586 | +home slippers for women soft | ||
| 587 | +puma cap men | ||
| 588 | +rain boots kids | ||
| 589 | +strickkleid damen herbst | ||
| 590 | +jockey thermal wear for men | ||
| 591 | +dresses for girls | ||
| 592 | +bambus socken | ||
| 593 | +raincoat for men waterproof | ||
| 594 | +red lingerie for women | ||
| 595 | +bathing suits | ||
| 596 | +strawberry shortcake costume | ||
| 597 | +victoria | ||
| 598 | +carhartt pants for men | ||
| 599 | +tennis shoes | ||
| 600 | +indo western dress for men | ||
| 601 | +tung tung tung sahur costume | ||
| 602 | +bogg bag charms | ||
| 603 | +football socks | ||
| 604 | +compression t shirt | ||
| 605 | +house slippers | ||
| 606 | +digital watch | ||
| 607 | +sneaker damen | ||
| 608 | +tracksuit men | ||
| 609 | +unterwäsche herren | ||
| 610 | +mens halloween costumes | ||
| 611 | +women saree | ||
| 612 | +polka dot top | ||
| 613 | +anniversary gifts for men | ||
| 614 | +badelatschen herren | ||
| 615 | +adidas shoes for women | ||
| 616 | +sleeveless t shirts for men | ||
| 617 | +cross necklace for women | ||
| 618 | +nursing bras for breastfeeding | ||
| 619 | +braune strumpfhose damen | ||
| 620 | +wedding dress for women | ||
| 621 | +churidar set for women | ||
| 622 | +mens golf shorts | ||
| 623 | +feeding kurtis for women cotton | ||
| 624 | +boho dresses for women | ||
| 625 | +damensch underwear for men | ||
| 626 | +night suit for women cotton | ||
| 627 | +corduroy pants women | ||
| 628 | +adidas track suit for man | ||
| 629 | +dresses for women 2025 | ||
| 630 | +cotton night suit for women | ||
| 631 | +carhartt hoodie | ||
| 632 | +jackets for men stylish latest | ||
| 633 | +levis jeans for men | ||
| 634 | +fall deals | ||
| 635 | +mesh backpack | ||
| 636 | +necessaire | ||
| 637 | +umhängetasche herren | ||
| 638 | +バドミントン ウェア | ||
| 639 | +winterhandschuhe kinder | ||
| 640 | +sully monsters inc costume | ||
| 641 | +fleece lined tights women | ||
| 642 | +アイズフロンティア 防寒着 | ||
| 643 | +organza kurta set for women | ||
| 644 | +straw hat | ||
| 645 | +tabaktasche | ||
| 646 | +puma | ||
| 647 | +ready to wear sarees for women | ||
| 648 | +teacher shirts | ||
| 649 | +brille | ||
| 650 | +スカジャン | ||
| 651 | +luxury outfits for women | ||
| 652 | +winter boots for men | ||
| 653 | +uhr damen | ||
| 654 | +black lace top | ||
| 655 | +dress for women | ||
| 656 | +rumi kpop demon hunters costume | ||
| 657 | +women sweater | ||
| 658 | +puma sneaker herren | ||
| 659 | +harry potter costume kids | ||
| 660 | +whisper period panties | ||
| 661 | +merino shirt damen | ||
| 662 | +blouse for women | ||
| 663 | +mens gym shorts | ||
| 664 | +printed top | ||
| 665 | +elphaba costume | ||
| 666 | +halloween sweatshirts for women | ||
| 667 | +rieker boots damen | ||
| 668 | +arbeitstasche damen | ||
| 669 | +turning point usa shirt | ||
| 670 | +lycra track pants | ||
| 671 | +puffer vests for women | ||
| 672 | +freddy krueger costume women | ||
| 673 | +pandora | ||
| 674 | +oberteile damen | ||
| 675 | +ariat boots mens | ||
| 676 | +elmo | ||
| 677 | +kpop demon hunters backpack | ||
| 678 | +plus size costumes for women | ||
| 679 | +tommy hilfiger herren jacke | ||
| 680 | +woolen kurti for women | ||
| 681 | +funny st patricks day shirt | ||
| 682 | +100 days of school costume | ||
| 683 | +formal dresses | ||
| 684 | +bandhani saree | ||
| 685 | +knee high boots women teaieui | ||
| 686 | +skechers sandals for woman | ||
| 687 | +affenzahn rucksack | ||
| 688 | +tube tops for women with built in bra | ||
| 689 | +jack and jones | ||
| 690 | +chudidars set for women | ||
| 691 | +kids dress girls | ||
| 692 | +jack wolfskin jacke damen | ||
| 693 | +anarkali kurtis for women | ||
| 694 | +northface backpack for school | ||
| 695 | +wide calf boots for women | ||
| 696 | +halloween costumes for men | ||
| 697 | +mens t shirts with collar | ||
| 698 | +tênis feminino | ||
| 699 | +sling bag for men | ||
| 700 | +sports jacket for men | ||
| 701 | +コロンビア ダウンジャケット | ||
| 702 | +fuzzy socks | ||
| 703 | +faja body shaper | ||
| 704 | +women tank tops | ||
| 705 | +us polo tshirt for men | ||
| 706 | +chocolate brown dress | ||
| 707 | +sandalia masculina | ||
| 708 | +coach | ||
| 709 | +ブライダルインナー | ||
| 710 | +boxer briefs for men pack | ||
| 711 | +the upside | ||
| 712 | +womens t shirts | ||
| 713 | +us polo shirt | ||
| 714 | +kashmiri kurta set for women | ||
| 715 | +dress shoes for men | ||
| 716 | +korean pants for woman | ||
| 717 | +nipple covers for women | ||
| 718 | +sporttasche herren | ||
| 719 | +running shoes for men | ||
| 720 | +swarovski kette | ||
| 721 | +indo era kurta set with dupatta for women | ||
| 722 | +brown tights | ||
| 723 | +handbags | ||
| 724 | +sporttasche | ||
| 725 | +tshirts for women | ||
| 726 | +nighty for women stylish | ||
| 727 | +overalls for men | ||
| 728 | +palazzo pants for women | ||
| 729 | +sperry shoes for men | ||
| 730 | +lululemon jacket | ||
| 731 | +geschenk mädchen 9 jahre | ||
| 732 | +human hair wig | ||
| 733 | +lowa wanderschuhe herren | ||
| 734 | +clarks shoes for women | ||
| 735 | +jockey vest for man | ||
| 736 | +winter dress for women stylish | ||
| 737 | +black cardigan for women | ||
| 738 | +charlie kirk hat | ||
| 739 | +toddler water shoes | ||
| 740 | +rieker stiefeletten für damen | ||
| 741 | +golf shoes men | ||
| 742 | +presente masculino | ||
| 743 | +tenis nike para mujer | ||
| 744 | +stocking | ||
| 745 | +gabor stiefeletten damen | ||
| 746 | +uggs women | ||
| 747 | +petite dresses for women 5 ft | ||
| 748 | +cotton dress for woman | ||
| 749 | +white pant for man | ||
| 750 | +black saree party wear | ||
| 751 | +allen solly t shirts for men | ||
| 752 | +fahrradhandschuhe | ||
| 753 | +コンバース | ||
| 754 | +dr martens womens boots | ||
| 755 | +sweater for boys | ||
| 756 | +weitschaftstiefel damen | ||
| 757 | +maternity dress | ||
| 758 | +stiefel damen schwarz | ||
| 759 | +アンダーアーマー tシャツ | ||
| 760 | +coach purse | ||
| 761 | +bombas socks for women | ||
| 762 | +small crossbody bags for women | ||
| 763 | +night dress | ||
| 764 | +abendkleid | ||
| 765 | +summer outfits for women | ||
| 766 | +winterkleider damen | ||
| 767 | +straight fit jeans for women | ||
| 768 | +bolsa de viagem | ||
| 769 | +rain boots women | ||
| 770 | +korean tops for women | ||
| 771 | +bullmer |
| @@ -0,0 +1,371 @@ | @@ -0,0 +1,371 @@ | ||
| 1 | +ultrasonic jewelry cleaner | ||
| 2 | +roland kaiser | ||
| 3 | +camping ausrüstung | ||
| 4 | +transformers | ||
| 5 | +badminton | ||
| 6 | +burts bees | ||
| 7 | +barbie accessories | ||
| 8 | +gel nail polish remover | ||
| 9 | +thrive causemetics | ||
| 10 | +garmin uhr | ||
| 11 | +fathers day gift | ||
| 12 | +concealer | ||
| 13 | +pack n play | ||
| 14 | +balloonerism | ||
| 15 | +amazon outlet | ||
| 16 | +running essentials | ||
| 17 | +snoopy geschenke | ||
| 18 | +new born baby essentials | ||
| 19 | +super kitties | ||
| 20 | +canvas | ||
| 21 | +transformers age of the primes | ||
| 22 | +tea pot | ||
| 23 | +rosary | ||
| 24 | +silverette nursing cups | ||
| 25 | +n95 mask for men | ||
| 26 | +yeti camino 20 | ||
| 27 | +rolex watches for men | ||
| 28 | +darts | ||
| 29 | +toddler christmas gifts | ||
| 30 | +the big bang theory | ||
| 31 | +ayliva | ||
| 32 | +motorrad zubehör | ||
| 33 | +sockenwolle | ||
| 34 | +gifts for men who have everything | ||
| 35 | +casio uhr | ||
| 36 | +fitness tracker | ||
| 37 | +weihnachtsgeschenke für frauen | ||
| 38 | +eye liner | ||
| 39 | +mini fan | ||
| 40 | +sarah connor | ||
| 41 | +yoga mat thick | ||
| 42 | +father's day | ||
| 43 | +barbies | ||
| 44 | +gifts for 2 year old girls | ||
| 45 | +funny fathers day gifts | ||
| 46 | +der grinch | ||
| 47 | +fahrzeugschein hülle | ||
| 48 | +ptomely grey | ||
| 49 | +apple watch | ||
| 50 | +dragon ball | ||
| 51 | +golf bags for men | ||
| 52 | +friday the 13th | ||
| 53 | +last of us | ||
| 54 | +mirror with lights | ||
| 55 | +borat | ||
| 56 | +lustige geschenke | ||
| 57 | +stitch adventskalender | ||
| 58 | +withings scanwatch 2 | ||
| 59 | +taylor swift gifts | ||
| 60 | +ghostbusters | ||
| 61 | +best organization essentials | ||
| 62 | +action figures | ||
| 63 | +gifts for 4 year old girl | ||
| 64 | +toothpaste | ||
| 65 | +kubotan | ||
| 66 | +faultier | ||
| 67 | +capybara plush | ||
| 68 | +instant camera | ||
| 69 | +stitch sachen | ||
| 70 | +whisper ultra xl plus | ||
| 71 | +cookies | ||
| 72 | +gas mask | ||
| 73 | +mothers day gifts for daughter | ||
| 74 | +hochzeit | ||
| 75 | +aura ring | ||
| 76 | +rollschuhe | ||
| 77 | +guarda chuva | ||
| 78 | +the goonies | ||
| 79 | +pocket pussies | ||
| 80 | +stanley cup 40 oz | ||
| 81 | +digital calendar | ||
| 82 | +ぼーん | ||
| 83 | +phone stand | ||
| 84 | +pacifier | ||
| 85 | +gifts for teen boys | ||
| 86 | +sonic toys | ||
| 87 | +kitchen sink | ||
| 88 | +fourth of july deals | ||
| 89 | +joop homme | ||
| 90 | +baby essentials | ||
| 91 | +male sex toy | ||
| 92 | +supernatural | ||
| 93 | +kids watch | ||
| 94 | +retirement gifts for men | ||
| 95 | +helikon tex | ||
| 96 | +christmas gifts for grandkids | ||
| 97 | +shopping cart cover for baby | ||
| 98 | +sneaker balls | ||
| 99 | +bedroom decor | ||
| 100 | +herren uhr | ||
| 101 | +the shooting of charlie kirk | ||
| 102 | +vape | ||
| 103 | +brinquedo menina | ||
| 104 | +nascar | ||
| 105 | +cruise essentials | ||
| 106 | +shaun das schaf | ||
| 107 | +star wars lego | ||
| 108 | +geschenk für mama | ||
| 109 | +black friday angebote 2025 ab wann | ||
| 110 | +marie antoinette | ||
| 111 | +teenage boy gifts | ||
| 112 | +gabbys dollhouse figuren | ||
| 113 | +jeep wrangler accessories | ||
| 114 | +graduation gifts for her | ||
| 115 | +sg cricket kit | ||
| 116 | +shibumi beach shade | ||
| 117 | +pilates board | ||
| 118 | +vorhängeschloss mit zahlencode | ||
| 119 | +olsenbande | ||
| 120 | +weihnachtssüßigkeiten | ||
| 121 | +pilates equipment | ||
| 122 | +smart watches for women | ||
| 123 | +michael kors uhr damen | ||
| 124 | +gifts for people who love baking | ||
| 125 | +corinthians | ||
| 126 | +razor | ||
| 127 | +regenschirm | ||
| 128 | +fidget toys | ||
| 129 | +iron man helmet | ||
| 130 | +christmas wreath | ||
| 131 | +corpes bride | ||
| 132 | +portable fan | ||
| 133 | +diane keaton | ||
| 134 | +softball bag | ||
| 135 | +apple watch ultra 2 | ||
| 136 | +jewelry organizers and storage | ||
| 137 | +dog man | ||
| 138 | +aperol | ||
| 139 | +canguru para bebe | ||
| 140 | +fishing lures | ||
| 141 | +miss mouths messy eater stain remover | ||
| 142 | +hydration backpack | ||
| 143 | +wärmegürtel | ||
| 144 | +golf balls | ||
| 145 | +itzy ritzy | ||
| 146 | +boba | ||
| 147 | +schwangerschaft | ||
| 148 | +window fan | ||
| 149 | +hand cream | ||
| 150 | +calculator | ||
| 151 | +twin peaks | ||
| 152 | +curb your enthusiasm | ||
| 153 | +anal plug | ||
| 154 | +scarface | ||
| 155 | +diet coke | ||
| 156 | +greys anatomy | ||
| 157 | +funny gifts | ||
| 158 | +hunting deals | ||
| 159 | +hair color for women | ||
| 160 | +labubu keychain | ||
| 161 | +geschenk frau | ||
| 162 | +gifts for people who are always cold | ||
| 163 | +back scratcher | ||
| 164 | +dinosaur | ||
| 165 | +ultraschallreiniger | ||
| 166 | +barbell | ||
| 167 | +pink room decor | ||
| 168 | +bateria cr2032 | ||
| 169 | +chicken jockey | ||
| 170 | +prime deals sale | ||
| 171 | +capybara | ||
| 172 | +stocking stuffers | ||
| 173 | +boo basket stuffers for women | ||
| 174 | +dresser for bedroom | ||
| 175 | +glasses cleaner | ||
| 176 | +berserk | ||
| 177 | +summer i turned preety | ||
| 178 | +boat accessories | ||
| 179 | +cheers | ||
| 180 | +pete the cat | ||
| 181 | +american cheese | ||
| 182 | +kitchen accessories | ||
| 183 | +travel size travel products | ||
| 184 | +wall shelf | ||
| 185 | +raquete beach tennis | ||
| 186 | +insider | ||
| 187 | +nightstand | ||
| 188 | +cash box | ||
| 189 | +cotton candy | ||
| 190 | +以下是列表中**不属于服饰鞋帽类**的搜索需求: | ||
| 191 | + | ||
| 192 | +ultrasonic jewelry cleaner | ||
| 193 | +roland kaiser | ||
| 194 | +camping ausrüstung | ||
| 195 | +transformers | ||
| 196 | +badminton | ||
| 197 | +burts bees | ||
| 198 | +gel nail polish remover | ||
| 199 | +thrive causemetics | ||
| 200 | +garmin uhr | ||
| 201 | +fathers day gift | ||
| 202 | +concealer | ||
| 203 | +shirt fan | ||
| 204 | +twisters | ||
| 205 | +pack n play | ||
| 206 | +balloonerism | ||
| 207 | +amazon outlet | ||
| 208 | +golf accessories | ||
| 209 | +running essentials | ||
| 210 | +felt | ||
| 211 | +new born baby essentials | ||
| 212 | +super kitties | ||
| 213 | +maria | ||
| 214 | +canvas | ||
| 215 | +transformers age of the primes | ||
| 216 | +tea pot | ||
| 217 | +rosary | ||
| 218 | +silverette nursing cups | ||
| 219 | +yeti camino 20 | ||
| 220 | +rolex watches for men | ||
| 221 | +darts | ||
| 222 | +toddler christmas gifts | ||
| 223 | +the big bang theory | ||
| 224 | +ayliva | ||
| 225 | +fersensporn einlagen | ||
| 226 | +motorrad zubehör | ||
| 227 | +meerjungfrau | ||
| 228 | +sockenwolle | ||
| 229 | +gifts for men who have everything | ||
| 230 | +fitness tracker | ||
| 231 | +eye liner | ||
| 232 | +mini fan | ||
| 233 | +sarah connor | ||
| 234 | +yoga mat thick | ||
| 235 | +father's day | ||
| 236 | +gifts for 2 year old girls | ||
| 237 | +der grinch | ||
| 238 | +fahrzeugschein hülle | ||
| 239 | +ptomely grey | ||
| 240 | +apple watch | ||
| 241 | +key chain | ||
| 242 | +gay | ||
| 243 | +dragon ball | ||
| 244 | +batman | ||
| 245 | +friday the 13th | ||
| 246 | +mirror with lights | ||
| 247 | +last of us | ||
| 248 | +borat | ||
| 249 | +golf gifts for men | ||
| 250 | +lustige geschenke | ||
| 251 | +stitch adventskalender | ||
| 252 | +withings scanwatch 2 | ||
| 253 | +taylor swift gifts | ||
| 254 | +ghostbusters | ||
| 255 | +best organization essentials | ||
| 256 | +action figures | ||
| 257 | +premium brand deals | ||
| 258 | +toothpaste | ||
| 259 | +kubotan | ||
| 260 | +faultier | ||
| 261 | +capybara plush | ||
| 262 | +instant camera | ||
| 263 | +golf deals | ||
| 264 | +cookies | ||
| 265 | +gas mask | ||
| 266 | +mothers day gifts for daughter | ||
| 267 | +hochzeit | ||
| 268 | +aura ring | ||
| 269 | +rollschuhe | ||
| 270 | +guarda chuva | ||
| 271 | +the goonies | ||
| 272 | +pocket pussies | ||
| 273 | +zudio | ||
| 274 | +basketball gifts | ||
| 275 | +stanley cup 40 oz | ||
| 276 | +digital calendar | ||
| 277 | +ぼーん | ||
| 278 | +pacifier | ||
| 279 | +phone stand | ||
| 280 | +kitchen sink | ||
| 281 | +sonic toys | ||
| 282 | +fourth of july deals | ||
| 283 | +male sex toy | ||
| 284 | +supernatural | ||
| 285 | +kids watch | ||
| 286 | +retirement gifts for men | ||
| 287 | +kette | ||
| 288 | +garmin watch | ||
| 289 | +christmas gifts for grandkids | ||
| 290 | +sneaker balls | ||
| 291 | +shopping cart cover for baby | ||
| 292 | +bedroom decor | ||
| 293 | +vape | ||
| 294 | +brinquedo menina | ||
| 295 | +cruise essentials | ||
| 296 | +nascar | ||
| 297 | +barbies | ||
| 298 | +star wars lego | ||
| 299 | +apple watch | ||
| 300 | +gabbys dollhouse figuren | ||
| 301 | +jeep wrangler accessories | ||
| 302 | +graduation gifts for her | ||
| 303 | +sg cricket kit | ||
| 304 | +shibumi beach shade | ||
| 305 | +pilates board | ||
| 306 | +vorhängeschloss mit zahlencode | ||
| 307 | +olsenbande | ||
| 308 | +weihnachtssüßigkeiten | ||
| 309 | +pilates equipment | ||
| 310 | +fidget toys | ||
| 311 | +iron man helmet | ||
| 312 | +christmas wreath | ||
| 313 | +corpes bride | ||
| 314 | +portable fan | ||
| 315 | +diane keaton | ||
| 316 | +softball bag | ||
| 317 | +aperol | ||
| 318 | +dog man | ||
| 319 | +fishing lures | ||
| 320 | +miss mouths messy eater stain remover | ||
| 321 | +tung tung tung sahur costume | ||
| 322 | +bogg bag charms | ||
| 323 | +anniversary gifts for men | ||
| 324 | +golf balls | ||
| 325 | +itzy ritzy | ||
| 326 | +boba | ||
| 327 | +window fan | ||
| 328 | +rumi kpop demon hunters costume | ||
| 329 | +hand cream | ||
| 330 | +calculator | ||
| 331 | +twin peaks | ||
| 332 | +turning point usa shirt | ||
| 333 | +curb your enthusiasm | ||
| 334 | +pandora | ||
| 335 | +kpop demon hunters backpack | ||
| 336 | +anal plug | ||
| 337 | +scarface | ||
| 338 | +diet coke | ||
| 339 | +greys anatomy | ||
| 340 | +hunting deals | ||
| 341 | +100 days of school costume | ||
| 342 | +hair color for women | ||
| 343 | +labubu keychain | ||
| 344 | +back scratcher | ||
| 345 | +dinosaur | ||
| 346 | +ultraschallreiniger | ||
| 347 | +barbell | ||
| 348 | +bateria cr2032 | ||
| 349 | +pink room decor | ||
| 350 | +chicken jockey | ||
| 351 | +prime deals sale | ||
| 352 | +capybara | ||
| 353 | +stocking stuffers | ||
| 354 | +the upside | ||
| 355 | +boo basket stuffers for women | ||
| 356 | +dresser for bedroom | ||
| 357 | +glasses cleaner | ||
| 358 | +berserk | ||
| 359 | +summer i turned preety | ||
| 360 | +boat accessories | ||
| 361 | +cheers | ||
| 362 | +human hair wig | ||
| 363 | +pete the cat | ||
| 364 | +american cheese | ||
| 365 | +kitchen accessories | ||
| 366 | +travel size travel products | ||
| 367 | +wall shelf | ||
| 368 | +insider | ||
| 369 | +nightstand | ||
| 370 | +cash box | ||
| 371 | +cotton candy |
scripts/evaluation/resume_coarse_fusion_tuning_long.sh
| @@ -29,6 +29,7 @@ fi | @@ -29,6 +29,7 @@ fi | ||
| 29 | MAX_EVALS="${MAX_EVALS:-36}" | 29 | MAX_EVALS="${MAX_EVALS:-36}" |
| 30 | BATCH_SIZE="${BATCH_SIZE:-3}" | 30 | BATCH_SIZE="${BATCH_SIZE:-3}" |
| 31 | CANDIDATE_POOL_SIZE="${CANDIDATE_POOL_SIZE:-512}" | 31 | CANDIDATE_POOL_SIZE="${CANDIDATE_POOL_SIZE:-512}" |
| 32 | +DATASET_ID="${REPO_EVAL_DATASET_ID:-core_queries}" | ||
| 32 | 33 | ||
| 33 | LAUNCH_DIR="artifacts/search_evaluation/tuning_launches" | 34 | LAUNCH_DIR="artifacts/search_evaluation/tuning_launches" |
| 34 | mkdir -p "${LAUNCH_DIR}" | 35 | mkdir -p "${LAUNCH_DIR}" |
| @@ -44,6 +45,7 @@ CMD=( | @@ -44,6 +45,7 @@ CMD=( | ||
| 44 | --search-space "${RUN_DIR}/search_space.yaml" | 45 | --search-space "${RUN_DIR}/search_space.yaml" |
| 45 | --seed-report artifacts/search_evaluation/batch_reports/batch_20260415T150754Z_00b6a8aa3d.md | 46 | --seed-report artifacts/search_evaluation/batch_reports/batch_20260415T150754Z_00b6a8aa3d.md |
| 46 | --tenant-id 163 | 47 | --tenant-id 163 |
| 48 | + --dataset-id "${DATASET_ID}" | ||
| 47 | --queries-file scripts/evaluation/queries/queries.txt | 49 | --queries-file scripts/evaluation/queries/queries.txt |
| 48 | --top-k 100 | 50 | --top-k 100 |
| 49 | --language en | 51 | --language en |
scripts/evaluation/run_coarse_fusion_tuning.sh
| @@ -10,6 +10,7 @@ python scripts/evaluation/tune_fusion.py \ | @@ -10,6 +10,7 @@ python scripts/evaluation/tune_fusion.py \ | ||
| 10 | --search-space scripts/evaluation/tuning/coarse_rank_fusion_space.yaml \ | 10 | --search-space scripts/evaluation/tuning/coarse_rank_fusion_space.yaml \ |
| 11 | --seed-report artifacts/search_evaluation/batch_reports/batch_20260415T150754Z_00b6a8aa3d.md \ | 11 | --seed-report artifacts/search_evaluation/batch_reports/batch_20260415T150754Z_00b6a8aa3d.md \ |
| 12 | --tenant-id 163 \ | 12 | --tenant-id 163 \ |
| 13 | + --dataset-id "${REPO_EVAL_DATASET_ID:-core_queries}" \ | ||
| 13 | --queries-file scripts/evaluation/queries/queries.txt \ | 14 | --queries-file scripts/evaluation/queries/queries.txt \ |
| 14 | --top-k 100 \ | 15 | --top-k 100 \ |
| 15 | --language en \ | 16 | --language en \ |
scripts/evaluation/start_coarse_fusion_tuning_long.sh
| @@ -10,6 +10,7 @@ MAX_EVALS="${MAX_EVALS:-36}" | @@ -10,6 +10,7 @@ MAX_EVALS="${MAX_EVALS:-36}" | ||
| 10 | BATCH_SIZE="${BATCH_SIZE:-3}" | 10 | BATCH_SIZE="${BATCH_SIZE:-3}" |
| 11 | CANDIDATE_POOL_SIZE="${CANDIDATE_POOL_SIZE:-512}" | 11 | CANDIDATE_POOL_SIZE="${CANDIDATE_POOL_SIZE:-512}" |
| 12 | RANDOM_SEED="${RANDOM_SEED:-20260416}" | 12 | RANDOM_SEED="${RANDOM_SEED:-20260416}" |
| 13 | +DATASET_ID="${REPO_EVAL_DATASET_ID:-core_queries}" | ||
| 13 | 14 | ||
| 14 | LAUNCH_DIR="artifacts/search_evaluation/tuning_launches" | 15 | LAUNCH_DIR="artifacts/search_evaluation/tuning_launches" |
| 15 | mkdir -p "${LAUNCH_DIR}" | 16 | mkdir -p "${LAUNCH_DIR}" |
| @@ -25,6 +26,7 @@ CMD=( | @@ -25,6 +26,7 @@ CMD=( | ||
| 25 | --search-space scripts/evaluation/tuning/coarse_rank_fusion_space.yaml | 26 | --search-space scripts/evaluation/tuning/coarse_rank_fusion_space.yaml |
| 26 | --seed-report artifacts/search_evaluation/batch_reports/batch_20260415T150754Z_00b6a8aa3d.md | 27 | --seed-report artifacts/search_evaluation/batch_reports/batch_20260415T150754Z_00b6a8aa3d.md |
| 27 | --tenant-id 163 | 28 | --tenant-id 163 |
| 29 | + --dataset-id "${DATASET_ID}" | ||
| 28 | --queries-file scripts/evaluation/queries/queries.txt | 30 | --queries-file scripts/evaluation/queries/queries.txt |
| 29 | --top-k 100 | 31 | --top-k 100 |
| 30 | --language en | 32 | --language en |
scripts/evaluation/start_eval.sh
| @@ -6,6 +6,7 @@ ROOT="$(cd "$(dirname "$0")/../.." && pwd)" | @@ -6,6 +6,7 @@ ROOT="$(cd "$(dirname "$0")/../.." && pwd)" | ||
| 6 | cd "$ROOT" | 6 | cd "$ROOT" |
| 7 | PY="${ROOT}/.venv/bin/python" | 7 | PY="${ROOT}/.venv/bin/python" |
| 8 | TENANT_ID="${TENANT_ID:-163}" | 8 | TENANT_ID="${TENANT_ID:-163}" |
| 9 | +DATASET_ID="${REPO_EVAL_DATASET_ID:-core_queries}" | ||
| 9 | QUERIES="${REPO_EVAL_QUERIES:-scripts/evaluation/queries/queries.txt}" | 10 | QUERIES="${REPO_EVAL_QUERIES:-scripts/evaluation/queries/queries.txt}" |
| 10 | 11 | ||
| 11 | usage() { | 12 | usage() { |
| @@ -13,13 +14,14 @@ usage() { | @@ -13,13 +14,14 @@ usage() { | ||
| 13 | echo " batch — batch eval: live search every query, LLM only for missing labels (top_k=50)" | 14 | echo " batch — batch eval: live search every query, LLM only for missing labels (top_k=50)" |
| 14 | echo " batch-rebuild — deep rebuild: build --force-refresh-labels (search recall pool + full-corpus rerank + batched LLM; expensive)" | 15 | echo " batch-rebuild — deep rebuild: build --force-refresh-labels (search recall pool + full-corpus rerank + batched LLM; expensive)" |
| 15 | echo " serve — eval UI (default http://0.0.0.0:\${EVAL_WEB_PORT:-6010}/; also: ./scripts/start_eval_web.sh)" | 16 | echo " serve — eval UI (default http://0.0.0.0:\${EVAL_WEB_PORT:-6010}/; also: ./scripts/start_eval_web.sh)" |
| 16 | - echo "Env: TENANT_ID (default 163), REPO_EVAL_QUERIES, EVAL_WEB_HOST, EVAL_WEB_PORT (default 6010)" | 17 | + echo "Env: TENANT_ID (default 163), REPO_EVAL_DATASET_ID (default core_queries), REPO_EVAL_QUERIES, EVAL_WEB_HOST, EVAL_WEB_PORT (default 6010)" |
| 17 | } | 18 | } |
| 18 | 19 | ||
| 19 | case "${1:-}" in | 20 | case "${1:-}" in |
| 20 | batch) | 21 | batch) |
| 21 | exec "$PY" scripts/evaluation/build_annotation_set.py batch \ | 22 | exec "$PY" scripts/evaluation/build_annotation_set.py batch \ |
| 22 | --tenant-id "$TENANT_ID" \ | 23 | --tenant-id "$TENANT_ID" \ |
| 24 | + --dataset-id "$DATASET_ID" \ | ||
| 23 | --queries-file "$QUERIES" \ | 25 | --queries-file "$QUERIES" \ |
| 24 | --top-k 50 \ | 26 | --top-k 50 \ |
| 25 | --language en | 27 | --language en |
| @@ -27,6 +29,7 @@ case "${1:-}" in | @@ -27,6 +29,7 @@ case "${1:-}" in | ||
| 27 | batch-rebuild) | 29 | batch-rebuild) |
| 28 | exec "$PY" scripts/evaluation/build_annotation_set.py build \ | 30 | exec "$PY" scripts/evaluation/build_annotation_set.py build \ |
| 29 | --tenant-id "$TENANT_ID" \ | 31 | --tenant-id "$TENANT_ID" \ |
| 32 | + --dataset-id "$DATASET_ID" \ | ||
| 30 | --queries-file "$QUERIES" \ | 33 | --queries-file "$QUERIES" \ |
| 31 | --search-depth 500 \ | 34 | --search-depth 500 \ |
| 32 | --rerank-depth 10000 \ | 35 | --rerank-depth 10000 \ |
| @@ -40,6 +43,7 @@ case "${1:-}" in | @@ -40,6 +43,7 @@ case "${1:-}" in | ||
| 40 | EVAL_WEB_HOST="${EVAL_WEB_HOST:-0.0.0.0}" | 43 | EVAL_WEB_HOST="${EVAL_WEB_HOST:-0.0.0.0}" |
| 41 | exec "$PY" scripts/evaluation/serve_eval_web.py serve \ | 44 | exec "$PY" scripts/evaluation/serve_eval_web.py serve \ |
| 42 | --tenant-id "$TENANT_ID" \ | 45 | --tenant-id "$TENANT_ID" \ |
| 46 | + --dataset-id "$DATASET_ID" \ | ||
| 43 | --queries-file "$QUERIES" \ | 47 | --queries-file "$QUERIES" \ |
| 44 | --host "$EVAL_WEB_HOST" \ | 48 | --host "$EVAL_WEB_HOST" \ |
| 45 | --port "$EVAL_WEB_PORT" | 49 | --port "$EVAL_WEB_PORT" |
scripts/evaluation/tune_fusion.py
| @@ -41,6 +41,7 @@ from scripts.evaluation.eval_framework import ( # noqa: E402 | @@ -41,6 +41,7 @@ from scripts.evaluation.eval_framework import ( # noqa: E402 | ||
| 41 | utc_now_iso, | 41 | utc_now_iso, |
| 42 | utc_timestamp, | 42 | utc_timestamp, |
| 43 | ) | 43 | ) |
| 44 | +from scripts.evaluation.eval_framework.datasets import resolve_dataset | ||
| 44 | 45 | ||
| 45 | 46 | ||
| 46 | CONFIG_PATH = PROJECT_ROOT / "config" / "config.yaml" | 47 | CONFIG_PATH = PROJECT_ROOT / "config" / "config.yaml" |
| @@ -373,6 +374,7 @@ def verify_backend_config(base_url: str, target_path: str, expected: Dict[str, A | @@ -373,6 +374,7 @@ def verify_backend_config(base_url: str, target_path: str, expected: Dict[str, A | ||
| 373 | def run_batch_eval( | 374 | def run_batch_eval( |
| 374 | *, | 375 | *, |
| 375 | tenant_id: str, | 376 | tenant_id: str, |
| 377 | + dataset_id: str | None, | ||
| 376 | queries_file: Path, | 378 | queries_file: Path, |
| 377 | top_k: int, | 379 | top_k: int, |
| 378 | language: str, | 380 | language: str, |
| @@ -384,13 +386,15 @@ def run_batch_eval( | @@ -384,13 +386,15 @@ def run_batch_eval( | ||
| 384 | "batch", | 386 | "batch", |
| 385 | "--tenant-id", | 387 | "--tenant-id", |
| 386 | str(tenant_id), | 388 | str(tenant_id), |
| 387 | - "--queries-file", | ||
| 388 | - str(queries_file), | ||
| 389 | "--top-k", | 389 | "--top-k", |
| 390 | str(top_k), | 390 | str(top_k), |
| 391 | "--language", | 391 | "--language", |
| 392 | language, | 392 | language, |
| 393 | ] | 393 | ] |
| 394 | + if dataset_id: | ||
| 395 | + cmd.extend(["--dataset-id", dataset_id]) | ||
| 396 | + else: | ||
| 397 | + cmd.extend(["--queries-file", str(queries_file)]) | ||
| 394 | if force_refresh_labels: | 398 | if force_refresh_labels: |
| 395 | cmd.append("--force-refresh-labels") | 399 | cmd.append("--force-refresh-labels") |
| 396 | completed = subprocess.run( | 400 | completed = subprocess.run( |
| @@ -406,16 +410,21 @@ def run_batch_eval( | @@ -406,16 +410,21 @@ def run_batch_eval( | ||
| 406 | if not batch_ids: | 410 | if not batch_ids: |
| 407 | raise RuntimeError(f"failed to parse batch output: {output[-2000:]}") | 411 | raise RuntimeError(f"failed to parse batch output: {output[-2000:]}") |
| 408 | batch_id = batch_ids[-1] | 412 | batch_id = batch_ids[-1] |
| 409 | - batch_json_path = DEFAULT_ARTIFACT_ROOT / "batch_reports" / f"{batch_id}.json" | 413 | + pattern = f"datasets/*/batch_reports/{batch_id}/report.json" |
| 414 | + matches = sorted(DEFAULT_ARTIFACT_ROOT.glob(pattern)) | ||
| 415 | + batch_json_path = matches[0] if matches else (DEFAULT_ARTIFACT_ROOT / "batch_reports" / f"{batch_id}.json") | ||
| 410 | if not batch_json_path.is_file(): | 416 | if not batch_json_path.is_file(): |
| 411 | raise RuntimeError(f"batch json not found after eval: {batch_json_path}") | 417 | raise RuntimeError(f"batch json not found after eval: {batch_json_path}") |
| 412 | payload = json.loads(batch_json_path.read_text(encoding="utf-8")) | 418 | payload = json.loads(batch_json_path.read_text(encoding="utf-8")) |
| 419 | + report_path = batch_json_path.with_name("report.md") | ||
| 420 | + if not report_path.is_file(): | ||
| 421 | + report_path = DEFAULT_ARTIFACT_ROOT / "batch_reports" / f"{batch_id}.md" | ||
| 413 | return { | 422 | return { |
| 414 | "batch_id": batch_id, | 423 | "batch_id": batch_id, |
| 415 | "payload": payload, | 424 | "payload": payload, |
| 416 | "raw_output": output, | 425 | "raw_output": output, |
| 417 | "batch_json_path": str(batch_json_path), | 426 | "batch_json_path": str(batch_json_path), |
| 418 | - "batch_report_path": str(DEFAULT_ARTIFACT_ROOT / "batch_reports" / f"{batch_id}.md"), | 427 | + "batch_report_path": str(report_path), |
| 419 | } | 428 | } |
| 420 | 429 | ||
| 421 | 430 | ||
| @@ -806,6 +815,8 @@ def render_markdown( | @@ -806,6 +815,8 @@ def render_markdown( | ||
| 806 | run_id: str, | 815 | run_id: str, |
| 807 | created_at: str, | 816 | created_at: str, |
| 808 | tenant_id: str, | 817 | tenant_id: str, |
| 818 | + dataset_id: str, | ||
| 819 | + dataset_name: str, | ||
| 809 | query_count: int, | 820 | query_count: int, |
| 810 | top_k: int, | 821 | top_k: int, |
| 811 | metric: str, | 822 | metric: str, |
| @@ -829,6 +840,8 @@ def render_markdown( | @@ -829,6 +840,8 @@ def render_markdown( | ||
| 829 | f"- Run ID: {run_id}", | 840 | f"- Run ID: {run_id}", |
| 830 | f"- Created at: {created_at}", | 841 | f"- Created at: {created_at}", |
| 831 | f"- Tenant ID: {tenant_id}", | 842 | f"- Tenant ID: {tenant_id}", |
| 843 | + f"- Dataset ID: {dataset_id}", | ||
| 844 | + f"- Dataset Name: {dataset_name}", | ||
| 832 | f"- Query count: {query_count}", | 845 | f"- Query count: {query_count}", |
| 833 | f"- Top K: {top_k}", | 846 | f"- Top K: {top_k}", |
| 834 | f"- Score metric: {metric}", | 847 | f"- Score metric: {metric}", |
| @@ -941,6 +954,8 @@ def persist_run_summary( | @@ -941,6 +954,8 @@ def persist_run_summary( | ||
| 941 | run_dir: Path, | 954 | run_dir: Path, |
| 942 | run_id: str, | 955 | run_id: str, |
| 943 | tenant_id: str, | 956 | tenant_id: str, |
| 957 | + dataset_id: str, | ||
| 958 | + dataset_name: str, | ||
| 944 | query_count: int, | 959 | query_count: int, |
| 945 | top_k: int, | 960 | top_k: int, |
| 946 | metric: str, | 961 | metric: str, |
| @@ -951,6 +966,8 @@ def persist_run_summary( | @@ -951,6 +966,8 @@ def persist_run_summary( | ||
| 951 | "run_id": run_id, | 966 | "run_id": run_id, |
| 952 | "created_at": utc_now_iso(), | 967 | "created_at": utc_now_iso(), |
| 953 | "tenant_id": tenant_id, | 968 | "tenant_id": tenant_id, |
| 969 | + "dataset_id": dataset_id, | ||
| 970 | + "dataset_name": dataset_name, | ||
| 954 | "query_count": query_count, | 971 | "query_count": query_count, |
| 955 | "top_k": top_k, | 972 | "top_k": top_k, |
| 956 | "score_metric": metric, | 973 | "score_metric": metric, |
| @@ -965,6 +982,8 @@ def persist_run_summary( | @@ -965,6 +982,8 @@ def persist_run_summary( | ||
| 965 | run_id=run_id, | 982 | run_id=run_id, |
| 966 | created_at=summary["created_at"], | 983 | created_at=summary["created_at"], |
| 967 | tenant_id=tenant_id, | 984 | tenant_id=tenant_id, |
| 985 | + dataset_id=dataset_id, | ||
| 986 | + dataset_name=dataset_name, | ||
| 968 | query_count=query_count, | 987 | query_count=query_count, |
| 969 | top_k=top_k, | 988 | top_k=top_k, |
| 970 | metric=metric, | 989 | metric=metric, |
| @@ -976,8 +995,18 @@ def persist_run_summary( | @@ -976,8 +995,18 @@ def persist_run_summary( | ||
| 976 | 995 | ||
| 977 | 996 | ||
| 978 | def run_experiment_mode(args: argparse.Namespace) -> None: | 997 | def run_experiment_mode(args: argparse.Namespace) -> None: |
| 979 | - queries_file = Path(args.queries_file) | ||
| 980 | - queries = read_queries(queries_file) | 998 | + dataset = resolve_dataset( |
| 999 | + dataset_id=getattr(args, "dataset_id", None), | ||
| 1000 | + query_file=Path(args.queries_file).resolve() if getattr(args, "queries_file", None) else None, | ||
| 1001 | + tenant_id=str(args.tenant_id), | ||
| 1002 | + language=str(args.language), | ||
| 1003 | + ) | ||
| 1004 | + args.dataset_id = dataset.dataset_id | ||
| 1005 | + args.queries_file = str(dataset.query_file) | ||
| 1006 | + args.tenant_id = dataset.tenant_id | ||
| 1007 | + args.language = dataset.language | ||
| 1008 | + queries_file = dataset.query_file | ||
| 1009 | + queries = list(dataset.queries) | ||
| 981 | base_config_text = CONFIG_PATH.read_text(encoding="utf-8") | 1010 | base_config_text = CONFIG_PATH.read_text(encoding="utf-8") |
| 982 | base_config = load_yaml(CONFIG_PATH) | 1011 | base_config = load_yaml(CONFIG_PATH) |
| 983 | experiments = load_experiments(Path(args.experiments_file)) | 1012 | experiments = load_experiments(Path(args.experiments_file)) |
| @@ -1012,6 +1041,7 @@ def run_experiment_mode(args: argparse.Namespace) -> None: | @@ -1012,6 +1041,7 @@ def run_experiment_mode(args: argparse.Namespace) -> None: | ||
| 1012 | ) | 1041 | ) |
| 1013 | batch_result = run_batch_eval( | 1042 | batch_result = run_batch_eval( |
| 1014 | tenant_id=args.tenant_id, | 1043 | tenant_id=args.tenant_id, |
| 1044 | + dataset_id=args.dataset_id, | ||
| 1015 | queries_file=queries_file, | 1045 | queries_file=queries_file, |
| 1016 | top_k=args.top_k, | 1046 | top_k=args.top_k, |
| 1017 | language=args.language, | 1047 | language=args.language, |
| @@ -1064,6 +1094,8 @@ def run_experiment_mode(args: argparse.Namespace) -> None: | @@ -1064,6 +1094,8 @@ def run_experiment_mode(args: argparse.Namespace) -> None: | ||
| 1064 | run_dir=run_dir, | 1094 | run_dir=run_dir, |
| 1065 | run_id=run_id, | 1095 | run_id=run_id, |
| 1066 | tenant_id=str(args.tenant_id), | 1096 | tenant_id=str(args.tenant_id), |
| 1097 | + dataset_id=str(args.dataset_id), | ||
| 1098 | + dataset_name=dataset.display_name, | ||
| 1067 | query_count=len(queries), | 1099 | query_count=len(queries), |
| 1068 | top_k=args.top_k, | 1100 | top_k=args.top_k, |
| 1069 | metric=args.score_metric, | 1101 | metric=args.score_metric, |
| @@ -1075,8 +1107,18 @@ def run_experiment_mode(args: argparse.Namespace) -> None: | @@ -1075,8 +1107,18 @@ def run_experiment_mode(args: argparse.Namespace) -> None: | ||
| 1075 | 1107 | ||
| 1076 | 1108 | ||
| 1077 | def run_optimize_mode(args: argparse.Namespace) -> None: | 1109 | def run_optimize_mode(args: argparse.Namespace) -> None: |
| 1078 | - queries_file = Path(args.queries_file) | ||
| 1079 | - queries = read_queries(queries_file) | 1110 | + dataset = resolve_dataset( |
| 1111 | + dataset_id=getattr(args, "dataset_id", None), | ||
| 1112 | + query_file=Path(args.queries_file).resolve() if getattr(args, "queries_file", None) else None, | ||
| 1113 | + tenant_id=str(args.tenant_id), | ||
| 1114 | + language=str(args.language), | ||
| 1115 | + ) | ||
| 1116 | + args.dataset_id = dataset.dataset_id | ||
| 1117 | + args.queries_file = str(dataset.query_file) | ||
| 1118 | + args.tenant_id = dataset.tenant_id | ||
| 1119 | + args.language = dataset.language | ||
| 1120 | + queries_file = dataset.query_file | ||
| 1121 | + queries = list(dataset.queries) | ||
| 1080 | base_config_text = CONFIG_PATH.read_text(encoding="utf-8") | 1122 | base_config_text = CONFIG_PATH.read_text(encoding="utf-8") |
| 1081 | base_config = load_yaml(CONFIG_PATH) | 1123 | base_config = load_yaml(CONFIG_PATH) |
| 1082 | search_space_path = Path(args.search_space) | 1124 | search_space_path = Path(args.search_space) |
| @@ -1101,6 +1143,11 @@ def run_optimize_mode(args: argparse.Namespace) -> None: | @@ -1101,6 +1143,11 @@ def run_optimize_mode(args: argparse.Namespace) -> None: | ||
| 1101 | baseline_key = space.canonical_key(baseline_params) | 1143 | baseline_key = space.canonical_key(baseline_params) |
| 1102 | if baseline_key not in {space.canonical_key(item["params"]) for item in trials if item.get("params")}: | 1144 | if baseline_key not in {space.canonical_key(item["params"]) for item in trials if item.get("params")}: |
| 1103 | payload = load_batch_payload(args.seed_report) | 1145 | payload = load_batch_payload(args.seed_report) |
| 1146 | + payload_dataset_id = str(((payload.get("dataset") or {}).get("dataset_id")) or "") | ||
| 1147 | + if payload_dataset_id and payload_dataset_id != str(args.dataset_id): | ||
| 1148 | + raise RuntimeError( | ||
| 1149 | + f"seed report dataset mismatch: expected={args.dataset_id} actual={payload_dataset_id}" | ||
| 1150 | + ) | ||
| 1104 | trial = { | 1151 | trial = { |
| 1105 | "trial_id": next_trial_name(trials, "trial"), | 1152 | "trial_id": next_trial_name(trials, "trial"), |
| 1106 | "name": "seed_baseline", | 1153 | "name": "seed_baseline", |
| @@ -1169,6 +1216,7 @@ def run_optimize_mode(args: argparse.Namespace) -> None: | @@ -1169,6 +1216,7 @@ def run_optimize_mode(args: argparse.Namespace) -> None: | ||
| 1169 | ) | 1216 | ) |
| 1170 | batch_result = run_batch_eval( | 1217 | batch_result = run_batch_eval( |
| 1171 | tenant_id=args.tenant_id, | 1218 | tenant_id=args.tenant_id, |
| 1219 | + dataset_id=args.dataset_id, | ||
| 1172 | queries_file=queries_file, | 1220 | queries_file=queries_file, |
| 1173 | top_k=args.top_k, | 1221 | top_k=args.top_k, |
| 1174 | language=args.language, | 1222 | language=args.language, |
| @@ -1236,6 +1284,8 @@ def run_optimize_mode(args: argparse.Namespace) -> None: | @@ -1236,6 +1284,8 @@ def run_optimize_mode(args: argparse.Namespace) -> None: | ||
| 1236 | run_dir=run_dir, | 1284 | run_dir=run_dir, |
| 1237 | run_id=run_id, | 1285 | run_id=run_id, |
| 1238 | tenant_id=str(args.tenant_id), | 1286 | tenant_id=str(args.tenant_id), |
| 1287 | + dataset_id=str(args.dataset_id), | ||
| 1288 | + dataset_name=dataset.display_name, | ||
| 1239 | query_count=len(queries), | 1289 | query_count=len(queries), |
| 1240 | top_k=args.top_k, | 1290 | top_k=args.top_k, |
| 1241 | metric=args.score_metric, | 1291 | metric=args.score_metric, |
| @@ -1268,6 +1318,8 @@ def run_optimize_mode(args: argparse.Namespace) -> None: | @@ -1268,6 +1318,8 @@ def run_optimize_mode(args: argparse.Namespace) -> None: | ||
| 1268 | run_dir=run_dir, | 1318 | run_dir=run_dir, |
| 1269 | run_id=run_id, | 1319 | run_id=run_id, |
| 1270 | tenant_id=str(args.tenant_id), | 1320 | tenant_id=str(args.tenant_id), |
| 1321 | + dataset_id=str(args.dataset_id), | ||
| 1322 | + dataset_name=dataset.display_name, | ||
| 1271 | query_count=len(queries), | 1323 | query_count=len(queries), |
| 1272 | top_k=args.top_k, | 1324 | top_k=args.top_k, |
| 1273 | metric=args.score_metric, | 1325 | metric=args.score_metric, |
| @@ -1286,6 +1338,7 @@ def build_parser() -> argparse.ArgumentParser: | @@ -1286,6 +1338,7 @@ def build_parser() -> argparse.ArgumentParser: | ||
| 1286 | ) | 1338 | ) |
| 1287 | parser.add_argument("--mode", choices=["optimize", "experiments"], default="optimize") | 1339 | parser.add_argument("--mode", choices=["optimize", "experiments"], default="optimize") |
| 1288 | parser.add_argument("--tenant-id", default="163") | 1340 | parser.add_argument("--tenant-id", default="163") |
| 1341 | + parser.add_argument("--dataset-id", default="core_queries") | ||
| 1289 | parser.add_argument("--queries-file", default=str(DEFAULT_QUERY_FILE)) | 1342 | parser.add_argument("--queries-file", default=str(DEFAULT_QUERY_FILE)) |
| 1290 | parser.add_argument("--top-k", type=int, default=100) | 1343 | parser.add_argument("--top-k", type=int, default=100) |
| 1291 | parser.add_argument("--language", default="en") | 1344 | parser.add_argument("--language", default="en") |
scripts/inspect/analyze_coarse_component_regression.py
0 → 100644
| @@ -0,0 +1,317 @@ | @@ -0,0 +1,317 @@ | ||
| 1 | +#!/usr/bin/env python3 | ||
| 2 | +""" | ||
| 3 | +Compare coarse-ranking score components between two indices for queries that regressed | ||
| 4 | +in evaluation reports. | ||
| 5 | + | ||
| 6 | +This script answers a narrower question than field diffing: | ||
| 7 | +for the documents that matter in worse queries, did the ranking move because of | ||
| 8 | +image KNN, text KNN, lexical/text score, or coarse-window recall? | ||
| 9 | + | ||
| 10 | +Typical usage: | ||
| 11 | + ./.venv/bin/python scripts/inspect/analyze_coarse_component_regression.py \ | ||
| 12 | + --current-report artifacts/search_evaluation/batch_reports/batch_20260417T073901Z_00b6a8aa3d.json \ | ||
| 13 | + --backup-report artifacts/search_evaluation/batch_reports/batch_20260417T074717Z_00b6a8aa3d.json \ | ||
| 14 | + --current-index search_products_tenant_163 \ | ||
| 15 | + --backup-index search_products_tenant_163_backup_20260415_1438 | ||
| 16 | +""" | ||
| 17 | + | ||
| 18 | +from __future__ import annotations | ||
| 19 | + | ||
| 20 | +import argparse | ||
| 21 | +import logging | ||
| 22 | +import os | ||
| 23 | +import statistics | ||
| 24 | +import sys | ||
| 25 | +from collections import Counter | ||
| 26 | +from pathlib import Path | ||
| 27 | +from typing import Any, Dict, Iterable, List, Sequence, Tuple | ||
| 28 | + | ||
| 29 | +PROJECT_ROOT = Path(__file__).resolve().parents[2] | ||
| 30 | +if str(PROJECT_ROOT) not in sys.path: | ||
| 31 | + sys.path.insert(0, str(PROJECT_ROOT)) | ||
| 32 | + | ||
| 33 | +from config import get_app_config | ||
| 34 | +from context.request_context import create_request_context | ||
| 35 | +from query import QueryParser | ||
| 36 | +from search import Searcher | ||
| 37 | +from utils.es_client import get_es_client_from_env | ||
| 38 | + | ||
| 39 | +from scripts.inspect.analyze_eval_index_regression import _load_report | ||
| 40 | + | ||
| 41 | + | ||
| 42 | +logger = logging.getLogger("coarse_component_regression") | ||
| 43 | + | ||
| 44 | + | ||
| 45 | +def _rank_map(rows: Sequence[Dict[str, Any]]) -> Dict[str, int]: | ||
| 46 | + return {str(row["spu_id"]): int(row["rank"]) for row in rows} | ||
| 47 | + | ||
| 48 | + | ||
| 49 | +def _collect_regressed_docs( | ||
| 50 | + current_report: Dict[str, Any], | ||
| 51 | + backup_report: Dict[str, Any], | ||
| 52 | + *, | ||
| 53 | + rank_gap_threshold: int, | ||
| 54 | + scan_depth: int, | ||
| 55 | +) -> Dict[str, List[Dict[str, Any]]]: | ||
| 56 | + current_per_query = {row["query"]: row for row in current_report["per_query"]} | ||
| 57 | + backup_per_query = {row["query"]: row for row in backup_report["per_query"]} | ||
| 58 | + grouped: Dict[str, List[Dict[str, Any]]] = {} | ||
| 59 | + for query, current_case in current_per_query.items(): | ||
| 60 | + backup_case = backup_per_query[query] | ||
| 61 | + delta = ( | ||
| 62 | + float(current_case["metrics"]["Primary_Metric_Score"]) | ||
| 63 | + - float(backup_case["metrics"]["Primary_Metric_Score"]) | ||
| 64 | + ) | ||
| 65 | + if delta >= 0: | ||
| 66 | + continue | ||
| 67 | + current_ranks = _rank_map(current_case["top_results"]) | ||
| 68 | + for row in backup_case["top_results"][:scan_depth]: | ||
| 69 | + if row["label"] not in {"Fully Relevant", "Mostly Relevant"}: | ||
| 70 | + continue | ||
| 71 | + current_rank = current_ranks.get(row["spu_id"], 999) | ||
| 72 | + if current_rank <= int(row["rank"]) + rank_gap_threshold: | ||
| 73 | + continue | ||
| 74 | + grouped.setdefault(query, []).append( | ||
| 75 | + { | ||
| 76 | + "query": query, | ||
| 77 | + "delta_primary": delta, | ||
| 78 | + "spu_id": str(row["spu_id"]), | ||
| 79 | + "backup_rank_eval": int(row["rank"]), | ||
| 80 | + "backup_label": str(row["label"]), | ||
| 81 | + "current_rank_eval": current_rank, | ||
| 82 | + } | ||
| 83 | + ) | ||
| 84 | + return grouped | ||
| 85 | + | ||
| 86 | + | ||
def _build_searcher() -> Searcher:
    """Construct a Searcher wired to the live ES client and the app's search config."""
    search_config = get_app_config().search
    client = get_es_client_from_env()
    parser = QueryParser(search_config)
    return Searcher(client, search_config, parser)
| 92 | + | ||
| 93 | + | ||
def _run_query(searcher: Searcher, *, query: str, tenant_id: str, index_name: str) -> Tuple[Dict[str, Dict[str, Any]], int]:
    """Run one live search against ``index_name`` and collect coarse-rank score rows.

    Returns a tuple of (doc_id -> coarse score row augmented with its
    1-based ``_coarse_rank`` position, total number of coarse rows seen).
    Rows without a ``doc_id`` are skipped.

    NOTE(review): the env override below persists for the whole process, so
    callers must set it again (as this helper does) before every query; there
    is no cleanup/restore.
    """
    os.environ[f"ES_INDEX_OVERRIDE_TENANT_{tenant_id}"] = index_name
    ctx = create_request_context(reqid="coarsecmp", uid="-1")
    # Reaches into a private attribute to silence/redirect request logging —
    # presumably RequestContext exposes no public logger setter; verify.
    ctx._logger = logger
    # Rerank disabled: this tool analyzes coarse-ranking components only.
    searcher.search(
        query=query,
        tenant_id=tenant_id,
        size=10,
        context=ctx,
        debug=True,
        enable_rerank=False,
        language="en",
    )
    # Intermediate result may be absent or None; normalize to an empty list.
    rows = ctx.get_intermediate_result("coarse_rank_scores", []) or []
    by_doc: Dict[str, Dict[str, Any]] = {}
    for rank, row in enumerate(rows, start=1):
        doc_id = row.get("doc_id")
        if doc_id is None:
            continue
        # Copy before annotating so the context's stored rows stay untouched.
        payload = dict(row)
        payload["_coarse_rank"] = rank
        by_doc[str(doc_id)] = payload
    return by_doc, len(rows)
| 117 | + | ||
| 118 | + | ||
| 119 | +def _safe_float(value: Any) -> float | None: | ||
| 120 | + try: | ||
| 121 | + if value is None: | ||
| 122 | + return None | ||
| 123 | + return float(value) | ||
| 124 | + except (TypeError, ValueError): | ||
| 125 | + return None | ||
| 126 | + | ||
| 127 | + | ||
def _delta(current_value: Any, backup_value: Any) -> float | None:
    """Difference (current - backup) when both sides coerce to float, else None."""
    current = _safe_float(current_value)
    if current is None:
        return None
    backup = _safe_float(backup_value)
    if backup is None:
        return None
    return current - backup
| 134 | + | ||
| 135 | + | ||
| 136 | +def _counter_key(delta_value: float | None, *, eps: float = 1e-6) -> str: | ||
| 137 | + if delta_value is None: | ||
| 138 | + return "missing" | ||
| 139 | + if abs(delta_value) <= eps: | ||
| 140 | + return "same" | ||
| 141 | + return "lower" if delta_value < 0 else "higher" | ||
| 142 | + | ||
| 143 | + | ||
| 144 | +def _median_or_none(values: Sequence[float]) -> float | None: | ||
| 145 | + if not values: | ||
| 146 | + return None | ||
| 147 | + return float(statistics.median(values)) | ||
| 148 | + | ||
| 149 | + | ||
def _summarize_rows(comparisons: Sequence[Dict[str, Any]]) -> None:
    """Print aggregate bucket counts and median per-component deltas for affected docs.

    Only docs present in both coarse windows contribute delta statistics;
    docs seen in a single window are just counted.
    """
    both_present: List[Dict[str, Any]] = []
    backup_only_count = 0
    current_only_count = 0
    for row in comparisons:
        has_current = row["current_row"] is not None
        has_backup = row["backup_row"] is not None
        if has_current and has_backup:
            both_present.append(row)
        elif has_backup:
            backup_only_count += 1
        elif has_current:
            current_only_count += 1

    # One bucket counter and one delta list per coarse-score component.
    components = ("image_knn_score", "text_knn_score", "text_score", "es_score", "coarse_score")
    buckets: Dict[str, Counter[str]] = {name: Counter() for name in components}
    deltas: Dict[str, List[float]] = {name: [] for name in components}

    for row in both_present:
        for name in components:
            component_delta = _delta(row["current_row"].get(name), row["backup_row"].get(name))
            buckets[name][_counter_key(component_delta)] += 1
            if component_delta is not None:
                deltas[name].append(component_delta)

    print("Coarse Component Summary")
    print("=" * 80)
    print(f"affected_docs: {len(comparisons)}")
    print(f"present_in_both_coarse_windows: {len(both_present)}")
    print(f"only_in_backup_coarse_window: {backup_only_count}")
    print(f"only_in_current_coarse_window: {current_only_count}")
    print()
    print(f"image_knn delta buckets: {dict(buckets['image_knn_score'])}")
    print(f"text_knn delta buckets : {dict(buckets['text_knn_score'])}")
    print(f"text_score delta buckets: {dict(buckets['text_score'])}")
    print(f"es_score delta buckets : {dict(buckets['es_score'])}")
    print(f"coarse_score buckets : {dict(buckets['coarse_score'])}")
    print()
    print(
        "median deltas (current - backup): "
        f"image_knn={_median_or_none(deltas['image_knn_score'])} | "
        f"text_knn={_median_or_none(deltas['text_knn_score'])} | "
        f"text_score={_median_or_none(deltas['text_score'])} | "
        f"es_score={_median_or_none(deltas['es_score'])} | "
        f"coarse_score={_median_or_none(deltas['coarse_score'])}"
    )
    print()
| 212 | + | ||
| 213 | + | ||
| 214 | +def _print_query_examples(comparisons: Sequence[Dict[str, Any]], top_queries: int, docs_per_query: int) -> None: | ||
| 215 | + grouped: Dict[str, List[Dict[str, Any]]] = {} | ||
| 216 | + for row in comparisons: | ||
| 217 | + grouped.setdefault(row["query"], []).append(row) | ||
| 218 | + | ||
| 219 | + ordered_queries = sorted( | ||
| 220 | + grouped, | ||
| 221 | + key=lambda query: min(item["delta_primary"] for item in grouped[query]), | ||
| 222 | + ) | ||
| 223 | + | ||
| 224 | + print(f"Detailed Examples (top {top_queries} queries)") | ||
| 225 | + print("=" * 80) | ||
| 226 | + for query in ordered_queries[:top_queries]: | ||
| 227 | + rows = sorted(grouped[query], key=lambda item: item["backup_rank_eval"]) | ||
| 228 | + print(f"\n## {query}") | ||
| 229 | + print(f"affected_docs={len(rows)} | delta_primary={rows[0]['delta_primary']:+.6f}") | ||
| 230 | + for row in rows[:docs_per_query]: | ||
| 231 | + current_row = row["current_row"] | ||
| 232 | + backup_row = row["backup_row"] | ||
| 233 | + print( | ||
| 234 | + f" - spu={row['spu_id']} " | ||
| 235 | + f"eval_current={row['current_rank_eval']} eval_backup={row['backup_rank_eval']} " | ||
| 236 | + f"coarse_current={current_row.get('_coarse_rank') if current_row else None} " | ||
| 237 | + f"coarse_backup={backup_row.get('_coarse_rank') if backup_row else None}" | ||
| 238 | + ) | ||
| 239 | + if current_row and backup_row: | ||
| 240 | + print( | ||
| 241 | + " image_knn " | ||
| 242 | + f"{backup_row.get('image_knn_score')} -> {current_row.get('image_knn_score')} | " | ||
| 243 | + "text_knn " | ||
| 244 | + f"{backup_row.get('text_knn_score')} -> {current_row.get('text_knn_score')} | " | ||
| 245 | + "text_score " | ||
| 246 | + f"{backup_row.get('text_score')} -> {current_row.get('text_score')} | " | ||
| 247 | + "es_score " | ||
| 248 | + f"{backup_row.get('es_score')} -> {current_row.get('es_score')} | " | ||
| 249 | + "coarse_score " | ||
| 250 | + f"{backup_row.get('coarse_score')} -> {current_row.get('coarse_score')}" | ||
| 251 | + ) | ||
| 252 | + else: | ||
| 253 | + print( | ||
| 254 | + f" present_current={current_row is not None} " | ||
| 255 | + f"present_backup={backup_row is not None}" | ||
| 256 | + ) | ||
| 257 | + | ||
| 258 | + | ||
def main() -> None:
    """CLI entry point: trace eval regressions back to coarse-score components.

    Loads two batch evaluation reports, collects the docs that regressed,
    replays every affected query live against both indices, and prints a
    component-level summary plus detailed per-query examples.
    """
    parser = argparse.ArgumentParser(description="Analyze coarse-score component regressions")
    parser.add_argument("--current-report", required=True)
    parser.add_argument("--backup-report", required=True)
    parser.add_argument("--current-index", required=True)
    parser.add_argument("--backup-index", required=True)
    parser.add_argument("--tenant-id", default="163")
    parser.add_argument("--rank-gap-threshold", type=int, default=5)
    parser.add_argument("--scan-depth", type=int, default=20)
    parser.add_argument("--detail-queries", type=int, default=6)
    parser.add_argument("--detail-docs-per-query", type=int, default=3)
    args = parser.parse_args()

    # Keep library logging quiet so the printed report stays readable.
    logging.basicConfig(level=logging.WARNING)

    current_report = _load_report(args.current_report)
    backup_report = _load_report(args.backup_report)
    regressed = _collect_regressed_docs(
        current_report=current_report,
        backup_report=backup_report,
        rank_gap_threshold=args.rank_gap_threshold,
        scan_depth=args.scan_depth,
    )

    searcher = _build_searcher()
    comparisons: List[Dict[str, Any]] = []

    # Replay each regressed query against both indices (current first, then
    # backup — _run_query swaps the process-wide index override each time)
    # and join the live coarse-score rows back onto the regressed docs.
    for query, rows in regressed.items():
        current_by_doc, _ = _run_query(
            searcher,
            query=query,
            tenant_id=args.tenant_id,
            index_name=args.current_index,
        )
        backup_by_doc, _ = _run_query(
            searcher,
            query=query,
            tenant_id=args.tenant_id,
            index_name=args.backup_index,
        )
        for row in rows:
            comparisons.append(
                {
                    **row,
                    # None when the doc never entered that index's coarse window.
                    "current_row": current_by_doc.get(row["spu_id"]),
                    "backup_row": backup_by_doc.get(row["spu_id"]),
                }
            )

    _summarize_rows(comparisons)
    _print_query_examples(
        comparisons,
        top_queries=args.detail_queries,
        docs_per_query=args.detail_docs_per_query,
    )


if __name__ == "__main__":
    main()
| @@ -0,0 +1,337 @@ | @@ -0,0 +1,337 @@ | ||
| 1 | +#!/usr/bin/env python3 | ||
| 2 | +""" | ||
| 3 | +Analyze search evaluation regressions between two batch reports and trace them back | ||
| 4 | +to document field changes across two Elasticsearch indices. | ||
| 5 | + | ||
| 6 | +Typical usage: | ||
| 7 | + ./.venv/bin/python scripts/inspect/analyze_eval_index_regression.py \ | ||
| 8 | + --current-report artifacts/search_evaluation/batch_reports/batch_20260417T073901Z_00b6a8aa3d.json \ | ||
| 9 | + --backup-report artifacts/search_evaluation/batch_reports/batch_20260417T074717Z_00b6a8aa3d.json \ | ||
| 10 | + --current-index search_products_tenant_163 \ | ||
| 11 | + --backup-index search_products_tenant_163_backup_20260415_1438 | ||
| 12 | +""" | ||
| 13 | + | ||
| 14 | +from __future__ import annotations | ||
| 15 | + | ||
| 16 | +import argparse | ||
| 17 | +import json | ||
| 18 | +import statistics | ||
| 19 | +import sys | ||
| 20 | +from collections import Counter | ||
| 21 | +from pathlib import Path | ||
| 22 | +from typing import Any, Dict, Iterable, List, Optional, Sequence, Tuple | ||
| 23 | + | ||
| 24 | +PROJECT_ROOT = Path(__file__).resolve().parents[2] | ||
| 25 | +if str(PROJECT_ROOT) not in sys.path: | ||
| 26 | + sys.path.insert(0, str(PROJECT_ROOT)) | ||
| 27 | + | ||
| 28 | +from utils.es_client import get_es_client_from_env | ||
| 29 | + | ||
| 30 | + | ||
# All _source fields fetched from ES per document; superset of the fields the
# diff logic inspects, so samples can be printed even for fields not compared.
SEARCHABLE_SOURCE_FIELDS: Sequence[str] = (
    "title",
    "keywords",
    "qanchors",
    "enriched_tags",
    "enriched_attributes",
    "option1_values",
    "option2_values",
    "option3_values",
    "tags",
    "category_path",
    "category_name_text",
)

# Subset of fields whose value differences between the two indices are counted
# and diffed as potential regression causes (see _changed_fields).
CORE_FIELDS_TO_COMPARE: Sequence[str] = (
    "title",
    "keywords",
    "qanchors",
    "enriched_tags",
    "enriched_attributes",
    "option1_values",
    "option2_values",
    "option3_values",
    "tags",
)

# Labels treated as "strong" relevance; only backup docs with these labels are
# traced when they drop in rank on the current index.
STRONG_LABELS = {"Fully Relevant", "Mostly Relevant"}
| 58 | + | ||
| 59 | + | ||
| 60 | +def _load_report(path: str) -> Dict[str, Any]: | ||
| 61 | + return json.loads(Path(path).read_text()) | ||
| 62 | + | ||
| 63 | + | ||
| 64 | +def _rank_map(rows: Sequence[Dict[str, Any]]) -> Dict[str, int]: | ||
| 65 | + return {str(row["spu_id"]): int(row["rank"]) for row in rows} | ||
| 66 | + | ||
| 67 | + | ||
| 68 | +def _label_map(rows: Sequence[Dict[str, Any]]) -> Dict[str, str]: | ||
| 69 | + return {str(row["spu_id"]): str(row["label"]) for row in rows} | ||
| 70 | + | ||
| 71 | + | ||
| 72 | +def _count_items(value: Any) -> int: | ||
| 73 | + if isinstance(value, list): | ||
| 74 | + return len(value) | ||
| 75 | + if isinstance(value, str): | ||
| 76 | + return len([x for x in value.split(",") if x.strip()]) | ||
| 77 | + return 0 | ||
| 78 | + | ||
| 79 | + | ||
| 80 | +def _json_short(value: Any, max_len: int = 220) -> str: | ||
| 81 | + payload = json.dumps(value, ensure_ascii=False, sort_keys=True) | ||
| 82 | + if len(payload) <= max_len: | ||
| 83 | + return payload | ||
| 84 | + return payload[: max_len - 3] + "..." | ||
| 85 | + | ||
| 86 | + | ||
class SourceFetcher:
    """Fetch and memoize per-SPU ``_source`` documents from ES indices.

    The cache is keyed by ``(index_name, spu_id)`` so the same doc is fetched
    at most once per index, even across summary/detail passes.
    """

    def __init__(self) -> None:
        self.es = get_es_client_from_env().client
        self._cache: Dict[Tuple[str, str], Optional[Dict[str, Any]]] = {}

    def fetch(self, index_name: str, spu_id: str) -> Optional[Dict[str, Any]]:
        """Return the first doc whose spu_id term matches, or None if absent."""
        cache_key = (index_name, spu_id)
        if cache_key not in self._cache:
            request = {
                "size": 1,
                "query": {"term": {"spu_id": spu_id}},
                "_source": ["spu_id", *SEARCHABLE_SOURCE_FIELDS],
            }
            matches = self.es.search(index=index_name, body=request)["hits"]["hits"]
            self._cache[cache_key] = matches[0]["_source"] if matches else None
        return self._cache[cache_key]
| 105 | + | ||
| 106 | + | ||
def _changed_fields(current_doc: Dict[str, Any], backup_doc: Dict[str, Any]) -> List[str]:
    """List the compared core fields whose values differ between the two docs."""
    differing: List[str] = []
    for field in CORE_FIELDS_TO_COMPARE:
        if current_doc.get(field) != backup_doc.get(field):
            differing.append(field)
    return differing
| 109 | + | ||
| 110 | + | ||
def _iter_regressed_docs(
    current_report: Dict[str, Any],
    backup_report: Dict[str, Any],
    rank_gap_threshold: int,
    scan_depth: int,
) -> Iterable[Dict[str, Any]]:
    """Yield strong-relevant backup docs that dropped in rank on regressed queries.

    A query is "regressed" when its Primary_Metric_Score is lower on the current
    index than on the backup index. Within those queries, a backup doc counts as
    regressed when its current rank exceeds its backup rank by more than
    ``rank_gap_threshold`` (docs missing from the current results are treated as
    rank 999). Only the first ``scan_depth`` backup results are inspected.
    """
    current_per_query = {row["query"]: row for row in current_report["per_query"]}
    backup_per_query = {row["query"]: row for row in backup_report["per_query"]}
    for query, current_case in current_per_query.items():
        backup_case = backup_per_query.get(query)
        if backup_case is None:
            # Query only exists in the current report; nothing to compare against.
            continue
        delta = (
            float(current_case["metrics"]["Primary_Metric_Score"])
            - float(backup_case["metrics"]["Primary_Metric_Score"])
        )
        if delta >= 0:
            continue
        current_ranks = _rank_map(current_case["top_results"])
        current_labels = _label_map(current_case["top_results"])
        for row in backup_case["top_results"][:scan_depth]:
            if row["label"] not in STRONG_LABELS:
                continue
            # _rank_map/_label_map key by str(spu_id); the raw report may carry
            # spu_id as an int, so a raw lookup would always miss (bug fix).
            spu_id = str(row["spu_id"])
            current_rank = current_ranks.get(spu_id, 999)
            if current_rank <= int(row["rank"]) + rank_gap_threshold:
                continue
            yield {
                "query": query,
                "delta_primary": delta,
                "spu_id": spu_id,
                "backup_rank": int(row["rank"]),
                "backup_label": str(row["label"]),
                "current_rank": current_rank,
                "current_label": current_labels.get(spu_id),
            }
| 144 | + | ||
| 145 | + | ||
def _print_metric_summary(current_report: Dict[str, Any], backup_report: Dict[str, Any], top_n: int) -> None:
    """Print overall win/loss counts and the ``top_n`` worst queries by primary-metric delta.

    NOTE(review): assumes every query present in the current report also exists
    in the backup report; a missing query raises KeyError here — confirm the two
    reports always share the same query set.
    """
    current_per_query = {row["query"]: row for row in current_report["per_query"]}
    backup_per_query = {row["query"]: row for row in backup_report["per_query"]}
    # Tuples of (query, primary-metric delta, current case, backup case);
    # delta < 0 means the current index is worse for that query.
    deltas: List[Tuple[str, float, Dict[str, Any], Dict[str, Any]]] = []
    for query, current_case in current_per_query.items():
        backup_case = backup_per_query[query]
        deltas.append(
            (
                query,
                float(current_case["metrics"]["Primary_Metric_Score"])
                - float(backup_case["metrics"]["Primary_Metric_Score"]),
                current_case,
                backup_case,
            )
        )
    worse = sum(1 for _, delta, _, _ in deltas if delta < 0)
    better = sum(1 for _, delta, _, _ in deltas if delta > 0)
    print("Overall Query Delta")
    print("=" * 80)
    print(f"worse: {worse} | better: {better} | total: {len(deltas)}")
    print(
        "aggregate primary:"
        f" current={current_report['aggregate_metrics']['Primary_Metric_Score']:.6f}"
        f" backup={backup_report['aggregate_metrics']['Primary_Metric_Score']:.6f}"
        f" delta={current_report['aggregate_metrics']['Primary_Metric_Score'] - backup_report['aggregate_metrics']['Primary_Metric_Score']:+.6f}"
    )
    print()
    print(f"Worst {top_n} Queries By Primary_Metric_Score Delta")
    print("=" * 80)
    # Ascending by delta => most negative (worst regressions) first.
    for query, delta, current_case, backup_case in sorted(deltas, key=lambda x: x[1])[:top_n]:
        print(
            f"{delta:+.4f}\t{query}\t"
            f"NDCG@20 {current_case['metrics']['NDCG@20'] - backup_case['metrics']['NDCG@20']:+.4f}\t"
            f"ERR@10 {current_case['metrics']['ERR@10'] - backup_case['metrics']['ERR@10']:+.4f}\t"
            f"SP@10 {current_case['metrics']['Strong_Precision@10'] - backup_case['metrics']['Strong_Precision@10']:+.2f}"
        )
    print()
| 183 | + | ||
| 184 | + | ||
def _print_field_change_summary(
    regressed_rows: Sequence[Dict[str, Any]],
    fetcher: SourceFetcher,
    current_index: str,
    backup_index: str,
) -> None:
    """Print changed-field frequencies and phrase/tag density stats for regressed docs.

    For every regressed doc, fetch its _source from both indices, count which
    core fields differ, and compare qanchors/enriched_tags item counts per
    language (en/zh) between the two indices.
    """
    field_counter: Counter[str] = Counter()
    # Each pair is (current_count, backup_count) for one affected document.
    qanchor_counts_en: List[Tuple[int, int]] = []
    qanchor_counts_zh: List[Tuple[int, int]] = []
    tag_counts_en: List[Tuple[int, int]] = []
    tag_counts_zh: List[Tuple[int, int]] = []

    for row in regressed_rows:
        current_doc = fetcher.fetch(current_index, row["spu_id"])
        backup_doc = fetcher.fetch(backup_index, row["spu_id"])
        if not current_doc or not backup_doc:
            # Doc missing from either index: skip it for field statistics.
            continue
        for field in _changed_fields(current_doc, backup_doc):
            field_counter[field] += 1

        current_qanchors = current_doc.get("qanchors") or {}
        backup_qanchors = backup_doc.get("qanchors") or {}
        current_tags = current_doc.get("enriched_tags") or {}
        backup_tags = backup_doc.get("enriched_tags") or {}
        qanchor_counts_en.append((_count_items(current_qanchors.get("en")), _count_items(backup_qanchors.get("en"))))
        qanchor_counts_zh.append((_count_items(current_qanchors.get("zh")), _count_items(backup_qanchors.get("zh"))))
        tag_counts_en.append((_count_items(current_tags.get("en")), _count_items(backup_tags.get("en"))))
        tag_counts_zh.append((_count_items(current_tags.get("zh")), _count_items(backup_tags.get("zh"))))

    print("Affected Strong-Relevant Docs")
    print("=" * 80)
    print(f"count: {len(regressed_rows)}")
    print("changed field frequency:")
    for field, count in field_counter.most_common():
        print(f"  {field}: {count}")
    print()

    def summarize_counts(name: str, pairs: Sequence[Tuple[int, int]]) -> None:
        # Average counts per side plus how many docs have more items on each side.
        if not pairs:
            return
        current_counts = [current for current, _ in pairs]
        backup_counts = [backup for _, backup in pairs]
        print(
            f"{name}: current_avg={statistics.mean(current_counts):.3f} "
            f"backup_avg={statistics.mean(backup_counts):.3f} "
            f"delta={statistics.mean(current - backup for current, backup in pairs):+.3f} "
            f"backup_more={sum(1 for current, backup in pairs if backup > current)} "
            f"current_more={sum(1 for current, backup in pairs if current > backup)}"
        )

    print("phrase/tag density on affected docs:")
    summarize_counts("qanchors.en", qanchor_counts_en)
    summarize_counts("qanchors.zh", qanchor_counts_zh)
    summarize_counts("enriched_tags.en", tag_counts_en)
    summarize_counts("enriched_tags.zh", tag_counts_zh)
    print()
| 241 | + | ||
| 242 | + | ||
def _print_query_details(
    current_report: Dict[str, Any],
    backup_report: Dict[str, Any],
    regressed_rows: Sequence[Dict[str, Any]],
    fetcher: SourceFetcher,
    current_index: str,
    backup_index: str,
    top_queries: int,
    max_docs_per_query: int,
) -> None:
    """Print per-query detail: label sequences plus field diffs for regressed docs.

    Queries are ordered worst-first by primary-metric delta; within each query,
    up to ``max_docs_per_query`` regressed docs are shown with the fields that
    changed between the two indices and short JSON samples of their values.
    """
    current_per_query = {row["query"]: row for row in current_report["per_query"]}
    backup_per_query = {row["query"]: row for row in backup_report["per_query"]}
    # Group regressed docs by their query so we can print them together.
    grouped: Dict[str, List[Dict[str, Any]]] = {}
    for row in regressed_rows:
        grouped.setdefault(row["query"], []).append(row)

    ordered_queries = sorted(grouped, key=lambda q: current_per_query[q]["metrics"]["Primary_Metric_Score"] - backup_per_query[q]["metrics"]["Primary_Metric_Score"])

    print(f"Detailed Query Samples (top {top_queries})")
    print("=" * 80)
    for query in ordered_queries[:top_queries]:
        current_case = current_per_query[query]
        backup_case = backup_per_query[query]
        delta = current_case["metrics"]["Primary_Metric_Score"] - backup_case["metrics"]["Primary_Metric_Score"]
        print(f"\n## {query}")
        print(
            f"delta_primary={delta:+.6f} | current_top10={current_case['top_label_sequence_top10']} | "
            f"backup_top10={backup_case['top_label_sequence_top10']}"
        )
        # Best-ranked (in backup) regressed docs first.
        for row in sorted(grouped[query], key=lambda item: item["backup_rank"])[:max_docs_per_query]:
            current_doc = fetcher.fetch(current_index, row["spu_id"])
            backup_doc = fetcher.fetch(backup_index, row["spu_id"])
            if not current_doc or not backup_doc:
                print(
                    f"  - spu={row['spu_id']} backup_rank={row['backup_rank']} current_rank={row['current_rank']} "
                    "(missing source)"
                )
                continue
            changed = _changed_fields(current_doc, backup_doc)
            print(
                f"  - spu={row['spu_id']} backup_rank={row['backup_rank']} ({row['backup_label']}) "
                f"-> current_rank={row['current_rank']} ({row['current_label']})"
            )
            print(f"    changed_fields: {', '.join(changed) if changed else '(none)'}")
            # Limit value samples to the first four changed fields to keep output compact.
            for field in changed[:4]:
                print(f"      {field}.current: {_json_short(current_doc.get(field))}")
                print(f"      {field}.backup : {_json_short(backup_doc.get(field))}")
| 290 | + | ||
| 291 | + | ||
def main() -> None:
    """CLI entry point: load both reports, collect regressed docs, print all summaries."""
    parser = argparse.ArgumentParser(description="Analyze eval regressions between two indices")
    parser.add_argument("--current-report", required=True, help="Report JSON for the worse/current index")
    parser.add_argument("--backup-report", required=True, help="Report JSON for the better/reference index")
    parser.add_argument("--current-index", required=True, help="Current/worse index name")
    parser.add_argument("--backup-index", required=True, help="Reference/better index name")
    parser.add_argument("--rank-gap-threshold", type=int, default=5, help="Treat a strong-relevant doc as regressed when current rank > backup rank + this gap")
    parser.add_argument("--scan-depth", type=int, default=20, help="Only inspect backup strong-relevant docs within this depth")
    parser.add_argument("--top-worst-queries", type=int, default=12, help="How many worst queries to print in the metric summary")
    parser.add_argument("--detail-queries", type=int, default=6, help="How many regressed queries to print detailed field diffs for")
    parser.add_argument("--detail-docs-per-query", type=int, default=3, help="How many regressed docs to print per detailed query")
    args = parser.parse_args()

    current_report = _load_report(args.current_report)
    backup_report = _load_report(args.backup_report)
    fetcher = SourceFetcher()
    # Materialize the generator once; the rows are reused by both summaries below.
    regressed_rows = list(
        _iter_regressed_docs(
            current_report=current_report,
            backup_report=backup_report,
            rank_gap_threshold=args.rank_gap_threshold,
            scan_depth=args.scan_depth,
        )
    )

    _print_metric_summary(current_report, backup_report, top_n=args.top_worst_queries)
    _print_field_change_summary(
        regressed_rows=regressed_rows,
        fetcher=fetcher,
        current_index=args.current_index,
        backup_index=args.backup_index,
    )
    _print_query_details(
        current_report=current_report,
        backup_report=backup_report,
        regressed_rows=regressed_rows,
        fetcher=fetcher,
        current_index=args.current_index,
        backup_index=args.backup_index,
        top_queries=args.detail_queries,
        max_docs_per_query=args.detail_docs_per_query,
    )
| 334 | + | ||
| 335 | + | ||
| 336 | +if __name__ == "__main__": | ||
| 337 | + main() |
| @@ -0,0 +1,303 @@ | @@ -0,0 +1,303 @@ | ||
| 1 | +#!/usr/bin/env python3 | ||
| 2 | +""" | ||
| 3 | +Analyze per-query regressions between two batch evaluation JSON reports and | ||
| 4 | +attribute likely causes by inspecting ES documents from two indices. | ||
| 5 | + | ||
| 6 | +Outputs: | ||
| 7 | +- Top regressions by Primary_Metric_Score delta | ||
| 8 | +- For each regressed query: | ||
| 9 | + - metric deltas | ||
| 10 | + - top-10 SPU overlap and swapped-in SPUs | ||
| 11 | + - for swapped-in SPUs, show which search fields contain the query term | ||
| 12 | + | ||
| 13 | +This is a heuristic attribution tool (string containment), but it's fast and | ||
| 14 | +usually enough to pinpoint regressions caused by missing/noisy fields such as | ||
| 15 | +qanchors/keywords/title in different languages. | ||
| 16 | + | ||
| 17 | +Usage: | ||
| 18 | + set -a; source .env; set +a | ||
| 19 | + ./.venv/bin/python scripts/inspect/analyze_eval_regressions.py \ | ||
| 20 | + --old-report artifacts/search_evaluation/batch_reports/batch_...073901....json \ | ||
| 21 | + --new-report artifacts/search_evaluation/batch_reports/batch_...074717....json \ | ||
| 22 | + --old-index search_products_tenant_163 \ | ||
| 23 | + --new-index search_products_tenant_163_backup_20260415_1438 \ | ||
| 24 | + --top-n 10 | ||
| 25 | +""" | ||
| 26 | + | ||
| 27 | +from __future__ import annotations | ||
| 28 | + | ||
| 29 | +import argparse | ||
| 30 | +import json | ||
| 31 | +import os | ||
| 32 | +import re | ||
| 33 | +from dataclasses import dataclass | ||
| 34 | +from pathlib import Path | ||
| 35 | +from typing import Any, Dict, Iterable, List, Optional, Sequence, Tuple | ||
| 36 | + | ||
| 37 | +from elasticsearch import Elasticsearch | ||
| 38 | + | ||
| 39 | + | ||
def load_json(path: str) -> Dict[str, Any]:
    """Load a JSON file as a dict.

    ``Path.read_text()`` defaults to the locale's preferred encoding, which
    breaks on non-UTF-8 systems for reports containing CJK queries; pin UTF-8
    explicitly.
    """
    return json.loads(Path(path).read_text(encoding="utf-8"))
| 42 | + | ||
| 43 | + | ||
def norm_str(x: Any) -> str:
    """Coerce any value to a string; ``None`` becomes the empty string."""
    if x is None:
        return ""
    return x if isinstance(x, str) else str(x)
| 50 | + | ||
| 51 | + | ||
def is_cjk(s: str) -> bool:
    """True when ``s`` contains at least one CJK unified ideograph (U+4E00..U+9FFF)."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in s)
| 54 | + | ||
| 55 | + | ||
def flatten_text_values(v: Any) -> List[str]:
    """Extract strings from nested objects/lists (best-effort).

    Dicts are flattened over their values; lists are only traversed for their
    first 20 elements to bound the work; scalars are stringified.
    """
    if v is None:
        return []
    if isinstance(v, str):
        return [v]
    if isinstance(v, dict):
        collected: List[str] = []
        for child in v.values():
            collected.extend(flatten_text_values(child))
        return collected
    if isinstance(v, list):
        collected = []
        # Best-effort bound: only look at the first 20 elements of any list.
        for child in v[:20]:
            collected.extend(flatten_text_values(child))
        return collected
    return [str(v)]
| 74 | + | ||
| 75 | + | ||
def get_lang_obj(src: Dict[str, Any], field: str, lang: str) -> Any:
    """Return ``src[field][lang]`` when ``src[field]`` is a dict; otherwise None."""
    container = src.get(field)
    return container.get(lang) if isinstance(container, dict) else None
| 81 | + | ||
| 82 | + | ||
def contains_query(val: Any, query: str) -> bool:
    """True when the stripped query appears as a substring in any text of ``val``.

    Matching is case-insensitive for non-CJK queries and verbatim for CJK ones
    (case has no meaning for ideographs). Empty/blank queries never match.
    """
    needle = query.strip()
    if not needle:
        return False
    haystacks = flatten_text_values(val)
    if is_cjk(needle):
        return any(needle in text for text in haystacks)
    lowered = needle.lower()
    return any(lowered in (text or "").lower() for text in haystacks)
| 93 | + | ||
| 94 | + | ||
@dataclass
class PerQuery:
    """One per-query record extracted from a batch evaluation report."""

    # Raw query text; also the join key when comparing two reports.
    query: str
    # Numeric metrics only — non-numeric values are filtered out in per_query_map.
    metrics: Dict[str, float]
    # Ranked result rows as recorded under "top_results" in the report.
    top_results: List[Dict[str, Any]]
    # Search request id from the report, if recorded.
    request_id: Optional[str]
| 101 | + | ||
| 102 | + | ||
def per_query_map(report: Dict[str, Any]) -> Dict[str, PerQuery]:
    """Index a batch report's ``per_query`` records by query text.

    Records without a query are skipped; only numeric metric values are kept.
    """
    result: Dict[str, PerQuery] = {}
    for record in report.get("per_query") or []:
        query = record.get("query")
        if not query:
            continue
        numeric_metrics = {
            name: float(value)
            for name, value in (record.get("metrics") or {}).items()
            if isinstance(value, (int, float))
        }
        result[query] = PerQuery(
            query=query,
            metrics=numeric_metrics,
            top_results=list(record.get("top_results") or []),
            request_id=record.get("request_id"),
        )
    return result
| 117 | + | ||
| 118 | + | ||
def top_spus(pq: PerQuery, n: int = 10) -> List[str]:
    """Return the spu_ids (as strings) of the first ``n`` result rows, skipping missing ids."""
    return [
        str(row["spu_id"])
        for row in pq.top_results[:n]
        if row.get("spu_id") is not None
    ]
| 126 | + | ||
| 127 | + | ||
def build_es() -> Elasticsearch:
    """Create an ES client from ``ES``/``ES_HOST`` and optional ``ES_AUTH`` ("user:pass") env vars."""
    endpoint = os.environ.get("ES") or os.environ.get("ES_HOST") or "http://127.0.0.1:9200"
    credentials = os.environ.get("ES_AUTH")
    if credentials and ":" in credentials:
        username, password = credentials.split(":", 1)
        return Elasticsearch(hosts=[endpoint], basic_auth=(username, password))
    return Elasticsearch(hosts=[endpoint])
| 135 | + | ||
| 136 | + | ||
def mget_sources(es: Elasticsearch, index: str, ids: Sequence[str]) -> Dict[str, Dict[str, Any]]:
    """Fetch ``_source`` for the given ids via mget; only found docs with a dict source are returned."""
    response = es.mget(index=index, body={"ids": list(ids)})
    sources: Dict[str, Dict[str, Any]] = {}
    for doc in response.get("docs") or []:
        if not doc.get("found"):
            continue
        doc_id = doc.get("_id")
        source = doc.get("_source")
        if doc_id and isinstance(source, dict):
            sources[str(doc_id)] = source
    return sources
| 144 | + | ||
| 145 | + | ||
def non_empty(v: Any) -> bool:
    """True when ``v`` carries content.

    Blank strings, empty containers, and ``None`` are empty; a dict counts as
    non-empty only if some value in it is non-empty; everything else is content.
    """
    if v is None:
        return False
    if isinstance(v, str):
        return bool(v.strip())
    if isinstance(v, dict):
        return any(non_empty(inner) for inner in v.values())
    if isinstance(v, (list, tuple, set)):
        return len(v) > 0
    return True
| 156 | + | ||
| 157 | + | ||
def summarize_field(src: Dict[str, Any], field: str, lang: Optional[str]) -> Dict[str, Any]:
    """Summarize presence and a small sample for a field (optionally language-specific).

    When ``lang`` is given and the field holds a per-language dict, the summary
    describes only that language's value. Samples are capped: 80 chars for
    strings, 3 items for lists, 3 keys for dicts.
    """
    value = src.get(field)
    if lang and isinstance(value, dict):
        value = value.get(lang)
    sample: Any = None
    if isinstance(value, str):
        sample = value[:80]
    elif isinstance(value, list):
        sample = value[:3]
    elif isinstance(value, dict):
        sample = {key: value.get(key) for key in list(value.keys())[:3]}
    return {"present": non_empty(value), "sample": sample}
| 172 | + | ||
| 173 | + | ||
def main() -> int:
    """CLI entry point: rank regressions between two reports and attribute causes via ES doc diffs."""
    ap = argparse.ArgumentParser(description="Analyze regressions between two eval batch reports.")
    ap.add_argument("--old-report", required=True, help="Older/worse/baseline batch JSON path")
    ap.add_argument("--new-report", required=True, help="Newer candidate batch JSON path")
    ap.add_argument("--old-index", required=True, help="ES index used by old report")
    ap.add_argument("--new-index", required=True, help="ES index used by new report")
    ap.add_argument("--top-n", type=int, default=10, help="How many worst regressions to analyze (default 10)")
    ap.add_argument("--metric", default="Primary_Metric_Score", help="Metric to rank regressions by")
    ap.add_argument("--topk", type=int, default=10, help="Top-K results to compare per query (default 10)")
    args = ap.parse_args()

    old = load_json(args.old_report)
    new = load_json(args.new_report)
    old_map = per_query_map(old)
    new_map = per_query_map(new)

    metric = args.metric
    queries = list(new.get("queries") or old.get("queries") or [])

    # (query, new - old metric delta); queries missing from either report are skipped.
    deltas: List[Tuple[str, float]] = []
    for q in queries:
        o = old_map.get(q)
        n = new_map.get(q)
        if not o or not n:
            continue
        d = float(n.metrics.get(metric, 0.0)) - float(o.metrics.get(metric, 0.0))
        deltas.append((q, d))

    # Ascending: most negative deltas (worst regressions) first.
    deltas.sort(key=lambda x: x[1])
    worst = deltas[: args.top_n]

    print("=" * 100)
    print(f"Top {len(worst)} regressions by {metric} (new - old)")
    print("=" * 100)
    for q, d in worst:
        o = old_map[q]
        n = new_map[q]
        print(f"- {q}: {d:+.4f} old={o.metrics.get(metric, 0.0):.4f} -> new={n.metrics.get(metric, 0.0):.4f}")

    es = build_es()

    # Fields that matter according to config.yaml
    # (keep it aligned with multilingual_fields + best_fields/phrase_fields)
    inspect_fields = [
        "title",
        "keywords",
        "qanchors",
        "category_name_text",
        "vendor",
        "tags",
        "option1_values",
        "option2_values",
        "option3_values",
    ]

    print("\n" + "=" * 100)
    print("Heuristic attribution for worst regressions")
    print("=" * 100)

    for q, d in worst:
        o = old_map[q]
        n = new_map[q]
        old_spus = top_spus(o, args.topk)
        new_spus = top_spus(n, args.topk)
        old_set, new_set = set(old_spus), set(new_spus)
        # Docs that entered/left the top-K between the two runs.
        swapped_in = [s for s in new_spus if s not in old_set]
        swapped_out = [s for s in old_spus if s not in new_set]

        print("\n" + "-" * 100)
        print(f"Query: {q}")
        print(f"Delta {metric}: {d:+.4f}")
        # show a few key metrics
        for m in ["NDCG@20", "Strong_Precision@10", "Gain_Recall@20", "ERR@10"]:
            if m in o.metrics and m in n.metrics:
                print(f"  {m}: {n.metrics[m]-o.metrics[m]:+.4f} (old {o.metrics[m]:.4f} -> new {n.metrics[m]:.4f})")
        print(f"  old request_id={o.request_id} new request_id={n.request_id}")
        print(f"  top{args.topk} overlap: {len(old_set & new_set)}/{args.topk}")
        print(f"  swapped_in (new only): {swapped_in[:10]}")
        print(f"  swapped_out (old only): {swapped_out[:10]}")

        # Fetch swapped_in docs from both indices to spot index-field differences.
        if not swapped_in:
            continue
        docs_new = mget_sources(es, args.new_index, swapped_in)
        docs_old = mget_sources(es, args.old_index, swapped_in)

        # Guess query language to pick the right per-language sub-fields.
        lang = "zh" if is_cjk(q) else "en"
        print(f"  language_guess: {lang}")
        for spu in swapped_in[:8]:
            src_new = docs_new.get(spu) or {}
            src_old = docs_old.get(spu) or {}

            title = get_lang_obj(src_new, "title", lang) or get_lang_obj(src_new, "title", "en") or ""
            print(f"  - spu={spu} title≈{norm_str(title)[:60]!r}")

            presence_new = {f: summarize_field(src_new, f, lang) for f in inspect_fields}
            presence_old = {f: summarize_field(src_old, f, lang) for f in inspect_fields}

            # Fields populated on exactly one side — the main attribution signal.
            new_only = [f for f in inspect_fields if presence_new[f]["present"] and not presence_old[f]["present"]]
            old_only = [f for f in inspect_fields if presence_old[f]["present"] and not presence_new[f]["present"]]
            if new_only or old_only:
                print(f"    field_presence_diff: new_only={new_only} old_only={old_only}")

            # still report exact-substring match where it exists (often useful for English)
            hits = []
            for f in inspect_fields:
                v = get_lang_obj(src_new, f, lang)
                if v is None:
                    v = src_new.get(f)
                if contains_query(v, q):
                    hits.append(f)
            if hits:
                print(f"    exact_substring_matched_fields: {hits}")

            # compact samples for the most likely culprits
            for f in ["qanchors", "keywords", "title"]:
                pn = presence_new.get(f)
                po = presence_old.get(f)
                if pn and po and (pn["present"] or po["present"]):
                    print(
                        f"      {f}: new.present={pn['present']} old.present={po['present']} "
                        f"new.sample={pn['sample']} old.sample={po['sample']}"
                    )

    return 0
| 299 | + | ||
| 300 | + | ||
| 301 | +if __name__ == "__main__": | ||
| 302 | + raise SystemExit(main()) | ||
| 303 | + |
| @@ -0,0 +1,376 @@ | @@ -0,0 +1,376 @@ | ||
| 1 | +#!/usr/bin/env python3 | ||
| 2 | +""" | ||
| 3 | +Compare two Elasticsearch indices: | ||
| 4 | +- mapping structure (field paths + types) | ||
| 5 | +- field coverage stats (exists; nested-safe) | ||
| 6 | +- random sample documents (same _id) and diff _source field paths | ||
| 7 | + | ||
| 8 | +Usage: | ||
| 9 | + python scripts/inspect/compare_indices.py INDEX_A INDEX_B --sample-size 25 | ||
| 10 | + python scripts/inspect/compare_indices.py INDEX_A INDEX_B --fields title.zh,vendor.zh,keywords.zh,tags.zh --fields-nested image_embedding.url,enriched_attributes.name | ||
| 11 | +""" | ||
| 12 | + | ||
| 13 | +from __future__ import annotations | ||
| 14 | + | ||
| 15 | +import argparse | ||
| 16 | +import json | ||
| 17 | +import sys | ||
| 18 | +from dataclasses import dataclass | ||
| 19 | +from pathlib import Path | ||
| 20 | +from typing import Any, Dict, Iterable, List, Optional, Set, Tuple | ||
| 21 | + | ||
| 22 | +sys.path.insert(0, str(Path(__file__).resolve().parents[2])) | ||
| 23 | + | ||
| 24 | +from utils.es_client import ESClient, get_es_client_from_env | ||
| 25 | + | ||
| 26 | + | ||
| 27 | +def _walk_mapping_properties(props: Dict[str, Any], prefix: str = "") -> Dict[str, str]: | ||
| 28 | + """Flatten mapping properties into {field_path: type} including multi-fields.""" | ||
| 29 | + out: Dict[str, str] = {} | ||
| 30 | + for name, node in (props or {}).items(): | ||
| 31 | + path = f"{prefix}.{name}" if prefix else name | ||
| 32 | + if not isinstance(node, dict): | ||
| 33 | + out[path] = "unknown" | ||
| 34 | + continue | ||
| 35 | + out[path] = node.get("type") or "object" | ||
| 36 | + if isinstance(node.get("properties"), dict): | ||
| 37 | + out.update(_walk_mapping_properties(node["properties"], path)) | ||
| 38 | + if isinstance(node.get("fields"), dict): | ||
| 39 | + for sub, subnode in node["fields"].items(): | ||
| 40 | + if isinstance(subnode, dict): | ||
| 41 | + out[f"{path}.{sub}"] = subnode.get("type") or "object" | ||
| 42 | + else: | ||
| 43 | + out[f"{path}.{sub}"] = "unknown" | ||
| 44 | + return out | ||
| 45 | + | ||
| 46 | + | ||
| 47 | +def _get_top_level_field_type(mapping: Dict[str, Any], top_field: str) -> Optional[str]: | ||
| 48 | + props = mapping.get("mappings", {}).get("properties", {}) or {} | ||
| 49 | + node = props.get(top_field) | ||
| 50 | + if not isinstance(node, dict): | ||
| 51 | + return None | ||
| 52 | + return node.get("type") or "object" | ||
| 53 | + | ||
| 54 | + | ||
| 55 | +def _field_paths_from_source(obj: Any, prefix: str = "", list_depth: int = 3) -> Set[str]: | ||
| 56 | + """Return dotted field paths found in _source. For lists, uses '[]' marker.""" | ||
| 57 | + out: Set[str] = set() | ||
| 58 | + if isinstance(obj, dict): | ||
| 59 | + for k, v in obj.items(): | ||
| 60 | + p = f"{prefix}.{k}" if prefix else k | ||
| 61 | + out.add(p) | ||
| 62 | + out |= _field_paths_from_source(v, p, list_depth=list_depth) | ||
| 63 | + elif isinstance(obj, list): | ||
| 64 | + # Do not explode: just traverse first N elements | ||
| 65 | + for v in obj[:list_depth]: | ||
| 66 | + p = f"{prefix}[]" if prefix else "[]" | ||
| 67 | + out |= _field_paths_from_source(v, p, list_depth=list_depth) | ||
| 68 | + return out | ||
| 69 | + | ||
| 70 | + | ||
| 71 | +def _chunks(seq: List[str], size: int) -> Iterable[List[str]]: | ||
| 72 | + for i in range(0, len(seq), size): | ||
| 73 | + yield seq[i : i + size] | ||
| 74 | + | ||
| 75 | + | ||
@dataclass(frozen=True)
class CoverageField:
    """One field whose document coverage (exists-count) should be measured."""

    # Dotted field path, e.g. "image_embedding.url".
    field: str
    # If set, use nested query with this path (e.g. "image_embedding").
    nested_path: Optional[str] = None
| 81 | + | ||
| 82 | + | ||
def _infer_coverage_fields(
    mapping: Dict[str, Any],
    raw_fields: List[str],
    raw_nested_fields: List[str],
) -> List[CoverageField]:
    """
    Build coverage fields list. For fields in raw_nested_fields, always treat as nested
    and infer nested path as first segment.
    For raw_fields, auto-detect nested by checking mapping top-level field type.

    The result preserves the caller-supplied order (nested fields first,
    then plain fields) and drops duplicates.
    """
    out: List[CoverageField] = []

    # Membership set for the plain-field skip below. Iteration itself uses
    # the original list so output order is deterministic and follows user
    # input — a bare `set` has arbitrary iteration order, which previously
    # made the report's section order vary between runs.
    nested_set = {f.strip() for f in raw_nested_fields if f.strip()}
    seen_nested: Set[str] = set()
    for f in (x.strip() for x in raw_nested_fields):
        if not f or f in seen_nested:
            continue
        seen_nested.add(f)
        # Nested path is the first dotted segment, e.g. "image_embedding".
        out.append(CoverageField(field=f, nested_path=f.split(".", 1)[0]))

    for f in [x.strip() for x in raw_fields if x.strip()]:
        if f in nested_set:
            continue
        top = f.split(".", 1)[0]
        top_type = _get_top_level_field_type(mapping, top)
        if top_type == "nested":
            out.append(CoverageField(field=f, nested_path=top))
        else:
            out.append(CoverageField(field=f, nested_path=None))

    # Final order-preserving dedup guard (raw_fields may itself repeat).
    seen: Set[Tuple[str, Optional[str]]] = set()
    dedup: List[CoverageField] = []
    for cf in out:
        key = (cf.field, cf.nested_path)
        if key in seen:
            continue
        seen.add(key)
        dedup.append(cf)
    return dedup
| 120 | + | ||
| 121 | + | ||
| 122 | +def _count_exists(es, index: str, cf: CoverageField) -> int: | ||
| 123 | + """ | ||
| 124 | + Count docs where field exists. | ||
| 125 | + - If nested_path is set, uses nested query (safe for nested fields). | ||
| 126 | + - If nested query fails because path isn't actually nested in that index, | ||
| 127 | + fall back to a non-nested exists query to avoid crashing the whole report. | ||
| 128 | + """ | ||
| 129 | + if cf.nested_path: | ||
| 130 | + nested_body = { | ||
| 131 | + "query": { | ||
| 132 | + "nested": { | ||
| 133 | + "path": cf.nested_path, | ||
| 134 | + "query": {"exists": {"field": cf.field}}, | ||
| 135 | + } | ||
| 136 | + } | ||
| 137 | + } | ||
| 138 | + try: | ||
| 139 | + return int(es.count(index, body=nested_body)) | ||
| 140 | + except Exception as e: | ||
| 141 | + # Most common: "[nested] failed to find nested object under path [...]" | ||
| 142 | + print(f"[warn] nested exists failed for {index} field={cf.field} path={cf.nested_path}: {type(e).__name__}") | ||
| 143 | + # fall through to exists | ||
| 144 | + body = {"query": {"exists": {"field": cf.field}}} | ||
| 145 | + return int(es.count(index, body=body)) | ||
| 146 | + | ||
| 147 | + | ||
| 148 | +def _print_json(obj: Any) -> None: | ||
| 149 | + print(json.dumps(obj, ensure_ascii=False, indent=2, sort_keys=False)) | ||
| 150 | + | ||
| 151 | + | ||
def compare_mapping(index_a: str, index_b: str, mapping_a: Dict[str, Any], mapping_b: Dict[str, Any]) -> None:
    """Print a flattened mapping diff between two indices.

    Reports field paths present in only one index and paths whose declared
    types disagree; each detail list is truncated to its first 50 entries.
    """
    flat_a = _walk_mapping_properties(mapping_a.get("mappings", {}).get("properties", {}) or {})
    flat_b = _walk_mapping_properties(mapping_b.get("mappings", {}).get("properties", {}) or {})

    keys_a, keys_b = set(flat_a), set(flat_b)
    only_a = sorted(keys_a - keys_b)
    only_b = sorted(keys_b - keys_a)
    type_diff = sorted(k for k in keys_a & keys_b if flat_a[k] != flat_b[k])

    print("\n" + "=" * 90)
    print("Mapping diff (flattened field paths + types)")
    print("=" * 90)
    print(f"index_a: {index_a}")
    print(f"index_b: {index_b}")
    print(f"only_in_a: {len(only_a)}")
    print(f"only_in_b: {len(only_b)}")
    print(f"type_diff: {len(type_diff)}")

    def _show(title: str, names: List[str], fmt) -> None:
        # Each section prints at most the first 50 entries plus a "more" note.
        if not names[:50]:
            return
        print(f"\n{title}")
        for name in names[:50]:
            print(fmt(name))
        if len(names) > 50:
            print(f"  ... and {len(names) - 50} more")

    _show("Fields only in index_a (first 50):", only_a, lambda f: f"  - {f} ({flat_a.get(f)})")
    _show("Fields only in index_b (first 50):", only_b, lambda f: f"  - {f} ({flat_b.get(f)})")
    _show("Fields with different types (first 50):", type_diff, lambda f: f"  - {f}: a={flat_a.get(f)} b={flat_b.get(f)}")
| 189 | + | ||
| 190 | + | ||
def compare_coverage(
    es,
    index_a: str,
    index_b: str,
    mapping_a: Dict[str, Any],
    mapping_b: Dict[str, Any],
    fields: List[str],
    nested_fields: List[str],
) -> None:
    """Print per-field document-coverage counts for both indices.

    Nested-query inference runs against both mappings; if the two disagree
    (which should not happen), index_a's inference wins with a warning so
    both sides are still counted over the same field list.
    """
    inferred_a = _infer_coverage_fields(mapping_a, fields, nested_fields)
    inferred_b = _infer_coverage_fields(mapping_b, fields, nested_fields)

    if [c.field for c in inferred_a] != [c.field for c in inferred_b]:
        print("\n[warn] coverage field list differs between indices; using index_a inference as baseline")

    print("\n" + "=" * 90)
    print("Field coverage stats (count of docs where field exists)")
    print("=" * 90)
    print(f"index_a: {index_a}")
    print(f"index_b: {index_b}")

    for cf in inferred_a:
        mode = f"nested(path={cf.nested_path})" if cf.nested_path else "exists"
        count_a = _count_exists(es, index_a, cf)
        count_b = _count_exists(es, index_b, cf)
        print(f"\n- {cf.field} [{mode}]")
        print(f"  {index_a}: {count_a}")
        print(f"  {index_b}: {count_b}")
| 221 | + | ||
| 222 | + | ||
def compare_random_samples(
    es,
    index_a: str,
    index_b: str,
    sample_size: int,
    random_seed: Optional[int],
) -> None:
    """Sample random ``_id``s from index_a and diff ``_source`` field paths doc-by-doc.

    Prints a JSON summary: how many sampled ids exist in both indices, how
    many are missing on either side (with examples), and which field paths
    appear only in one side's documents.
    """
    print("\n" + "=" * 90)
    print("Random sample diff (same _id; diff _source field paths)")
    print("=" * 90)
    print(f"sample_size: {sample_size}")

    # A fixed seed in random_score makes the sample reproducible across runs.
    random_score: Dict[str, Any] = {}
    if random_seed is not None:
        random_score["seed"] = random_seed

    # Only _ids are needed here, so _source is disabled on the sampling query.
    sample_body = {
        "size": sample_size,
        "_source": False,
        "query": {"function_score": {"query": {"match_all": {}}, "random_score": random_score}},
    }

    # Use the underlying client directly to avoid passing duplicate `size`
    # parameters through the wrapper.
    resp = es.client.search(index=index_a, body=sample_body)
    hits = (((resp or {}).get("hits") or {}).get("hits") or [])
    ids = [h.get("_id") for h in hits if h.get("_id") is not None]

    if not ids:
        print("No hits returned; cannot sample.")
        return

    # mget in chunks
    def mget(index: str, ids_: List[str]) -> Dict[str, Dict[str, Any]]:
        # Fetch _source per id in batches of 500; ids that are not found
        # (or lack a dict _source) are silently skipped.
        out: Dict[str, Dict[str, Any]] = {}
        for batch in _chunks(ids_, 500):
            docs = es.client.mget(index=index, body={"ids": batch}).get("docs") or []
            for d in docs:
                if d.get("found") and d.get("_id") and isinstance(d.get("_source"), dict):
                    out[d["_id"]] = d["_source"]
        return out

    a_docs = mget(index_a, ids)
    b_docs = mget(index_b, ids)

    # Ids sampled from index_a that resolved on one side but not the other.
    missing_in_b = [i for i in ids if i in a_docs and i not in b_docs]
    missing_in_a = [i for i in ids if i in b_docs and i not in a_docs]

    only_in_a: Set[str] = set()
    only_in_b: Set[str] = set()

    # For ids present in both, accumulate the field-path differences.
    matched = 0
    for _id in ids:
        if _id in a_docs and _id in b_docs:
            matched += 1
            pa = _field_paths_from_source(a_docs[_id])
            pb = _field_paths_from_source(b_docs[_id])
            only_in_a |= (pa - pb)
            only_in_b |= (pb - pa)

    summary = {
        "sample_size": len(ids),
        "matched": matched,
        "missing_in_index_b_count": len(missing_in_b),
        "missing_in_index_a_count": len(missing_in_a),
        "missing_in_index_b_example": missing_in_b[:5],
        "missing_in_index_a_example": missing_in_a[:5],
        "fields_only_in_index_a_count": len(only_in_a),
        "fields_only_in_index_b_count": len(only_in_b),
        "fields_only_in_index_a_first80": sorted(list(only_in_a))[:80],
        "fields_only_in_index_b_first80": sorted(list(only_in_b))[:80],
    }
    _print_json(summary)
| 296 | + | ||
| 297 | + | ||
def main() -> int:
    """CLI entry point: parse args, connect to ES, run the three comparisons.

    Returns 0 on success, 2 when the connection, an index, or a mapping
    lookup fails.
    """
    # Function-scope import keeps the module header untouched; replaces the
    # previous non-idiomatic `__import__("os")` call.
    import os

    parser = argparse.ArgumentParser(description="Compare two ES indices (mapping + data coverage + random sample).")
    parser.add_argument("index_a", help="Index A name")
    parser.add_argument("index_b", help="Index B name")
    parser.add_argument("--sample-size", type=int, default=25, help="Random sample size (default: 25)")
    parser.add_argument("--seed", type=int, default=None, help="Random seed for random_score (optional)")
    parser.add_argument(
        "--es-url",
        default=None,
        help="Elasticsearch URL. If omitted, uses env ES (preferred) or config/config.yaml.",
    )
    parser.add_argument(
        "--es-auth",
        default=None,
        help="Basic auth in 'user:pass' form. If omitted, uses env ES_AUTH or config credentials.",
    )
    parser.add_argument(
        "--fields",
        default="title.zh,vendor.zh,keywords.zh,tags.zh,keywords.en,tags.en,enriched_taxonomy_attributes,image_embedding.url,enriched_attributes.name",
        help="Comma-separated fields to compute coverage for (default: a sensible set)",
    )
    parser.add_argument(
        "--fields-nested",
        default="image_embedding.url,enriched_attributes.name",
        help="Comma-separated fields that must be treated as nested exists (default: image_embedding.url,enriched_attributes.name)",
    )
    args = parser.parse_args()

    # Prefer doc-style env vars (ES/ES_AUTH) to match ops workflow in docs/常用查询 - ES.md.
    # Fallback to config/config.yaml for repo-local tooling.
    env = os.environ
    es_url = args.es_url or (env.get("ES") or env.get("ES_HOST") or None)
    es_auth = args.es_auth or env.get("ES_AUTH")
    # Doc convention: if ES is unset but auth is provided, default to localhost:9200.
    if not es_url and es_auth:
        es_url = "http://127.0.0.1:9200"

    if es_url:
        username = password = None
        if es_auth and ":" in es_auth:
            username, password = es_auth.split(":", 1)
        es = ESClient(hosts=[es_url], username=username, password=password)
    else:
        # No explicit URL anywhere: let the shared helper read the repo config.
        es = get_es_client_from_env()

    if not es.ping():
        print("✗ Cannot connect to Elasticsearch")
        return 2

    if not es.index_exists(args.index_a):
        print(f"✗ index not found: {args.index_a}")
        return 2
    if not es.index_exists(args.index_b):
        print(f"✗ index not found: {args.index_b}")
        return 2

    mapping_all_a = es.get_mapping(args.index_a) or {}
    mapping_all_b = es.get_mapping(args.index_b) or {}
    if args.index_a not in mapping_all_a or args.index_b not in mapping_all_b:
        print("✗ Failed to fetch mappings for both indices")
        return 2

    mapping_a = mapping_all_a[args.index_a]
    mapping_b = mapping_all_b[args.index_b]

    compare_mapping(args.index_a, args.index_b, mapping_a, mapping_b)

    fields = [x for x in (args.fields or "").split(",") if x.strip()]
    nested_fields = [x for x in (args.fields_nested or "").split(",") if x.strip()]
    compare_coverage(es, args.index_a, args.index_b, mapping_a, mapping_b, fields, nested_fields)

    compare_random_samples(es, args.index_a, args.index_b, args.sample_size, args.seed)

    return 0
| 372 | + | ||
| 373 | + | ||
if __name__ == "__main__":
    # SystemExit propagates main()'s integer return code to the shell.
    raise SystemExit(main())
scripts/start_eval_web.sh
| @@ -9,6 +9,7 @@ source ./activate.sh | @@ -9,6 +9,7 @@ source ./activate.sh | ||
| 9 | EVAL_WEB_PORT="${EVAL_WEB_PORT:-6010}" | 9 | EVAL_WEB_PORT="${EVAL_WEB_PORT:-6010}" |
| 10 | EVAL_WEB_HOST="${EVAL_WEB_HOST:-0.0.0.0}" | 10 | EVAL_WEB_HOST="${EVAL_WEB_HOST:-0.0.0.0}" |
| 11 | TENANT_ID="${TENANT_ID:-163}" | 11 | TENANT_ID="${TENANT_ID:-163}" |
| 12 | +DATASET_ID="${REPO_EVAL_DATASET_ID:-core_queries}" | ||
| 12 | QUERIES="${REPO_EVAL_QUERIES:-scripts/evaluation/queries/queries.txt}" | 13 | QUERIES="${REPO_EVAL_QUERIES:-scripts/evaluation/queries/queries.txt}" |
| 13 | 14 | ||
| 14 | GREEN='\033[0;32m' | 15 | GREEN='\033[0;32m' |
| @@ -21,10 +22,11 @@ echo -e "${GREEN}========================================${NC}" | @@ -21,10 +22,11 @@ echo -e "${GREEN}========================================${NC}" | ||
| 21 | echo -e "\n${YELLOW}Evaluation UI:${NC} ${GREEN}http://localhost:${EVAL_WEB_PORT}/${NC}" | 22 | echo -e "\n${YELLOW}Evaluation UI:${NC} ${GREEN}http://localhost:${EVAL_WEB_PORT}/${NC}" |
| 22 | echo -e "${YELLOW}Requires backend for live search:${NC} ${GREEN}http://localhost:${API_PORT:-6002}${NC}\n" | 23 | echo -e "${YELLOW}Requires backend for live search:${NC} ${GREEN}http://localhost:${API_PORT:-6002}${NC}\n" |
| 23 | 24 | ||
| 24 | -export EVAL_WEB_PORT EVAL_WEB_HOST TENANT_ID REPO_EVAL_QUERIES | 25 | +export EVAL_WEB_PORT EVAL_WEB_HOST TENANT_ID REPO_EVAL_DATASET_ID REPO_EVAL_QUERIES |
| 25 | 26 | ||
| 26 | exec python scripts/evaluation/serve_eval_web.py serve \ | 27 | exec python scripts/evaluation/serve_eval_web.py serve \ |
| 27 | --tenant-id "${TENANT_ID}" \ | 28 | --tenant-id "${TENANT_ID}" \ |
| 29 | + --dataset-id "${DATASET_ID}" \ | ||
| 28 | --queries-file "${QUERIES}" \ | 30 | --queries-file "${QUERIES}" \ |
| 29 | --host "${EVAL_WEB_HOST}" \ | 31 | --host "${EVAL_WEB_HOST}" \ |
| 30 | --port "${EVAL_WEB_PORT}" | 32 | --port "${EVAL_WEB_PORT}" |
| @@ -0,0 +1,18 @@ | @@ -0,0 +1,18 @@ | ||
| 1 | +from config.loader import get_app_config | ||
| 2 | +from scripts.evaluation.eval_framework.datasets import resolve_dataset | ||
| 3 | + | ||
| 4 | + | ||
def test_search_evaluation_registry_contains_expected_datasets() -> None:
    """The config registry exposes both eval datasets and defaults to core_queries."""
    se = get_app_config().search_evaluation
    registered_ids = {item.dataset_id for item in se.datasets}
    assert "core_queries" in registered_ids
    assert "clothing_top771" in registered_ids
    assert se.default_dataset_id == "core_queries"
| 11 | + | ||
| 12 | + | ||
def test_resolve_dataset_returns_expected_query_counts() -> None:
    """Both registered datasets resolve; clothing_top771 carries 771 queries."""
    core_ds = resolve_dataset(dataset_id="core_queries")
    clothing_ds = resolve_dataset(dataset_id="clothing_top771")
    assert clothing_ds.dataset_id == "clothing_top771"
    assert clothing_ds.query_count == 771
    assert core_ds.query_count > 0