docs/issues/issue-2026-04-16-bayes%E5%AF%BB%E5%8F%82-TODO.md

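I. Expand the evaluation annotation set
II. Tune parameters on the larger annotation set
III. (Deferred for now. Show restraint: do not do this while the business is at an early stage! Build what is general and cheap to maintain!) Use LTR for coarse rank (apply several nonlinear mappings to each factor, then fit a pairwise objective with an FM; see RankNet)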
I. Expand the evaluation annotation set
Refer to the current evaluation framework:
@scripts/evaluation/README.md @scripts/evaluation/eval_framework/framework.py 
@start_eval.sh.sh
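The current automated evaluation system is built on 54 evaluation queries (queries.txt), which makes it easy to see how a strategy performs on that evaluation set.
I need to enlarge the evaluation sample to 1k queries. The file is scripts/evaluation/queries/all_keywords.txt.top1w.shuf.top1k
However, this file also contains some search terms outside the "clothing, shoes, and hats" category, so please clean it first.
Use an LLM to do the filtering, 50 queries per call, with this prompt: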
Please filter out the queries from the following list that do not belong to the clothing, shoes, and accessories category. Output the original list of queries, one query per line, without any additional content.
Then remove the queries returned by the LLM from the original query list.
Output file: all_keywords.txt.top1w.shuf.top1k.clothing_filtered
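The batching and removal step above could be sketched roughly as follows. This is a minimal sketch, not the project's actual script: `ask_llm` is a hypothetical callable (prompt string in, reply string out) standing in for whichever LLM client is used, and the code assumes the reply lists the non-clothing queries one per line. Lines in the reply that do not exactly match an input query are ignored, so a hallucinated or reworded line cannot delete anything.

```python
from typing import Callable, Iterator, List

# Prompt text taken verbatim from the issue.
PROMPT = (
    "Please filter out the queries from the following list that do not belong "
    "to the clothing, shoes, and accessories category. Output the original list "
    "of queries, one query per line, without any additional content."
)


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def filter_clothing_queries(
    queries: List[str],
    ask_llm: Callable[[str], str],  # hypothetical LLM wrapper
    batch_size: int = 50,
) -> List[str]:
    """Return the queries that the LLM did not flag, preserving input order."""
    kept: List[str] = []
    for batch in chunked(queries, batch_size):
        reply = ask_llm(PROMPT + "\n\n" + "\n".join(batch))
        flagged = {line.strip() for line in reply.splitlines() if line.strip()}
        # Only trust reply lines that exactly match a query we sent.
        flagged &= set(batch)
        kept.extend(q for q in batch if q not in flagged)
    return kept
```

Writing `kept` one query per line would then produce all_keywords.txt.top1w.shuf.top1k.clothing_filtered.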
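Then use all_keywords.txt.top1w.shuf.top1k.clothing_filtered as the query set and run the annotation pipeline to build a new annotation set.
Going forward, the Batch Evaluation button in the eval-web service should support multiple evaluation sets, and the History panel on the left should show the evaluation results for each set. Please think through how to support and design this. Produce one unified design rather than patch-style support.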
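One possible shape for a unified design is to make the evaluation set a first-class entity and key every batch report by it, so that Batch Evaluation and History both take a dataset id instead of assuming queries.txt. The sketch below is only an illustration under that assumption. All class names, field names, and dataset ids are hypothetical and are not taken from the existing eval_framework code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class EvalDataset:
    """A named evaluation set (hypothetical model)."""
    dataset_id: str
    queries_path: str
    description: str = ""


@dataclass
class BatchReport:
    """One Batch Evaluation run, always tied to exactly one dataset."""
    batch_id: str
    dataset_id: str
    primary_metric_score: float


@dataclass
class EvalRegistry:
    datasets: Dict[str, EvalDataset] = field(default_factory=dict)
    reports: List[BatchReport] = field(default_factory=list)

    def register(self, dataset: EvalDataset) -> None:
        if dataset.dataset_id in self.datasets:
            raise ValueError(f"duplicate dataset_id: {dataset.dataset_id}")
        self.datasets[dataset.dataset_id] = dataset

    def add_report(self, report: BatchReport) -> None:
        # Reject reports for unknown datasets so History never shows orphans.
        if report.dataset_id not in self.datasets:
            raise KeyError(report.dataset_id)
        self.reports.append(report)

    def history(self, dataset_id: str) -> List[BatchReport]:
        """Reports for one dataset, in insertion order (for the History panel)."""
        return [r for r in self.reports if r.dataset_id == dataset_id]
```

Under this model the existing 54-query set becomes just one registered dataset, and scores from different datasets are never mixed in one history list.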
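II. Tune parameters on the larger annotation set
I previously ran one round of tuning on the 54 evaluation queries (queries.txt). The best parameter set found in that run was: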
0.641241 {'es_bias': '7.214', 'es_exponent': '0.2025', 'text_bias': '4.0', 'text_exponent': '1.584', 'text_translation_weight': '1.4441', 'knn_text_weight': '0.1', 'knn_image_weight': '5.6232', 'knn_tie_breaker':
    '0.021', 'knn_bias': '0.0019', 'knn_exponent': '11.8477', 'knn_text_bias': '2.3125', 'knn_text_exponent': '1.1547', 'knn_image_bias': '0.9641', 'knn_image_exponent': '5.8671'}
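This parameter set is rather extreme: text_bias is too large (the text score lies between 0 and 1, so adding 4 dilutes it heavily), and the image exponent is too large. It is nevertheless the best set on this dataset, so I suspect overfitting. The dataset therefore needs to grow: first expand the annotation set, then continue tuning on the expanded set.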
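The dilution concern can be made concrete with a small calculation. The sketch below assumes each fusion factor has the form `(score + bias) ** exponent`, which is only a guess based on the parameter names; the actual coarse_rank fusion formula is not shown in this issue. Under that assumption, the ratio between the factor at score 1 and at score 0 measures how much the text score can still move the result.

```python
def factor(score: float, bias: float, exponent: float) -> float:
    """Assumed factor shape: (score + bias) ** exponent. Not confirmed by the source."""
    return (score + bias) ** exponent


def dynamic_range(bias: float, exponent: float, lo: float = 0.0, hi: float = 1.0) -> float:
    """Ratio of the factor at the best score to the factor at the worst score."""
    return factor(hi, bias, exponent) / factor(lo, bias, exponent)


# Tuned set: text_bias=4.0, text_exponent=1.584
tuned = dynamic_range(4.0, 1.584)
# Baseline: text_bias=0.1, text_exponent=0.35
baseline = dynamic_range(0.1, 0.35)
print(f"tuned text factor range: {tuned:.2f}x, baseline: {baseline:.2f}x")
```

Under this assumed form the tuned text factor spans only about 1.42x across the whole 0 to 1 score range, versus about 2.31x for the baseline, which is consistent with the worry that text relevance has been flattened.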
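I have already created a new annotation set. Please start the tuning job and get it running. Once the program finishes we should have tuning results, and next time you can analyze them and draw conclusions.
The previous tuning round:
My tuning request at the time:
Please tune the coarse_rank fusion formula:
  The current baseline is the following set, Primary_Metric_Score: 0.637642: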
  coarse_rank:
  ...
    fusion:
      es_bias: 10.0
      es_exponent: 0.05
      text_bias: 0.1
      text_exponent: 0.35
      text_translation_weight: 1.0
      knn_text_weight: 1.0
      knn_image_weight: 2.0
      knn_tie_breaker: 0.3
      knn_bias: 0.2
      knn_exponent: 5.6
      knn_text_bias: 0.2
      knn_text_exponent: 0.0
      knn_image_bias: 0.2
      knn_image_exponent: 0.0
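  The evaluation metrics are in /data/saas-search/artifacts/search_evaluation/batch_reports/batch_20260415T150754Z_00b6a8aa3d.md
  Using this as the baseline, think broadly and tune over a fairly wide range. Because every restart plus evaluation takes several minutes, please write a tuning framework: on top of the framework, define multiple parameter sets, write the scripts, and after each round collect the results and adjust the parameter distribution automatically (a Cartesian-product sweep over this many parameters is too expensive, so consider Bayesian optimization or a similar method that converges automatically over multiple scripted iterations).
  The backend must be restarted after every parameter change (sometimes eval-web also seems to go down after a backend restart; if so, track down the cause and fix it):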
  ./restart.sh backend
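  Note: please debug the scripts and run one round of analysis. The end result should be a reusable set of tuning scripts that I can rerun next time (for the same set of parameters), that iterate automatically (adjusting the parameter distribution), collect the metrics for each parameter set, and converge on the best combination.
The tuning scripts you provided:
  The long-running automatic tuning scripts are set up. The core files are:
  - Tuning main program: scripts/evaluation/tune_fusion.py
  - Wide-range search space: scripts/evaluation/tuning/coarse_rank_fusion_space.yaml
  - Background long-run launcher: scripts/evaluation/start_coarse_fusion_tuning_long.sh
  - Resume script: scripts/evaluation/resume_coarse_fusion_tuning_long.sh
  - Usage notes: scripts/evaluation/tuning/README.md
  This approach is not a Cartesian-product sweep. It repeatedly iterates "seed experiments + random exploration + adaptive sampling based on the distribution of the best historical results". The search range has been widened, several seeds far from the baseline have been added, and jitter around the baseline is retained, which suits slow convergence over a long run.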
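To illustrate what "random exploration + adaptive sampling based on the best historical results" can look like, here is a minimal sketch. It is not the code in tune_fusion.py, and the function name, `top_k`, and `explore_prob` are assumptions made for illustration. The technique shown is elite resampling in the spirit of the cross-entropy method (sample near the top-k trials, with some uniform exploration), which is simpler than Gaussian-process Bayesian optimization.

```python
import random
import statistics
from typing import Dict, List, Tuple

Bounds = Dict[str, Tuple[float, float]]
Trial = Tuple[Dict[str, float], float]  # (params, score), higher score is better


def propose(
    bounds: Bounds,
    trials: List[Trial],
    rng: random.Random,
    top_k: int = 5,
    explore_prob: float = 0.3,
) -> Dict[str, float]:
    """Propose one candidate: uniform exploration or a Gaussian around the elite trials."""
    ranked = sorted(trials, key=lambda t: t[1], reverse=True)
    elite = [params for params, _ in ranked[:top_k]]
    candidate: Dict[str, float] = {}
    for name, (lo, hi) in bounds.items():
        if len(elite) < 2 or rng.random() < explore_prob:
            # Not enough history yet, or an exploration draw: sample the full range.
            value = rng.uniform(lo, hi)
        else:
            values = [p[name] for p in elite]
            mu = statistics.mean(values)
            # Keep a minimum spread so the search does not collapse too early.
            sigma = max(statistics.stdev(values), 0.05 * (hi - lo))
            value = rng.gauss(mu, sigma)
        candidate[name] = min(hi, max(lo, value))  # clip to the search range
    return candidate
```

Each parameter is sampled independently here, so interactions between parameters (for example a bias and its exponent) are not modeled; a real tuner may want to account for them.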
  Start a round directly:
  ./scripts/evaluation/start_coarse_fusion_tuning_long.sh
  To run for longer:
  MAX_EVALS=48 BATCH_SIZE=3 CANDIDATE_POOL_SIZE=512 RUN_NAME=coarse_fusion_long_001 \
  ./scripts/evaluation/start_coarse_fusion_tuning_long.sh
  View the log:
  tail -f artifacts/search_evaluation/tuning_launches/<run_name>.log
  View the results directory:
  ls artifacts/search_evaluation/tuning_runs/<run_name>/
  Resume a run:
  ./scripts/evaluation/resume_coarse_fusion_tuning_long.sh <run_name>
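  Each round automatically does the following:
  - Generates a batch of candidate coarse_rank.fusion parameters
  - Writes them to the config and restarts the backend
  - Checks eval-web and brings it back up if necessary
  - Runs the evaluation and collects Primary_Metric_Score
  - Updates trials.jsonl, leaderboard.csv, and summary.md
  - Adjusts the sampling distribution for the next round based on the historical results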
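The bookkeeping step in the list above (appending to trials.jsonl and refreshing leaderboard.csv) could be sketched as below. The record fields and CSV columns are assumptions made for illustration; the real schema written by tune_fusion.py is not shown in this issue. Appending one JSON line per trial keeps earlier results intact if a run is interrupted, which is the property a resume step would rely on.

```python
import csv
import json
from pathlib import Path
from typing import Dict, List


def record_trial(run_dir: Path, name: str, params: Dict[str, float], score: float) -> List[dict]:
    """Append one trial to trials.jsonl, rewrite leaderboard.csv, return ranked trials."""
    run_dir.mkdir(parents=True, exist_ok=True)
    trials_path = run_dir / "trials.jsonl"

    # Append-only log: one JSON object per line (hypothetical schema).
    with trials_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"name": name, "score": score, "params": params}) + "\n")

    # Rebuild the leaderboard from the full log so it always matches the log.
    lines = trials_path.read_text(encoding="utf-8").splitlines()
    trials = [json.loads(line) for line in lines if line.strip()]
    trials.sort(key=lambda t: t["score"], reverse=True)

    with (run_dir / "leaderboard.csv").open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["rank", "name", "score", "params"])
        for rank, trial in enumerate(trials, start=1):
            writer.writerow(
                [rank, trial["name"], trial["score"], json.dumps(trial["params"], sort_keys=True)]
            )
    return trials
```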
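  Run one long job first. When it finishes, send me the run_name or the corresponding experiment directory described in scripts/evaluation/tuning/README.md, and I will do the next analysis and narrow the search space based on the results.
› I ran: MAX_EVALS=400 BATCH_SIZE=3 CANDIDATE_POOL_SIZE=512 RUN_NAME=coarse_fusion_long_001 ./scripts/evaluation/start_coarse_fusion_tuning_long.sh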
  run_name=coarse_fusion_long_001
  pid=2218620
  log=artifacts/search_evaluation/tuning_launches/coarse_fusion_long_001.log
  pid_file=artifacts/search_evaluation/tuning_launches/coarse_fusion_long_001.pid
  cmd_file=artifacts/search_evaluation/tuning_launches/coarse_fusion_long_001.cmd
  run_dir=artifacts/search_evaluation/tuning_runs/coarse_fusion_long_001
  tail -f artifacts/search_evaluation/tuning_launches/coarse_fusion_long_001.log
  cat artifacts/search_evaluation/tuning_runs/coarse_fusion_long_001/leaderboard.csv
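• I will first look at the log and the current leaderboard for this long run to confirm whether the job is progressing normally, whether it was interrupted abnormally, and in which direction the current best parameters lie.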
• Explored
  └ Read coarse_fusion_long_001.log, leaderboard.csv
    List coarse_fusion_long_001
    Read summary.md
───
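• The results already show a clear trend. I will add a layer of aggregate analysis to see which parameter directions recur in the top range, so as not to fixate on a single best point.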
• Ran ps -p 2218620 -o pid=,etimes=,cmd=
  └ (no output)
• Ran python - <<'PY'
  │ import csv, statistics, math
  │ from pathlib import Path
  │ … +15 lines
  └ rows 385
    … +61 lines (ctrl + t to view transcript)
    knn_image_bias: mean=0.5631 median=0.2000 min=0.0010 max=4.0000
    knn_image_exponent: mean=4.6316 median=5.8680 min=0.0000 max=6.0000
• Ran python - <<'PY'
  │ import csv
  │ from pathlib import Path
  │ … +19 lines
  └ baseline_score 0.637642
    … +30 lines (ctrl + t to view transcript)
    10 bo_234 0.641241 {'es_bias': '7.214', 'es_exponent': '0.2025', 'text_bias': '4.0', 'text_exponent': '1.584', 'text_translation_weight': '1.4441', 'knn_text_weight': '0.1', 'knn_image_weight': '5.6232', 'knn_tie_breaker':
    '0.021', 'knn_bias': '0.0019', 'knn_exponent': '11.8477', 'knn_text_bias': '2.3125', 'knn_text_exponent': '1.1547', 'knn_image_bias': '0.9641', 'knn_image_exponent': '5.8671'}
This run was terminated by an external cause (the disk filled up). The set above is the best parameter set it found.
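Since the run died on a full disk, a cheap guard before each round may be worth adding to the next long run. The sketch below uses the standard-library `shutil.disk_usage`; the function name and the threshold are assumptions, and this check is not known to exist in the current scripts.

```python
import shutil


def has_free_space(path: str, min_free_bytes: int) -> bool:
    """Return True if the filesystem containing `path` has at least `min_free_bytes` free."""
    return shutil.disk_usage(path).free >= min_free_bytes


def ensure_free_space(path: str, min_free_bytes: int) -> None:
    """Raise before starting a round rather than failing halfway through it."""
    if not has_free_space(path, min_free_bytes):
        free = shutil.disk_usage(path).free
        raise RuntimeError(
            f"only {free} bytes free under {path!r}, need at least {min_free_bytes}"
        )
```

For example, calling `ensure_free_space("artifacts", 2 * 1024**3)` at the top of each tuning round would stop the job cleanly when less than 2 GiB remains; the 2 GiB figure is arbitrary and should be sized to what one round actually writes.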