22 Apr, 2026
2 commits
- Changed the batch timeout so that a run can go on indefinitely:
  - [tune_fusion.py](/data/saas-search/scripts/evaluation/tune_fusion.py:400)
  - When `--batch-eval-timeout-sec <= 0`, `subprocess.run` is no longer given a Python-level timeout
- Added a resilient wrapper that handles automatic resumption:
  - [run_coarse_fusion_tuning_resilient.sh](/data/saas-search/scripts/evaluation/run_coarse_fusion_tuning_resilient.sh)
  - Logic: it checks how many live evals in `trials.jsonl` have completed, and keeps calling `resume-run` until the count reaches `max_evals`
  - Even after an abnormal exit, it sleeps and then automatically continues from the existing `run_dir`
- Both the start and resume scripts now use resilient mode:
  - [start_coarse_fusion_tuning_long.sh](/data/saas-search/scripts/evaluation/start_coarse_fusion_tuning_long.sh)
  - [resume_coarse_fusion_tuning_long.sh](/data/saas-search/scripts/evaluation/resume_coarse_fusion_tuning_long.sh)

**Current task**
- `run_name`: `coarse_fusion_clothing_top771_resilient_20260422T091650Z`
- `run_dir`: [coarse_fusion_clothing_top771_resilient_20260422T091650Z](/data/saas-search/artifacts/search_evaluation/tuning_runs/coarse_fusion_clothing_top771_resilient_20260422T091650Z)
- `launch log`: [coarse_fusion_clothing_top771_resilient_20260422T091650Z.log](/data/saas-search/artifacts/search_evaluation/tuning_launches/coarse_fusion_clothing_top771_resilient_20260422T091650Z.log)

**Confirmed**
- The wrapper has started and entered `attempt=1`
- The argument actually passed is `--batch-eval-timeout-sec 0`
- `tune_fusion.py` is running
- `build_annotation_set.py batch` is already running
- `eval.log` has already printed evaluation progress for the first few queries of this round, so the run is not idling

**Monitoring**
- `tail -f artifacts/search_evaluation/tuning_launches/coarse_fusion_clothing_top771_resilient_20260422T091650Z.log`
- `tail -f logs/eval.log`
- `tail -f artifacts/search_evaluation/tuning_runs/coarse_fusion_clothing_top771_resilient_20260422T091650Z/trials.jsonl`
- `cat artifacts/search_evaluation/tuning_runs/coarse_fusion_clothing_top771_resilient_20260422T091650Z/leaderboard.csv`

**Key differences from the previous run**
- Last time: a single batch round was cut off by the Python timeout
- This time: no Python timeout on a single round, plus an outer wrapper that resumes automatically
- So long runs, mid-run interruptions, and recoveries all keep advancing within the same `run_dir`
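
For readers who want the two mechanisms in one place, here is a minimal Python sketch of the timeout rule and the resume loop described above. It is not the code in `tune_fusion.py` or `run_coarse_fusion_tuning_resilient.sh` (the real wrapper is a shell script). The function names, the `"status": "ok"` marker for a completed live eval, and the `run_once` callback are all hypothetical, chosen only for illustration. The one library fact relied on is that `subprocess.run(..., timeout=None)` applies no timeout.

```python
import json
import sys
import time
from pathlib import Path
from typing import Callable, Optional


def effective_timeout(timeout_sec: float) -> Optional[float]:
    """Map the CLI value to what subprocess.run expects.

    A value <= 0 means "no Python-level timeout", which subprocess.run
    expresses as timeout=None.
    """
    return timeout_sec if timeout_sec > 0 else None


def count_completed_evals(trials_path: Path) -> int:
    """Count JSONL records that look like completed live evals.

    ASSUMPTION: a completed eval is a JSON object with "status" == "ok".
    The real trials.jsonl schema is not shown in the notes above.
    """
    if not trials_path.exists():
        return 0
    done = 0
    for line in trials_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a half-written line left by a crash
        if isinstance(record, dict) and record.get("status") == "ok":
            done += 1
    return done


def run_resilient(
    run_once: Callable[[int], None],
    trials_path: Path,
    max_evals: int,
    sleep_sec: float = 30.0,
) -> int:
    """Call run_once until trials_path holds max_evals completed evals.

    run_once(attempt) stands in for "start or resume the tuning run";
    if it raises, the loop sleeps and tries again against the same file.
    Returns the number of attempts made.
    """
    attempt = 0
    while count_completed_evals(trials_path) < max_evals:
        attempt += 1
        try:
            run_once(attempt)
        except Exception as exc:
            print(f"attempt={attempt} failed: {exc!r}; retrying", file=sys.stderr)
            time.sleep(sleep_sec)
    return attempt
```

In this sketch, `run_once` would wrap something like `subprocess.run(cmd, timeout=effective_timeout(0), check=True)`, so a single round is never cut off by Python. Progress is judged only from the contents of `trials.jsonl`, which is why an interrupted attempt can be retried against the same `run_dir` without losing completed work.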