tangwang
Issue notes (evaluation framework <R...
1. Change the primary metric to:
NDCG@20 + NDCG@50 + ERR@10 + Strong_Precision@10 + Strong_Precision@20 + Useful_Precision@50 + Avg_Grade@10 + Gain_Recall@20
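How these eight metrics combine into the single `Primary_Metric_Score` is not spelled out here; the NDCG@k component itself, though, is standard. A minimal sketch (assuming the conventional exponential gain, which may differ from the production implementation):

```python
import math

def ndcg_at_k(grades, k):
    """NDCG@k over graded labels listed in ranked order.
    Uses the standard exponential gain (2^grade - 1) / log2(rank + 1);
    whether the eval framework uses this exact gain is an assumption."""
    def dcg(gs):
        return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gs[:k]))
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0
```

With 4-level labels, a perfectly ordered list scores 1.0 and any inversion drops below it.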
2. Improve the logs, read the code to understand the ranking funnel, map out the important signals, and prepare for LTR.
3. Then tune parameters (the latest report, batch_20260402T054003Z_00b6a8aa3d.md, is the baseline):
Target only the rerank fusion formula (config.yaml):
fusion:
es_bias: 10.0
es_exponent: 0.05
rerank_bias: 0.1
rerank_exponent: 1.15
fine_bias: 0.1
fine_exponent: 1.0
text_bias: 0.1
text_exponent: 0.25
text_translation_weight: 0.8
knn_text_weight: 1.0
knn_image_weight: 2.0
knn_tie_breaker: 0.3
knn_bias: 0.6
knn_exponent: 0.4
Main directions to consider:
Adjust each bias (increase them)
Increase knn_image_weight
Increase knn_exponent
Increase es_exponent and decrease text_exponent
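The authoritative fusion formula lives in the code that consumes config.yaml; as a rough mental model for why these knobs matter (an assumed functional form, not the verified implementation), each stage might contribute `weight * (raw_score + bias) ** exponent` to the fused score:

```python
def fused_score(signals, cfg):
    # Hypothetical fusion form (an assumption, not the verified code):
    # each stage contributes weight * (raw_score + bias) ** exponent.
    total = 0.0
    for name, score in signals.items():
        bias = cfg.get(f"{name}_bias", 0.0)
        exponent = cfg.get(f"{name}_exponent", 1.0)
        weight = cfg.get(f"{name}_weight", 1.0)  # simplified vs. knn_text_weight etc.
        total += weight * (score + bias) ** exponent
    return total

cfg = {"es_bias": 10.0, "es_exponent": 0.05,
       "knn_bias": 0.6, "knn_exponent": 0.4}
```

Under this assumed form, a larger bias damps relative score differences within a stage while a larger exponent amplifies them, which lines up with the tuning directions listed above.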
4. Pick a few key queries, study the logs closely, and optimize based on your understanding, including but not limited to tuning the number of candidates retrieved by text and image kNN.
tassel maxi skirt
minimalist top
sock boots (many highly relevant items are not being retrieved)
图案连身衣
Throughout this work, take a higher-level view of the ranking system: redesign and optimize anything that looks unreasonable, and distill the accumulated experience and workflow into documentation.
Requirement:
[检索调参与LTR工作流.md](docs/检索调参与LTR工作流.md)
The ranking logging in `rerank_client.py` and `searcher.py` has been enhanced: it now outputs `retrieval_plan`, `ltr_summary`, and doc/stage-level `ltr_features`, making it easy to inspect the funnel directly and to support subsequent LTR. See functions such as `_build_ltr_feature_block` and `_summarize_ltr_features`. Next, perform an offline LTR fitting task (offline script validation).
The idea is to inspect and parse the feature logs in `./logs/backend_verbose.log`, which contains exactly and only the latest round of feature logs from the evaluation framework. Carefully consider the ranking system, examine the data, extract insights, and perform feature engineering. The labels have 4 levels. Refer to `[framework.py](scripts/evaluation/eval_framework/framework.py)`. Use the RankNet Pairwise Loss (no need for deltaNDCG weight; keep it simple first. The general idea is to form a valid pair for each pair of items with different labels within a query as a binary classification sample. Find a simple implementation of RankNet Pairwise Loss). Consider FM as the fitting function (do some simple feature engineering, such as applying various nonlinear transformations like pow, log, etc., to each numerical feature to expand into multiple features).
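The requested pair construction and RankNet pairwise logistic loss can be sketched minimally like this (pure-Python illustration; the function names are mine, not from the repo):

```python
import math
from itertools import combinations

def ranknet_pairs(labels):
    # Every within-query pair with different labels yields one training pair,
    # ordered so the higher-label doc comes first.
    return [(i, j) if labels[i] > labels[j] else (j, i)
            for i, j in combinations(range(len(labels)), 2)
            if labels[i] != labels[j]]

def ranknet_loss(scores, pairs):
    # Pairwise logistic loss: -log sigmoid(s_pos - s_neg), averaged over pairs.
    # log1p(exp(-m)) is the numerically friendly form of -log sigmoid(m).
    total = 0.0
    for pos, neg in pairs:
        total += math.log1p(math.exp(-(scores[pos] - scores[neg])))
    return total / max(len(pairs), 1)
```

A perfectly ordered query scores below log 2 (the loss of a zero-margin model), and swapping two docs raises it, which is the gradient signal the FM trains on.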
Answer:
Implemented an offline LTR fitter in [offline_ltr_fit.py](/data/saas-search/scripts/evaluation/offline_ltr_fit.py) that:
- parses `logs/backend_verbose.log` `search_response` events
- joins per-query `spu_id` rows with 4-level labels from `artifacts/search_evaluation/search_eval.sqlite3`
- builds engineered features from `ltr_features`, stage scores, ranks, and nonlinear transforms like `log1p`, `sqrt`, square, inverse, plus a few ratio/cross features
- trains an FM with RankNet-style pairwise logistic loss
- runs query-group 5-fold offline validation
- saves the fitted model and predictions as artifacts
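The nonlinear fan-out named in the feature-engineering bullet can be sketched as follows (the helper and the `__suffix` naming are illustrative; the actual script may differ):

```python
import math

def expand_numeric(name, value, eps=1e-6):
    """Fan one numeric feature out into several nonlinear views
    (the transforms named above: log1p, sqrt, square, inverse)."""
    v = max(value, 0.0)  # guard log1p/sqrt against negative inputs
    return {
        name: value,
        f"{name}__log1p": math.log1p(v),
        f"{name}__sqrt": math.sqrt(v),
        f"{name}__square": value * value,
        f"{name}__inverse": 1.0 / (value + eps),
    }
```

Each raw signal thus becomes five columns, letting a low-order model like an FM pick up curvature it could not express on the raw value alone.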
I validated it end-to-end with:
```bash
./.venv/bin/python scripts/evaluation/offline_ltr_fit.py
```
Latest run artifacts are in [summary.json](/data/saas-search/artifacts/search_evaluation/ltr_runs/offline_ltr_20260402T074937Z/summary.json) and [fm_ranknet_model.json](/data/saas-search/artifacts/search_evaluation/ltr_runs/offline_ltr_20260402T074937Z/fm_ranknet_model.json).
Key result: cross-validated `Primary_Metric_Score` improved from `0.641844` for current fused ranking to `0.654043` for the FM model. On the full-data refit, the model reaches `0.752009`, which is optimistic in-sample but shows the signal is learnable. The training set used `54` queries, `5400` docs, and `154592` valid label-different pairs.
Requirement:
1. The essence of FM is the weights of second-order interactions. Please output a readable listing of the feature names (including interaction features) and their corresponding weights, sorted by importance (you decide how to measure importance; don't overcomplicate it. Absent a better method, simply use the absolute value of the weight).
2. It is necessary to split 10 results as a test set to measure generalization performance, so as to determine whether overfitting has occurred.
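In a factorization machine, the effective coefficient of the cross term x_i * x_j is the inner product of the two latent vectors, so absolute-value importance can be computed directly from the model parameters. A pure-Python sketch (names and shapes are hypothetical):

```python
def fm_importances(w, V, names):
    """Rank FM weights by |value|: linear terms by |w_i|, pairwise terms by
    |<v_i, v_j>|, the effective coefficient of x_i * x_j in the FM score."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    linear = sorted(((n, abs(wi)) for n, wi in zip(names, w)),
                    key=lambda t: -t[1])
    pairs = sorted(((f"{names[i]} * {names[j]}", abs(dot(V[i], V[j])))
                    for i in range(len(names)) for j in range(i + 1, len(names))),
                   key=lambda t: -t[1])
    return linear, pairs
```

This is the simple |weight| criterion the requirement suggests; it ignores feature scale, so it is most meaningful when inputs are standardized first.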
Answer:
Updated [offline_ltr_fit.py](/data/saas-search/scripts/evaluation/offline_ltr_fit.py) to do both:
- export FM feature importance
- reserve a dedicated 10-query holdout test set for generalization
I interpreted “10 results as a test set” as `10 queries`, because ranking generalization should be measured at the query-group level, not by splitting docs from the same query across train/test.
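The query-level split described above can be sketched like this (illustrative helper, not the script's actual code):

```python
import random

def query_holdout(query_ids, n_test=10, seed=0):
    # Split at the query level so every doc of a query lands entirely in
    # train or entirely in test; doc-level splits would leak query context.
    unique = sorted(set(query_ids))
    rng = random.Random(seed)
    test_queries = set(rng.sample(unique, n_test))
    train = [i for i, q in enumerate(query_ids) if q not in test_queries]
    test = [i for i, q in enumerate(query_ids) if q in test_queries]
    return train, test
```

Sorting before sampling keeps the split deterministic for a given seed, so the holdout set stays stable across reruns.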
Artifacts from the latest run are under:
- [summary.json](/data/saas-search/artifacts/search_evaluation/ltr_runs/offline_ltr_20260402T080116Z/summary.json)
- [feature_importance_linear.csv](/data/saas-search/artifacts/search_evaluation/ltr_runs/offline_ltr_20260402T080116Z/feature_importance_linear.csv)
- [feature_importance_interactions.csv](/data/saas-search/artifacts/search_evaluation/ltr_runs/offline_ltr_20260402T080116Z/feature_importance_interactions.csv)
- [fm_ranknet_model.json](/data/saas-search/artifacts/search_evaluation/ltr_runs/offline_ltr_20260402T080116Z/fm_ranknet_model.json)
Key results:
- Cross-validation `Primary_Metric_Score`: `0.654043`
- Current fused baseline: `0.641844`
- 10-query holdout FM: `0.53056`
- 10-query holdout current fused baseline: `0.5674`
So the FM looks promising in CV, but the 10-query holdout is worse than the current baseline, which is a real overfitting/generalization warning.
Top linear features by absolute weight:
- `text_knn_score__square` → `0.9936`
- `knn_primary_score__square` → `0.9529`
- `has_translation_match` → `-0.9309`
- `knn_score__square` → `0.9224`
- `text_support_score__square` → `0.8481`
Top interaction features by absolute weight:
- `text_score_fallback_to_es * initial_rank__log` → `0.7099`
- `text_support_score__log1p * initial_rank__log` → `0.6814`
- `text_knn_score__square * initial_rank__log` → `-0.6419`
- `has_text_knn * initial_rank__log` → `0.5879`
- `translation_share * source_share` → `0.5509`
The 10 holdout queries were:
- `date night dress`
- `skiing trip insulated base layer`
- `streetwear t-shirt`
- `swim dress`
- `tall slim fit trousers`
- `tassel maxi skirt`
- `чёрное летнее платье`
- `فستان أسود متوسط الطول`
- `图案连身衣`
- `宽松开襟羊毛衫`
Next useful step is pruning the feature space and reducing the heavy dependence on rank-derived interactions like `initial_rank__log`, then rerunning the same holdout test.
The LTR-related content has been systematically added to [检索调参与LTR工作流.md](/data/saas-search/docs/检索调参与LTR工作流.md).
The document has been expanded into a version better suited for a formal project kickoff, with these layers filled in:
- The LTR data loop: logged features, offline labels, sample construction, training, validation, and pre-launch judgment
- How the current log structure supports LTR: what `retrieval_plan`, `ltr_features`, and `ranking_funnel` each provide
- Sample definition: why the basic unit is `query-doc`, converted into pairs only at training time
- Label sourcing and boundaries: `search_eval.sqlite3`, the semantics of the 4 grade levels, and the scope and risks of the current labeling scheme
- Feature engineering approach: raw features, nonlinear transforms, ratio features, and hand-crafted combination features ahead of crossing
- Model and loss choice: why the first version uses FM + RankNet pairwise loss
- Interpretation of the current offline results: the CV gain, the 10-query holdout drop, and what that implies
- How to interpret FM weights: first-order weights, second-order interaction weights, and using importance to guide subsequent feature pruning
- Suggested roadmap for a formal LTR project: solidifying data assets, feature ablation, model upgrades, and launch cadence
If you like, the next step could push this document further toward an executable plan by adding sections on:
- Design of the on-disk training-sample format
- Online/offline feature-consistency constraints
- A launch plan for integrating the LTR service into searcher/rerank