Organize issues (evaluation framework, LTR log preparation, first look at FM fitting)

tangwang
1 parent 41345271
Showing 18 changed files with 488 additions and 0 deletions
docs/image-dress2.png
docs/image-shoes1.png
docs/image-dress1.png -> docs/image-spu-specifications.png
docs/issue-2026-03-21-ES能力提升.md -> docs/issues/issue-2026-03-21-ES能力提升.md
docs/issue-2026-03-21-意图判断-done03-24.md -> docs/issues/issue-2026-03-21-意图判断-done03-24.md
docs/issue-2026-03-26-ES文本搜索-补充多模态knn放入should-done-0327.md -> docs/issues/issue-2026-03-26-ES文本搜索-补充多模态knn放入should-done-0327.md
docs/issue-2026-03-27-keywords限定-done-0327.txt -> docs/issues/issue-2026-03-27-keywords限定-done-0327.txt
docs/issue-2026-03-27-添加粗排精排-第2轮-done-0327.txt -> docs/issues/issue-2026-03-27-添加粗排精排-第2轮-done-0327.txt
docs/issue-2026-03-27-添加粗排精排-第3轮-done-0327.txt -> docs/issues/issue-2026-03-27-添加粗排精排-第3轮-done-0327.txt
docs/issue-2026-03-27-添加粗排精排-第一轮-done-0327.txt -> docs/issues/issue-2026-03-27-添加粗排精排-第一轮-done-0327.txt
docs/issue-2026-03-28-添加粗排精排-第4轮-done-0328.txt -> docs/issues/issue-2026-03-28-添加粗排精排-第4轮-done-0328.txt
docs/issue-2026-03-29-索引修改-done-0330.md -> docs/issues/issue-2026-03-29-索引修改-done-0330.md
docs/issue-2026-03-30-query分析性能优化-done-0331.md -> docs/issues/issue-2026-03-30-query分析性能优化-done-0331.md
docs/issue-2026-03-31-评估框架-done-0331.md -> docs/issues/issue-2026-03-31-评估框架-done-0331.md
docs/issues/issue-2026-04-01-评估框架-四级label-done-0402.md
docs/issues/issue-2026-04-02-LTR日志准备&FM拟合效果初探-done-0402.md
docs/issue.md -> docs/issues/issue.md
docs/issue.txt -> docs/issues/issue.txt
@@ -0,0 +1,335 @@
+_QUERY_INTENT_ANALYSIS_TEMPLATE_EN = """You are an intent analysis expert for a fashion e-commerce search system.
+
+Given a user's search query, analyze the shopping intent behind the query in the context of fashion and apparel e-commerce, and concisely summarize the user's core search need.
+Also provide the Chinese translation and English translation of the query.
+
+Requirements:
+- Keep the intent analysis concise and easy to understand, using 1 to 3 short sentences.
+- Stay grounded in the original query and summarize the user's likely shopping intent without adding unnecessary context.
+- When the query is vague or ambiguous, take a conservative approach and keep the analysis close to the original wording.
+- Chinese translation: if the original query is already in Chinese, keep it unchanged.
+- English translation: if the original query is already in English, keep it unchanged.
+- Do not output anything other than the required three-line format.
+
+Output format (strictly exactly three lines):
+Intent: concise analysis of the user's search intent
+Query中文翻译: Chinese translation of the query
+Query English translation: English translation of the query
+
+Now analyze the following query:
+
+Query: {query}
+"""
+
+_QUERY_INTENT_ANALYSIS_RESULT_TEMPLATE_ZH = """
+你是一个服装品类电商搜索意图分析专家。
+
+给定用户输入的搜索词，请在服装品类电商场景下，分析该搜索词背后的购物意图，并简要描述用户的核心搜索需求。
+同时，提供该搜索词的中文翻译和英文翻译。
+
+要求：
+- 意图分析应简洁易懂，用 1 到 3 句短句概括用户的搜索意图。
+- 结合 query 本身，尽量贴近用户原始搜索需求进行总结，不添加不必要的背景、延伸或臆测。
+- 如果 query 不够明确或有歧义，应保守处理，尽量保持与原词表达一致。
+- 中文翻译：如果原始 query 本身就是中文，则按原样输出。
+- 英文翻译：如果原始 query 本身就是英文，则按原样输出。
+- 除指定格式外，不要输出任何额外说明。
+
+输出格式（严格按三行输出）：
+Intent: 对用户搜索意图的简洁分析
+Query中文翻译: query 的中文翻译
+Query English translation: query 的英文翻译
+
+现在请分析以下搜索词：
+
+Query: {query}
+"""
+
+_CLASSIFY_TEMPLATE_EN = """You are a relevance judgment assistant for a fashion e-commerce search system.
+Given a user query and the information for each product, assign a relevance label to each product.
+
+Your goal is to judge relevance from the perspective of e-commerce search ranking.
+The key question is whether the user would view the product as the intended item, or as an acceptable substitute.
+
+## Relevance Labels
+
+### Exact Match
+The product satisfies the user’s core shopping intent: the core product type matches, and all explicitly stated key attributes in the query are supported by the product information, with no obvious conflict.
+
+Typical use cases:
+- The query contains only a product type, and the product is exactly that type.
+- The query contains “product type + attributes”, and the product matches both the type and all explicitly stated attributes.
+
+### High Relevant
+The product satisfies the user’s main intent: the core product type matches, but some explicitly requested attributes are missing from the product information, cannot be confirmed, or show minor / non-critical deviations. The product is still a good substitute for the user’s core need.
+
+Use “High Relevant” in the following cases:
+- The core product type matches, but some requested attributes are missing, not mentioned, or cannot be verified.
+- The core product type matches, but attributes such as color, material, style, fit, or length have minor deviations, as long as the deviation does not materially undermine the user’s main shopping intent.
+- The product is not the user’s ideal target, but in an e-commerce shopping context, it would still be considered an acceptable and strong substitute.
+
+Typical examples:
+- Query: “red slim-fit T-shirt”
+  Product: “women’s T-shirt”
+  → Color and fit cannot be confirmed.
+- Query: “red slim-fit T-shirt”
+  Product: “blue slim-fit T-shirt”
+  → Product type and fit match, but the color is different.
+
+Detailed case:
+- Query: “cotton long-sleeve shirt”
+- Product: “J.VER Men's Linen Shirt Casual Button Down Long Sleeve Solid Plain Collar Summer Beach Shirt with Pocket”
+
+Analysis:
+- Material mismatch: the query explicitly requires “cotton”, while the product is “linen”, so it cannot be labeled as “Exact Match”.
+- However, the core category still matches: both are long-sleeve shirts.
+- In e-commerce search, users may still click this item because the style and wearing scenario are similar.
+- Therefore, it should be labeled as “High Relevant”: not the exact target, but a good substitute.
+
+Detailed case:
+- Query: “black mid-length skirt”
+- Product: “New spring autumn loose slimming full long floral skirt pleated skirt”
+
+Analysis:
+- Category match: the product is a skirt, so the category matches.
+- Color mismatch: the product description does not indicate black and explicitly mentions “floral”, which is substantially different from plain black.
+- Length deviation: the user asks for “mid-length”, while the product title emphasizes “long skirt”, which is somewhat longer.
+- However, the core category “skirt” still matches, and style features such as “slimming” and “full skirt” may still fit some preferences of users searching for a mid-length skirt. Also, “long” versus “mid-length” is a deviation, but not a severe contradiction.
+- Therefore, this should be labeled as “High Relevant”: the core type matches, but there are several non-fatal attribute deviations.
+
+### Low Relevant
+The product has a noticeable gap from the user’s core target, but still shares some similarity with the query in style, scenario, function, or broader category. A small portion of users may still view it as a barely acceptable substitute. It is not the intended item, but still has some relevance.
+
+Use “Low Relevant” in the following cases:
+- The core product type does not match, but the two types are still very close in style, wearing scenario, or function, so there is still some substitutability.
+- The core product type matches, but the product differs from the user’s ideal target on multiple attributes; it still has some relevance, but is no longer a strong substitute.
+- An important query requirement is clearly violated, but the product still retains a limited reason to be clicked.
+
+Typical cases:
+- Query: “black mid-length skirt”
+  Product: “New high-waisted V-neck mid-length dress elegant printed black sexy dress”
+  → The core product type differs (“skirt” vs “dress”), but both belong to closely related apparel types and share a similar mid-length style, so it is “Low Relevant”.
+
+- Query: “jeans”
+  Product: “casual pants”
+  → The core product type is different, but both belong to the broader pants category, and the style / wearing scenario may still be close enough to be a weak substitute.
+
+### Irrelevant
+The product does not satisfy the user’s main shopping intent, and the likelihood of user engagement is very low.
+
+Typical situations:
+- The core product type does not match the query and is not a close substitute in style, scenario, or function.
+- The product belongs to a roughly related broader category, but not to an interchangeable subtype explicitly requested in the query, and the style or usage scenario differs significantly.
+- The core product type matches, but the product clearly violates an explicit and important requirement in the query, with little or no acceptable substitutability.
+
+Typical examples:
+
+1. **Different core product type with no substitutability**
+- Query: "pants", Product: "shoes"
+- Query: "boots", Product: "sneakers"
+
+2. **Close core product type but clear conflict on key attributes**
+- Query: "slim-fit pants", Product: "loose wide-leg pants"
+- Query: "sleeveless dress", Product: "long-sleeve dress"
+- Query: "loose-fitting hoodie", Product: "slim-fit T-shirt"
+
+3. **Same broad category but significant difference in style or usage scenario**
+- Query: "jeans", Product: "sweatpants / dress pants"
+
+## Judgment Principles
+
+1. **Product type is the highest-priority factor.**
+   If the query explicitly specifies a concrete product type, the result must match that product type in order to be labeled as “Exact Match” or “High Relevant”.
+   Different product types should usually be labeled as “Low Relevant” or “Irrelevant”.
+
+   - **Low Relevant**: use only when the two product types are very close in style, scenario, or function, and the user may still treat one as a barely acceptable substitute for the other.
+   - **Irrelevant**: all other product type mismatch cases.
+
+2. **Similar or related product types are usually not directly interchangeable when the query is explicit, but their closeness should determine whether the label is “Low Relevant” or “Irrelevant”.**
+   For example:
+   - **May be Low Relevant due to strong similarity in style / scenario**: dress vs skirt, long skirt vs mid-length skirt, jeans vs casual pants, sneakers vs skate shoes.
+   - **Should be Irrelevant due to substantial difference in style / scenario**: pants vs shoes, T-shirt vs hat, boots vs sneakers, jeans vs suit pants, backpack vs handbag.
+
+3. **Once the core product type matches, evaluate attributes.**
+   - All explicit attributes match → **Exact Match**
+   - Some attributes are missing, not mentioned, cannot be verified, or show only minor deviations → **High Relevant**
+   - There are multiple attribute deviations, or an important attribute is clearly violated, but the product still retains some substitutability → **Low Relevant**
+   - There is a clear and important hard conflict, and substitutability is extremely low → **Irrelevant**
+
+4. **Strictly distinguish among “not mentioned / cannot confirm”, “minor deviation”, and “explicit contradiction”.**
+   - If an attribute is not mentioned or cannot be verified, prefer **High Relevant**.
+   - If an attribute shows a minor deviation, such as different color, different material, or slightly different length, it should usually be labeled **High Relevant**.
+   - If an attribute is explicitly opposite to the query requirement, such as sleeveless vs long-sleeve or slim-fit vs loose wide-leg, decide between **Low Relevant** and **Irrelevant** based on the severity of the conflict and practical substitutability.
+   - If the conflict directly breaks the user’s main shopping goal, it should usually be labeled **Irrelevant**.
+
+5. **Substitutability should be judged from real shopping intent, not just surface-level textual similarity.**
+   The question is whether the user would realistically accept the product in a shopping scenario.
+   - Good substitute → **High Relevant**
+   - Barely acceptable substitute → **Low Relevant**
+   - Hardly substitutable at all → **Irrelevant**
+
+6. **When product information is insufficient, do not treat “cannot confirm” as “conflict”.**
+   If a product does not mention an attribute, that does not mean the attribute is definitely violated.
+   Therefore:
+   - If the attribute is not mentioned or cannot be confirmed, prefer **High Relevant**;
+   - Only treat it as a conflict when the product information clearly shows the opposite of the query requirement.
+
+Query: {query}
+{intent_suffix}
+
+Products:
+{lines}
+
+## Output Format
+Output exactly {n} lines.
+Each line must be exactly one of the following:
+Exact Match
+High Relevant
+Low Relevant
+Irrelevant
+
+Now, assign a relevance label to each result with respect to the query "{query}".
+The output lines must correspond to the products above in the same order.
+Do not output anything else.
+"""
+
+_CLASSIFY_TEMPLATE_ZH = """你是一个服饰电商搜索系统中的相关性判断助手。
+给定用户查询词以及每个商品的信息，请为每个商品分配一个相关性标签。
+
+你的目标是从电商搜索排序的角度，判断商品是否满足用户的购物意图。
+判断时应优先考虑“用户是否会把该商品视为目标商品，或可接受的替代品”。
+
+## 相关性标签
+
+### 完全相关
+商品满足用户的核心购物意图：核心商品类型匹配，且查询中所有明确提及的关键属性均有商品信息支持。
+
+典型适用场景：
+- 查询仅包含商品类型，商品即为该类型。
+- 查询包含“商品类型 + 属性”，商品在类型及所有明确属性上均符合。
+
+### 基本相关
+商品满足用户的主要意图：核心商品类型匹配，但查询中明确提出的部分要求未在商品信息中体现、无法确认，或存在轻微偏差 / 非关键偏差。该商品仍是满足用户核心需求的良好替代品。
+
+在以下情况使用“基本相关”：
+- 核心商品类型匹配，但部分属性缺失、未提及或无法确认。
+- 核心商品类型匹配，但颜色、材质、风格、版型、长度等属性存在轻微偏差，只要这种偏差不会明显破坏用户的主要购买意图。
+- 商品不是用户最理想的目标，但在电商购物场景下仍可能被视为可接受、且较优的替代品。
+
+典型情况：
+- 查询：“红色修身T恤”，商品：“女士T恤”
+  → 颜色、版型无法确认。
+- 查询：“红色修身T恤”，商品：“蓝色修身T恤”
+  → 商品类型和版型匹配，但颜色不同。
+
+详细案例：
+- 查询：“棉质长袖衬衫”
+- 商品：“J.VER男式亚麻衬衫休闲纽扣长袖衬衫纯色平领夏季沙滩衬衫带口袋”
+
+分析：
+- 材质不符：Query 明确指定“棉质”，而商品为“亚麻”，因此不能判为“完全相关”。
+- 但核心品类仍然匹配：两者都是“长袖衬衫”。
+- 在电商搜索中，用户仍可能因为款式、穿着场景相近而点击该商品。
+- 因此应判为“基本相关”，即“非精确目标，但属于良好替代品”。
+
+详细案例：
+- 查询：“黑色中长半身裙”
+- 商品：“春秋季新款宽松显瘦大摆长裙碎花半身裙褶皱设计裙”
+
+分析：
+- 品类匹配：商品是“半身裙”，品类符合。
+- 颜色不匹配：商品描述未提及黑色，且明确包含“碎花”，与纯黑差异较大。
+- 长度存在偏差：用户要求“中长”，而商品标题强调“长裙”，长度偏长。
+- 但核心品类“半身裙”匹配，“显瘦”“大摆”等风格特征仍可能符合部分搜索“中长半身裙”用户的潜在偏好；同时“长裙”和“中长”虽有偏差，但不构成严重对立。
+- 因此应判为“基本相关”：核心品类匹配，但存在若干非致命属性偏差。
+
+### 弱相关
+商品与用户的核心目标存在明显差距，但仍与查询在风格、场景、功能或大类上具有一定相似性，可能被少量用户视为勉强可接受的替代品。属于“非目标，但仍有一定关联”。
+
+在以下情况使用“弱相关”：
+- 核心商品类型不一致，但两者在风格、穿着场景或功能上非常接近，仍具有一定替代性。
+- 核心商品类型匹配，但在多个属性上与用户理想目标差距较大，虽仍有一定关联性，但已不是高质量替代品。
+- 查询要求中的某个重要属性被明显违背，但商品仍保留少量被点击的理由。
+
+典型情况：
+- 查询：“黑色中长半身裙”，商品：“新款高腰V领中长款连衣裙 优雅印花黑色性感连衣裙”
+  → 核心商品类型“半身裙”与“连衣裙”不同，但两者同属裙装，且款式上均为“中长款”，在穿搭场景上接近，因此属于“弱相关”。
+
+- 查询：“牛仔裤”，商品：“休闲裤”
+  → 核心商品类型不同，但同属裤装大类，风格和穿着场景可能接近，可作为较弱替代品。
+
+### 不相关
+商品未满足用户的主要购物意图，用户点击动机极低。
+
+主要表现为以下情形之一：
+- 核心商品类型与查询不匹配，且不属于风格 / 场景 / 功能接近的可替代品。
+- 商品虽属于大致相关的大类，但与查询明确指定的具体子类不可互换，且风格或场景差异大。
+- 核心商品类型匹配，但商品明显违背了查询中一个明确且重要的要求，且几乎不具备可接受的替代性。
+
+典型情况：
+
+**1. 核心品类不同，且无替代性**
+- 查询：“裤子”，商品：“鞋子”
+- 查询：“靴子”，商品：“运动鞋”
+
+**2. 核心品类相近，但关键属性明显冲突**
+- 查询：“紧身裤”，商品：“阔腿裤”
+- 查询：“无袖连衣裙”，商品：“长袖连衣裙”
+- 查询：“宽松卫衣”，商品：“修身T恤”
+
+**3. 核心品类同属大类，但风格、场景差异巨大**
+- 查询：“牛仔裤”，商品：“运动裤 / 西裤”
+
+## 判断原则
+
+1. **商品类型是最高优先级因素。**
+   如果查询明确指定了具体商品类型，那么结果必须匹配该商品类型，才可能判为“完全相关”或“基本相关”。
+   不同商品类型通常应判为“弱相关”或“不相关”。
+
+   - **弱相关**：仅当两种商品类型在风格、场景、功能上非常接近，用户有一定概率将其视为勉强可接受的替代品时使用。
+   - **不相关**：其他所有商品类型不匹配的情况。
+
+2. **相似或相关的商品类型，在查询明确时通常不可直接互换，但要根据接近程度区分“弱相关”与“不相关”。**
+   例如：
+   - **风格 / 场景高度接近，可判为弱相关**：连衣裙 vs 半身裙、长裙 vs 中长裙、牛仔裤 vs 休闲裤、运动鞋 vs 板鞋。
+   - **风格 / 场景差异大，应判为不相关**：裤子 vs 鞋子、T恤 vs 帽子、靴子 vs 运动鞋、牛仔裤 vs 西装裤、双肩包 vs 手提包。
+
+3. **当核心商品类型匹配后，再评估属性。**
+   - 所有明确属性都匹配 → **完全相关**
+   - 部分属性缺失、未提及、无法确认，或存在轻微偏差 → **基本相关**
+   - 存在多个属性偏差，或某个重要属性被明显违背，但商品仍保留一定替代性 → **弱相关**
+   - 存在明确且重要的强冲突，且替代性极低 → **不相关**
+
+4. **要严格区分“未提及 / 无法确认”“轻微偏差”“明确冲突”。**
+   - 如果某属性没有提及，或无法验证，优先判为“基本相关”。
+   - 如果某属性存在轻微偏差，如颜色不同、材质不同、长度略有差异，通常判为“基本相关”。
+   - 如果某属性与查询要求明确相反，如无袖 vs 长袖、修身 vs 宽松阔腿，则要根据冲突严重性与替代性，在“弱相关”与“不相关”之间判断。
+   - 若该冲突会直接破坏用户的主要购买目标，通常判为“不相关”。
+
+5. **“是否可替代”应从真实电商购物意图出发判断。**
+   不是只看字面相似，而要看用户在购物场景下是否可能接受该商品。
+   - 良好替代品 → **基本相关**
+   - 勉强替代品 → **弱相关**
+   - 几乎不可替代 → **不相关**
+
+6. **若商品信息不足，不要把“无法确认”误判为“冲突”。**
+   商品未写明某属性，不等于该属性一定不符合。
+   因此：
+   - 未提及 / 无法确认，优先按“基本相关”处理；
+   - 只有当商品信息明确显示与查询要求相反时，才视为属性冲突。
+
+查询：{query}
+{intent_suffix}
+
+商品：
+{lines}
+
+## 输出格式
+严格输出 {n} 行，每行只能是以下四者之一：
+完全相关
+基本相关
+弱相关
+不相关
+
+现在请根据 query“{query}”，为每个结果标注相关性标签。输出行必须与上方商品顺序一一对应，不要输出任何其他内容。
+"""
 \ No newline at end of file
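Both classification templates require exactly `{n}` output lines, each one of four fixed labels, so the caller can validate the response strictly. A minimal sketch of such a check follows. The label strings are taken from the templates above; the helper name `parse_labels` and the 3-to-0 numeric grades are assumptions for illustration, not something this commit shows.

```python
# Hypothetical helper: label strings come from the templates above,
# but the numeric grades (3..0) are an assumed mapping.
LABEL_TO_GRADE = {
    "Exact Match": 3,
    "High Relevant": 2,
    "Low Relevant": 1,
    "Irrelevant": 0,
    "完全相关": 3,
    "基本相关": 2,
    "弱相关": 1,
    "不相关": 0,
}


def parse_labels(raw_output: str, n: int) -> list[int]:
    """Map an n-line model response to integer grades, failing loudly on drift."""
    lines = [line.strip() for line in raw_output.strip().splitlines() if line.strip()]
    if len(lines) != n:
        raise ValueError(f"expected {n} labels, got {len(lines)}")
    grades = []
    for i, line in enumerate(lines):
        if line not in LABEL_TO_GRADE:
            raise ValueError(f"line {i + 1}: unexpected label {line!r}")
        grades.append(LABEL_TO_GRADE[line])
    return grades
```

Rejecting a response with the wrong line count, rather than padding or truncating it, keeps labels from silently shifting onto the wrong products.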
@@ -0,0 +1,153 @@
+1. Change the primary metrics to:
+NDCG@20 + NDCG@50 + ERR@10 + Strong_Precision@10 + Strong_Precision@20 + Useful_Precision@50 + Avg_Grade@10 + Gain_Recall@20
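For reference, the two standard components of this composite, NDCG@k and ERR@k, can be sketched as below for 4-level grades (0 to 3). This uses the textbook definitions with exponential gain; whether the evaluation framework uses the same gain mapping, or computes the ideal ranking over the same list, is an assumption. The custom metrics (`Strong_Precision`, `Useful_Precision`, `Avg_Grade`, `Gain_Recall`) are not defined in this commit and are left out.

```python
import math


def dcg_at_k(grades, k):
    """DCG with exponential gain 2^g - 1 and log2 position discount."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades[:k]))


def ndcg_at_k(grades, k):
    """NDCG@k; the ideal ordering is taken over the same judged list (assumption)."""
    ideal = dcg_at_k(sorted(grades, reverse=True), k)
    return dcg_at_k(grades, k) / ideal if ideal > 0 else 0.0


def err_at_k(grades, k, max_grade=3):
    """Expected Reciprocal Rank: cascade model, stop probability (2^g - 1) / 2^max_grade."""
    err, p_continue = 0.0, 1.0
    for rank, g in enumerate(grades[:k], start=1):
        p_stop = (2 ** g - 1) / (2 ** max_grade)
        err += p_continue * p_stop / rank
        p_continue *= 1.0 - p_stop
    return err
```

`grades` is the list of label grades in ranked order for one query; per-query values would then be averaged across queries.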
+
+2. Improve the logging, read the code to understand the ranking funnel, and identify the important signals, in preparation for LTR.
+
+3. Then tune parameters (the latest report, batch_20260402T054003Z_00b6a8aa3d.md, is the baseline).
+Tune only the rerank fusion formula (config.yaml):
+  fusion:
+    es_bias: 10.0
+    es_exponent: 0.05
+    rerank_bias: 0.1
+    rerank_exponent: 1.15
+    fine_bias: 0.1
+    fine_exponent: 1.0
+    text_bias: 0.1
+    text_exponent: 0.25
+    text_translation_weight: 0.8
+    knn_text_weight: 1.0
+    knn_image_weight: 2.0
+    knn_tie_breaker: 0.3
+    knn_bias: 0.6
+    knn_exponent: 0.4
+
+Main directions to consider:
+Adjust the biases (increase them)
+Increase knn_image_weight
+Increase knn_exponent
+Increase es_exponent and decrease text_exponent
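The fusion formula itself does not appear in this commit. Purely to illustrate how a bias and an exponent interact, here is a sketch that assumes a multiplicative form, the product of `(score + bias) ** exponent` over the signals. That form, the function name `fuse`, and the omission of the `knn_*_weight`, `text_translation_weight`, and `knn_tie_breaker` parameters are all assumptions; only the parameter values are copied from the config above.

```python
def fuse(scores: dict, params: dict) -> float:
    """Multiply (score + bias) ** exponent over the signals; assumed form, not confirmed."""
    fused = 1.0
    for name, score in scores.items():
        fused *= (max(score, 0.0) + params[f"{name}_bias"]) ** params[f"{name}_exponent"]
    return fused


# Values copied from the fusion block above; the weight parameters are left out
# because this commit does not show how they enter the formula.
PARAMS = {
    "es_bias": 10.0, "es_exponent": 0.05,
    "rerank_bias": 0.1, "rerank_exponent": 1.15,
    "fine_bias": 0.1, "fine_exponent": 1.0,
    "text_bias": 0.1, "text_exponent": 0.25,
    "knn_bias": 0.6, "knn_exponent": 0.4,
}

# Ratio between a strong and a weak rerank score, before and after raising the bias.
ratio_small_bias = fuse({"rerank": 0.9}, PARAMS) / fuse({"rerank": 0.1}, PARAMS)
raised = dict(PARAMS, rerank_bias=1.0)
ratio_large_bias = fuse({"rerank": 0.9}, raised) / fuse({"rerank": 0.1}, raised)
```

Under this assumed form, raising `rerank_bias` from 0.1 to 1.0 shrinks the ratio from roughly 6.4 to roughly 1.9: a larger bias flattens a signal's influence, while a larger exponent sharpens it.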
+
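+4. Pick a few key queries, study their logs closely, and optimize based on your understanding. This includes, but is not limited to, tuning how many results the text and image knn recalls return.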
+tassel maxi skirt
+minimalist top
+sock boots (many highly relevant items are not recalled)
+图案连身衣
+
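+Throughout, take a higher-level view of the ranking system. Redesign and optimize anything that looks unreasonable, and capture the accumulated experience and workflow in documentation.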
+
+
+
+
+
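+Ranking logging has been strengthened in rerank_client.py and searcher.py. They now output retrieval_plan, ltr_summary, and doc/stage-level ltr_features, which makes the funnel directly inspectable and prepares for LTR. See functions such as _build_ltr_feature_block and _summarize_ltr_features. Do a first round of offline LTR fitting (validated with an offline script).
+The approach: inspect and parse the feature log. ./logs/backend_verbose.log contains exactly one round of feature logs, from the latest run of the evaluation framework. Think carefully about the ranking system, look at the data, extract clues, and do a round of feature engineering. Labels have 4 levels; see [framework.py](scripts/evaluation/eval_framework/framework.py). Use RankNet Pairwise Loss (no deltaNDCG weighting; keep it simple for now. Within a query, any two documents with different labels form one valid pair, which becomes a binary classification sample. Please find a simple RankNet Pairwise Loss implementation). Use FM as the fitting function (with some simple feature engineering, for example applying several nonlinear maps such as pow and log to each numeric feature to fan it out into multiple features).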
+
+
+
+
+
+Request:
+[检索调参与LTR工作流.md](docs/检索调参与LTR工作流.md) 
+Ranking logging has been enhanced in `rerank_client.py` and `searcher.py`, which now output `retrieval_plan`, `ltr_summary`, and doc/stage-level `ltr_features`. This makes the funnel easy to inspect and prepares for LTR. Refer to functions such as `_build_ltr_feature_block` and `_summarize_ltr_features`. Perform an offline LTR fitting task (offline script validation).
+
+The idea is to inspect and parse the feature logs in `./logs/backend_verbose.log`, which contains only the latest round of feature logs from the evaluation framework. Carefully consider the ranking system, examine the data, extract insights, and perform feature engineering. The labels have 4 levels; refer to [framework.py](scripts/evaluation/eval_framework/framework.py). Use the RankNet Pairwise Loss (no deltaNDCG weighting needed; keep it simple at first. Within a query, every two items with different labels form one valid pair, which serves as a binary classification sample. Find a simple implementation of the RankNet Pairwise Loss). Consider FM as the fitting function (with some simple feature engineering, such as applying several nonlinear transformations like pow and log to each numeric feature to expand it into multiple features).
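The pair construction and loss described in this request can be sketched in a few lines. This is a standard-library illustration of the unweighted RankNet pairwise logistic loss, not the code in `offline_ltr_fit.py`; the function names and the shape of `labels_by_query` (query mapped to a list of grades) are hypothetical.

```python
import math
from itertools import combinations


def make_pairs(labels_by_query):
    """Return (query, better_idx, worse_idx) triples; equal-label pairs are skipped."""
    pairs = []
    for query, labels in labels_by_query.items():
        for i, j in combinations(range(len(labels)), 2):
            if labels[i] > labels[j]:
                pairs.append((query, i, j))
            elif labels[j] > labels[i]:
                pairs.append((query, j, i))
    return pairs


def ranknet_loss(score_pos, score_neg):
    """-log(sigmoid(s_pos - s_neg)), written as a numerically stable softplus."""
    diff = score_pos - score_neg
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def ranknet_grad(score_pos, score_neg):
    """d(loss)/d(score_pos); the gradient w.r.t. score_neg is its negation."""
    diff = score_pos - score_neg
    if diff >= 0:
        e = math.exp(-diff)
        return -e / (1.0 + e)
    return -1.0 / (1.0 + math.exp(diff))
```

Pairs are formed only within a query, never across queries, so the model learns relative order among candidates for the same search rather than absolute relevance.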
+
+
+
+Answer:
+Implemented an offline LTR fitter in [offline_ltr_fit.py](/data/saas-search/scripts/evaluation/offline_ltr_fit.py) that:
+
+- parses `logs/backend_verbose.log` `search_response` events
+- joins per-query `spu_id` rows with 4-level labels from `artifacts/search_evaluation/search_eval.sqlite3`
+- builds engineered features from `ltr_features`, stage scores, ranks, and nonlinear transforms like `log1p`, `sqrt`, square, inverse, plus a few ratio/cross features
+- trains an FM with RankNet-style pairwise logistic loss
+- runs query-group 5-fold offline validation
+- saves the fitted model and predictions as artifacts
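To make the feature-expansion and FM steps concrete, here is a minimal standard-library sketch. The `__log1p` and `__square` suffixes follow feature names that appear in this log; the `__sqrt` and `__inverse` suffixes, the `1 / (1 + v)` definition of inverse, and both function names are assumptions rather than the script's actual code.

```python
import math


def expand_numeric(name, value):
    """Expand one non-negative numeric feature into several nonlinear views."""
    v = max(float(value), 0.0)
    return {
        name: v,
        f"{name}__log1p": math.log1p(v),
        f"{name}__sqrt": math.sqrt(v),
        f"{name}__square": v * v,
        f"{name}__inverse": 1.0 / (1.0 + v),  # assumed definition of "inverse"
    }


def fm_score(x, w0, w, V):
    """Second-order FM: w0 + sum_i w_i x_i + sum_{i<j} <V_i, V_j> x_i x_j."""
    linear = sum(wi * xi for wi, xi in zip(w, x))
    interaction = 0.0
    for f in range(len(V[0])):
        # Standard identity: sum_{i<j} a_i a_j = 0.5 * ((sum a_i)^2 - sum a_i^2)
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        interaction += 0.5 * (s * s - s_sq)
    return w0 + linear + interaction
```

Here `x` is the expanded feature vector for one query-doc row, `w` the first-order weights, and `V` one latent vector per feature. The per-factor identity computes all pairwise interactions in time linear in the number of features.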
+
+I validated it end-to-end with:
+```bash
+./.venv/bin/python scripts/evaluation/offline_ltr_fit.py
+```
+
+Latest run artifacts are in [summary.json](/data/saas-search/artifacts/search_evaluation/ltr_runs/offline_ltr_20260402T074937Z/summary.json) and [fm_ranknet_model.json](/data/saas-search/artifacts/search_evaluation/ltr_runs/offline_ltr_20260402T074937Z/fm_ranknet_model.json).
+
+Key result: cross-validated `Primary_Metric_Score` improved from `0.641844` for current fused ranking to `0.654043` for the FM model. On the full-data refit, the model reaches `0.752009`, which is optimistic in-sample but shows the signal is learnable. The training set used `54` queries, `5400` docs, and `154592` valid label-different pairs.
+
+
+
+
+Request:
+1. The essence of FM is its second-order interaction weights. Please output human-readable feature names (including interaction features) with their corresponding weights, sorted by importance. You can decide how to measure importance, but keep it simple; if there is no better method, use the absolute value of the weight.
+2. Split off 10 results as a test set to measure generalization performance and determine whether overfitting has occurred.
+
+
+Answer:
+Updated [offline_ltr_fit.py](/data/saas-search/scripts/evaluation/offline_ltr_fit.py) to do both:
+
+- export FM feature importance
+- reserve a dedicated 10-query holdout test set for generalization
+
+I interpreted “10 results as a test set” as `10 queries`, because ranking generalization should be measured at the query-group level, not by splitting docs from the same query across train/test.
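A query-level holdout of this kind can be sketched as follows. The row layout (dicts with a `"query"` key), the function name, and the seeded random choice are assumptions for illustration; how `offline_ltr_fit.py` actually selects its 10 queries is not shown here.

```python
import random


def split_by_query(rows, n_holdout=10, seed=0):
    """Hold out whole queries so no query contributes docs to both sides."""
    queries = sorted({row["query"] for row in rows})
    rng = random.Random(seed)
    holdout = set(rng.sample(queries, n_holdout))
    train = [row for row in rows if row["query"] not in holdout]
    test = [row for row in rows if row["query"] in holdout]
    return train, test
```

Splitting by document instead would put docs from the same query on both sides and overstate generalization, since pairwise training only ever compares docs within a query.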
+
+Artifacts from the latest run are under:
+- [summary.json](/data/saas-search/artifacts/search_evaluation/ltr_runs/offline_ltr_20260402T080116Z/summary.json)
+- [feature_importance_linear.csv](/data/saas-search/artifacts/search_evaluation/ltr_runs/offline_ltr_20260402T080116Z/feature_importance_linear.csv)
+- [feature_importance_interactions.csv](/data/saas-search/artifacts/search_evaluation/ltr_runs/offline_ltr_20260402T080116Z/feature_importance_interactions.csv)
+- [fm_ranknet_model.json](/data/saas-search/artifacts/search_evaluation/ltr_runs/offline_ltr_20260402T080116Z/fm_ranknet_model.json)
+
+Key results:
+- Cross-validation `Primary_Metric_Score`: `0.654043`
+- Current fused baseline: `0.641844`
+- 10-query holdout FM: `0.53056`
+- 10-query holdout current fused baseline: `0.5674`
+
+So the FM looks promising in CV, but the 10-query holdout is worse than the current baseline, which is a real overfitting/generalization warning.
+
+Top linear features by absolute weight:
+- `text_knn_score__square` → `0.9936`
+- `knn_primary_score__square` → `0.9529`
+- `has_translation_match` → `-0.9309`
+- `knn_score__square` → `0.9224`
+- `text_support_score__square` → `0.8481`
+
+Top interaction features by absolute weight:
+- `text_score_fallback_to_es * initial_rank__log` → `0.7099`
+- `text_support_score__log1p * initial_rank__log` → `0.6814`
+- `text_knn_score__square * initial_rank__log` → `-0.6419`
+- `has_text_knn * initial_rank__log` → `0.5879`
+- `translation_share * source_share` → `0.5509`
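In an FM, the effective weight on a feature pair is the inner product of the two features' latent vectors, so an interaction ranking like the one above can be derived as sketched below. The function name and argument layout are hypothetical; the sort key (absolute weight) matches the importance measure named in the request.

```python
from itertools import combinations


def interaction_importance(names, V, top=5):
    """Rank feature pairs by |<V_i, V_j>|, the effective second-order FM weight."""
    rows = []
    for i, j in combinations(range(len(names)), 2):
        weight = sum(a * b for a, b in zip(V[i], V[j]))
        rows.append((f"{names[i]} * {names[j]}", weight))
    rows.sort(key=lambda r: abs(r[1]), reverse=True)
    return rows[:top]
```

Absolute weights are comparable across features only when the inputs are on similar scales, so this ranking is most meaningful on standardized features.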
+
+The 10 holdout queries were:
+- `date night dress`
+- `skiing trip insulated base layer`
+- `streetwear t-shirt`
+- `swim dress`
+- `tall slim fit trousers`
+- `tassel maxi skirt`
+- `чёрное летнее платье`
+- `فستان أسود متوسط الطول`
+- `图案连身衣`
+- `宽松开襟羊毛衫`
+
+Next useful step is pruning the feature space and reducing the heavy dependence on rank-derived interactions like `initial_rank__log`, then rerunning the same holdout test.
+
+
+
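+The LTR material has been systematically added to [检索调参与LTR工作流.md](/data/saas-search/docs/检索调参与LTR工作流.md).
+
+The document has been expanded into a version better suited to a formal project kickoff, with additions in these areas:
+
+- The LTR data loop: logged features, offline labels, sample construction, training, validation, and the pre-launch decision
+- How the current log structure supports LTR: what `retrieval_plan`, `ltr_features`, and `ranking_funnel` each provide
+- Sample definition: why the base unit is `query-doc`, converted to pairs only at training time
+- Label sourcing and limits: `search_eval.sqlite3`, the semantics of the 4 label levels, and the scope and risks of the current label scheme
+- Feature engineering approach: raw features, nonlinear transforms, ratio features, and hand-built combination features ahead of the FM crosses
+- Model and loss choice: why the first version uses FM + RankNet Pairwise Loss
+- Reading the current offline results: the CV gain, the drop on the 10-query holdout, and what they imply
+- Interpreting FM weights: first-order weights, second-order interaction weights, and how to use importance to guide feature pruning
+- Suggested roadmap for a formal LTR project: locking down data assets, feature ablation, model upgrades, and rollout cadence
+
+If you like, as a next step I can push this document further toward an executable plan by adding sections on:
+- `Training sample on-disk format design`
+- `Online/offline feature consistency constraints`
+- `Rollout plan for integrating the LTR service into searcher/rerank`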