# scripts/evaluation/eval_framework/prompts.py

"""LLM prompt templates for relevance judging (keep wording changes here)."""
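from __future__ import annotations

import json
from typing import Any, Dict, Sequence


# Three-label batch prompt (Exact / Partial / Irrelevant).
# Placeholders filled by str.format: query, lines, n.
_CLASSIFY_BATCH_SIMPLE_TEMPLATE = """You are a relevance evaluation assistant for an apparel e-commerce search system.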
Given the user query and each product's information, assign one relevance label to each product.
## Relevance Labels
### Exact
The product fully satisfies the user's search intent: the core product type matches, and all explicitly stated key attributes are supported by the product information.
Typical use cases:
- The query contains only a product type, and the product is exactly that type.
- The query contains product type + attributes, and the product matches both the type and all explicitly stated attributes.
### Partial
The product satisfies the user's primary intent because the core product type matches, but some explicit requirements in the query are missing, cannot be confirmed, or deviate from the query. Despite the mismatch, the product can still be considered a non-target but acceptable substitute.
Use Partial when:
- The core product type matches, but some requested attributes cannot be confirmed.
- The core product type matches, but some secondary requirements deviate or are inconsistent.
- The product is not the ideal target, but it is still a plausible and acceptable substitute for the shopper.
Typical cases:
- Query: "red fitted t-shirt", product: "Women's T-Shirt" → color/fit cannot be confirmed.
- Query: "red fitted t-shirt", product: "Blue Fitted T-Shirt" → product type and fit match, but color differs.
Detailed example:
- Query: "cotton long sleeve shirt"
- Product: "J.VER Men's Linen Shirts Casual Button Down Long Sleeve Shirt Solid Spread Collar Summer Beach Shirts with Pocket"
Analysis:
- Material mismatch: the query explicitly requires cotton, while the product is linen, so it cannot be Exact.
- However, the core product type still matches: both are long sleeve shirts.
- In an e-commerce setting, the shopper may still consider clicking this item because the style and use case are similar.
- Therefore, it should be labeled Partial as a non-target but acceptable substitute.
### Irrelevant
The product does not satisfy the user's main shopping intent.
Use Irrelevant when:
- The core product type does not match the query.
- The product belongs to a broadly related category, but not the specific product subtype requested, and shoppers would not consider them interchangeable.
- The core product type matches, but the product clearly contradicts an explicit and important requirement in the query.
Typical cases:
- Query: "pants", product: "shoes" → wrong product type.
- Query: "dress", product: "skirt" → different product type.
- Query: "fitted pants", product: "loose wide-leg pants" → explicit contradiction on fit.
- Query: "sleeveless dress", product: "long sleeve dress" → explicit contradiction on sleeve style.
This label emphasizes clarity of user intent. When the query specifies a concrete product type or an important attribute, products that conflict with that intent should be judged Irrelevant even if they are related at a higher category level.
## Decision Principles
1. Product type is the highest-priority factor.
   If the query clearly specifies a concrete product type, the result must match that product type to be Exact or Partial.
   A different product type is usually Irrelevant, not Partial.
2. Similar or related product types are not interchangeable when the query is specific.
   For example:
   - dress vs skirt vs jumpsuit
   - jeans vs pants
   - t-shirt vs blouse
   - cardigan vs sweater
   - boots vs shoes
   - bra vs top
   - backpack vs bag
   If the user explicitly searched for one of these, the others should usually be judged Irrelevant.
3. If the core product type matches, then evaluate attributes.
   - If all explicit attributes match → Exact
   - If some attributes are missing, uncertain, or partially mismatched, but the item is still an acceptable substitute → Partial
   - If an explicit and important attribute is clearly contradicted, and the item is not a reasonable substitute → Irrelevant
4. Distinguish carefully between "not mentioned" and "contradicted".
   - If an attribute is not mentioned or cannot be verified, prefer Partial.
   - If an attribute is explicitly opposite to the query, use Irrelevant unless the item is still reasonably acceptable as a substitute under the shopping context.
Query: {query}
Products:
{lines}
## Output Format
Strictly output {n} lines, each line containing exactly one of:
Exact
Partial
Irrelevant
The lines must correspond sequentially to the products above.
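Do not output any other information.
"""


# Chinese variant that defines four labels (完全相关 / 基本相关 / 弱相关 / 不相关).
# It is not referenced by the prompt builders in this file.
_CLASSIFY_BATCH_SIMPLE_TEMPLATE_ZH = """你是一个服饰电商搜索系统中的相关性判断助手。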
给定用户查询词以及每个商品的信息，请为每个商品分配一个相关性标签。
## 相关性标签
### 完全相关
核心产品类型匹配，所有明确提及的关键属性均有产品信息支撑。
典型适用场景：
- 查询仅包含产品类型，产品即为该类型。
- 查询包含“产品类型 + 属性”，产品在类型及所有明确属性上均符合。
### 基本相关 (High Relevant)
产品满足用户的主要意图（核心产品类型匹配），但查询中明确的部分要求未在产品信息中体现、无法确认，或存在并不严重冲突的偏差。该商品是满足用户核心需求的良好替代品。
在以下情况使用基本相关：
- 核心产品类型匹配，但部分请求的属性在商品信息中缺失、未提及或无法确认。
- 核心产品类型匹配，但材质、版型、风格等次要要求存在偏差或不一致。
- 商品不是用户最理想的目标，但从电商购物角度看，仍可能被用户视为可接受的替代品。
典型情况：
- 查询：“红色修身T恤”，产品：“女士T恤” → 颜色/版型无法确认。
- 查询：“红色修身T恤”，产品：“蓝色修身T恤” → 产品类型和版型匹配，但颜色不同。
详细案例：
- 查询：“棉质长袖衬衫”
- 商品：“J.VER男式亚麻衬衫休闲纽扣长袖衬衫纯色平领夏季沙滩衬衫带口袋”
分析：
- 材质不符：查询明确指定“棉质”，而商品为“亚麻”，因此不能判为完全相关。
- 但核心品类仍然匹配：两者都是“长袖衬衫”。
- 在电商搜索中，用户仍可能因为款式、穿着场景相近而点击该商品。
- 因此应判为基本相关：核心品类匹配，仅材质存在偏差，仍是可接受的替代品。
详细案例：
- 查询：“黑色中长半身裙”
- 商品：“春秋季新款宽松显瘦大摆长裙碎花半身裙褶皱设计裙”
分析：
- 品类匹配：产品是“半身裙”，品类符合。
- 颜色不匹配：产品描述未提及黑色且明确包含“碎花”（floral），花色与纯黑差异较大。
- 长度存在偏差：用户要求“中长”，而产品标题强调“长裙”（Long Skirt），长度偏长。
- 结论：核心品类“半身裙”匹配；“显瘦”“大摆”等风格可能符合部分搜索“中长半身裙”用户的潜在偏好，且“长裙”与“中长”无严重矛盾。属性偏差不严重，应判为“基本相关”。
### 弱相关 (Low Relevant)
产品与用户的核心意图存在差距，但仍可能因风格、场景或功能上的相似性而被用户接受，属于“非目标但可接受”的替代品。主要表现为以下情形之一：
- 核心产品类型有差异，但风格、穿着场景或功能非常接近，如查询“黑色中长半身裙”，商品为“连衣裙”（同属裙装大类，款式相似）。
- 核心产品类型有差异，但在购物场景下属于相近品类，可勉强替代，如查询“牛仔裤”，商品为“休闲裤”（均为裤子大类，风格可能相近）。
- 核心产品类型匹配，但产品在多个非关键属性上存在偏差，导致与用户理想目标差距较大，但仍保留一定关联性。
典型情况：
- 查询：“黑色中长半身裙”，产品：“新款高腰V领中长款连衣裙 优雅印花黑色性感连衣裙” → 核心产品类型“半身裙”与“连衣裙”有差异，但两者同属裙装大类且款式上均为“中长款”，具有相似性。
### 不相关 (Irrelevant)
产品未满足用户的主要购物意图，用户点击动机极低。主要表现为以下情形之一：
- 核心产品类型与查询不匹配，且不属于风格/场景相近的替代品。
- 产品虽属大致相关的大类，但与查询指定的具体子类不可互换，且风格/场景差异大。
- 核心产品类型匹配，但产品明显违背了查询中一个明确且重要的属性要求，且不存在可接受的理由。
典型情况：
- 查询：“裤子”，产品：“鞋子” → 产品类型错误。
- 查询：“修身裤”，产品：“宽松阔腿裤” → 与版型要求明显冲突，替代性极低。
- 查询：“无袖连衣裙”，产品：“长袖连衣裙” → 与袖型要求明显冲突。
- 查询：“牛仔裤”，产品：“运动裤” → 核心品类不同（牛仔裤 vs 运动裤），风格和场景差异大。
- 查询：“靴子”，产品：“运动鞋” → 核心品类不同，功能和适用场景差异大。
## 判断原则
1.  **产品类型是最高优先级因素。**
    如果查询明确指定了具体产品类型，那么结果必须匹配该产品类型，才可能判为“完全相关”或“基本相关”。不同产品类型通常应判为“弱相关”或“不相关”。
    -   **弱相关**：仅当两种产品类型风格、场景、功能非常接近，可能被视为可接受的替代品时使用。
    -   **不相关**：其他所有产品类型不匹配的情况。
2.  **相似或相关的产品类型，在查询明确时通常不可互换，但需根据接近程度区分。**
    例如：
    -   **风格/场景高度接近，可判为弱相关**：连衣裙 vs 半身裙、长裙 vs 中长裙、牛仔裤 vs 休闲裤、运动鞋 vs 板鞋。
    -   **风格/场景差异大，判为不相关**：裤子 vs 鞋子、T恤 vs 帽子、靴子 vs 运动鞋、牛仔裤 vs 西装裤、双肩包 vs 手提包。
    如果用户明确搜索其中一种，其他类型是否可接受取决于其风格、场景的接近程度。
3.  **当核心产品类型匹配后，再评估属性。**
    -   所有明确属性都匹配 → **完全相关**
    -   部分属性缺失、无法确认，或存在较小偏差 → **基本相关**
    -   明确且重要的属性被明显违背（如修身 vs 宽松），但核心品类仍匹配 → **弱相关** 或 **不相关**。
        -   **弱相关**：属性明显违背，但存在可被用户接受的微弱理由（如版型虽不同但风格类似）。
        -   **不相关**：属性明显违背，且替代性极低，用户无点击动机（如修身 vs 宽松阔腿裤）。
4.  **要严格区分“未提及/无法确认”、“较小偏差”、“明确冲突”。**
    -   如果某属性没有提及，或无法验证，优先判为“**基本相关**”。
    -   如果某属性存在较小偏差（如颜色不同、材质不同），判为“**基本相关**”。
    -   如果某属性与查询要求明确相反，则需根据冲突的严重性和替代性判为“**弱相关**”或“**不相关**”。
查询：{query}
商品：
{lines}
## 输出格式
严格输出 {n} 行，每行只能是以下四者之一：
完全相关
基本相关
弱相关
不相关
输出行必须与上方商品顺序一一对应。
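不要输出任何其他内容。
"""


def classify_batch_simple_prompt(query: str, numbered_doc_lines: Sequence[str]) -> str:
    """Build the three-label batch prompt; the model replies with one label per line."""
    lines = "\n".join(numbered_doc_lines)
    # n tells the model exactly how many label lines to emit.
    n = len(numbered_doc_lines)
    return _CLASSIFY_BATCH_SIMPLE_TEMPLATE.format(query=query, lines=lines, n=n)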
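

# --- Illustrative sketch, not part of the original module --------------------
# The simple template asks the model for exactly n lines, one label per
# product, in product order. A caller might validate a reply roughly as below.
# The names `_example_parse_simple_labels` and `_EXAMPLE_SIMPLE_LABELS`, and
# the choice to raise ValueError on any malformed reply, are assumptions made
# for illustration; the real caller may be more lenient.
from typing import List

_EXAMPLE_SIMPLE_LABELS = ("Exact", "Partial", "Irrelevant")


def _example_parse_simple_labels(reply: str, n: int) -> List[str]:
    """Return the n labels in product order, or raise ValueError."""
    # Ignore blank lines and surrounding whitespace the model may add.
    labels = [line.strip() for line in reply.splitlines() if line.strip()]
    if len(labels) != n:
        raise ValueError(f"expected {n} labels, got {len(labels)}")
    for label in labels:
        if label not in _EXAMPLE_SIMPLE_LABELS:
            raise ValueError(f"unexpected label: {label!r}")
    return labels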
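

# Prompt that turns a raw query (plus optional parser hints) into a JSON intent profile.
# Literal JSON braces are doubled so that str.format leaves them intact.
_EXTRACT_QUERY_PROFILE_TEMPLATE = """You are building a structured intent profile for e-commerce relevance judging.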
Use the original user query as the source of truth. Parser hints may help, but if a hint conflicts with the original query, trust the original query.
Be conservative: only mark an attribute as required if the user explicitly asked for it.
Return JSON with this schema:
{{
  "normalized_query_en": string,
  "primary_category": string,
  "allowed_categories": [string],
  "required_attributes": [
    {{"name": string, "required_terms": [string], "conflicting_terms": [string], "match_mode": "explicit"}}
  ],
  "notes": [string]
}}
Guidelines:
- A later judging step will label a product Exact only if there is explicit evidence for every required attribute.
- allowed_categories should contain only near-synonyms of the same product type, not substitutes. For example dress can allow midi dress/cocktail dress, but not skirt, top, jumpsuit, or outfit unless the query explicitly asks for them.
- If the query asks for dress/skirt/jeans/t-shirt, near but different product types are not Exact.
- If the query includes color, fit, silhouette, or length, include them as required_attributes.
- For fit words, include conflicting terms when obvious, e.g. fitted conflicts with oversized/loose; oversized conflicts with fitted/tight.
- For color, include conflicting colors only when clear from the query.
Original query: {query}
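Parser hints JSON: {hints_json}
"""


def extract_query_profile_prompt(query: str, parser_hints: Dict[str, Any]) -> str:
    """Build the prompt that asks the model for a JSON intent profile of ``query``."""
    # ensure_ascii=False keeps non-ASCII hint text readable in the prompt.
    hints_json = json.dumps(parser_hints, ensure_ascii=False)
    return _EXTRACT_QUERY_PROFILE_TEMPLATE.format(query=query, hints_json=hints_json)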
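

# --- Illustrative sketch, not part of the original module --------------------
# The schema in the template above implies a parsed reply shaped like
# `_EXAMPLE_PROFILE`. The example values (built from the "red fitted t-shirt"
# query used elsewhere in this file) and the helper
# `_example_missing_profile_keys` are hypothetical; a real caller may validate
# the profile differently, or not at all.
from typing import Any, Dict, List

_EXAMPLE_PROFILE_KEYS = (
    "normalized_query_en",
    "primary_category",
    "allowed_categories",
    "required_attributes",
    "notes",
)

_EXAMPLE_PROFILE: Dict[str, Any] = {
    "normalized_query_en": "red fitted t-shirt",
    "primary_category": "t-shirt",
    "allowed_categories": ["t-shirt", "tee"],
    "required_attributes": [
        {
            "name": "color",
            "required_terms": ["red"],
            "conflicting_terms": [],
            "match_mode": "explicit",
        },
        {
            "name": "fit",
            "required_terms": ["fitted"],
            "conflicting_terms": ["oversized", "loose"],
            "match_mode": "explicit",
        },
    ],
    "notes": [],
}


def _example_missing_profile_keys(profile: Dict[str, Any]) -> List[str]:
    """List the top-level schema keys that are absent from a parsed profile."""
    return [key for key in _EXAMPLE_PROFILE_KEYS if key not in profile]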


# Batch prompt that judges products against the structured profile and returns JSON labels.
_CLASSIFY_BATCH_COMPLEX_TEMPLATE = """You are an e-commerce search relevance judge.
Judge each product against the structured query profile below.
Relevance rules:
- Exact: product type matches the target intent, and every explicit required attribute is positively supported by the title/options/tags/category. If an attribute is missing or only guessed, it is NOT Exact.
- Partial: main product type/use case matches, but some required attribute is missing, weaker, uncertain, or only approximately matched.
- Irrelevant: product type/use case mismatched, or an explicit required attribute clearly conflicts.
- Be conservative with Exact.
- Graphic/holiday/message tees are not Exact for a plain color/style tee query unless that graphic/theme was requested.
- Jumpsuit/romper/set is not Exact for dress/skirt/jeans queries.
Original query: {query}
Structured query profile JSON: {profile_json}
Products:
{lines}
Return JSON only, with schema:
{{"labels":[{{"index":1,"label":"Exact","reason":"short phrase"}}]}}
"""


def classify_batch_complex_prompt(
    query: str,
    query_profile: Dict[str, Any],
    numbered_doc_lines: Sequence[str],
) -> str:
    """Build the profile-aware batch prompt; the model replies with a JSON ``labels`` list."""
    lines = "\n".join(numbered_doc_lines)
    # ensure_ascii=False keeps non-ASCII profile text readable in the prompt.
    profile_json = json.dumps(query_profile, ensure_ascii=False)
    return _CLASSIFY_BATCH_COMPLEX_TEMPLATE.format(
        query=query,
        profile_json=profile_json,
        lines=lines,
    )
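

# --- Illustrative sketch, not part of the original module --------------------
# The complex template asks for JSON of the form
# {"labels": [{"index": 1, "label": "Exact", "reason": "..."}]}.
# A caller might map that reply back onto product order roughly as below.
# The helper name `_example_parse_complex_labels` is hypothetical, and
# treating `index` as 1-based is an assumption taken from the schema example.
import json
from typing import List


def _example_parse_complex_labels(reply: str, n: int) -> List[str]:
    """Return labels ordered by product index; ValueError if 1..n is not covered."""
    items = json.loads(reply)["labels"]
    by_index = {int(item["index"]): item["label"] for item in items}
    # Duplicate or missing indexes both leave the key set different from 1..n.
    if sorted(by_index) != list(range(1, n + 1)):
        raise ValueError(f"expected indexes 1..{n}, got {sorted(by_index)}")
    return [by_index[i] for i in range(1, n + 1)]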