scripts/evaluation/eval_framework/prompts.py

"""LLM prompt templates for relevance judging (keep wording changes here)."""
from __future__ import annotations

import json
from typing import Any, Dict, Sequence

_CLASSIFY_BATCH_SIMPLE_TEMPLATE = """You are a relevance evaluation assistant for an apparel e-commerce search system.
Given the user query and each product's information, assign one relevance label to each product.
## Relevance Labels
### Exact
The product fully satisfies the user's search intent: the core product type matches, and all explicitly stated key attributes are supported by the product information.
Typical use cases:
- The query contains only a product type, and the product is exactly that type.
- The query contains product type + attributes, and the product matches both the type and all explicitly stated attributes.
### Partial
The product satisfies the user's primary intent because the core product type matches, but some explicit requirements in the query are missing, cannot be confirmed, or deviate from the query. Despite the mismatch, the product can still be considered a non-target but acceptable substitute.
Use Partial when:
- The core product type matches, but some requested attributes cannot be confirmed.
- The core product type matches, but some secondary requirements deviate or are inconsistent.
- The product is not the ideal target, but it is still a plausible and acceptable substitute for the shopper.
Typical cases:
- Query: "red fitted t-shirt", product: "Women's T-Shirt" → color/fit cannot be confirmed.
- Query: "red fitted t-shirt", product: "Blue Fitted T-Shirt" → product type and fit match, but color differs.
Detailed example:
- Query: "cotton long sleeve shirt"
- Product: "J.VER Men's Linen Shirts Casual Button Down Long Sleeve Shirt Solid Spread Collar Summer Beach Shirts with Pocket"
Analysis:
- Material mismatch: the query explicitly requires cotton, while the product is linen, so it cannot be Exact.
- However, the core product type still matches: both are long sleeve shirts.
- In an e-commerce setting, the shopper may still consider clicking this item because the style and use case are similar.
- Therefore, it should be labeled Partial as a non-target but acceptable substitute.
### Irrelevant
The product does not satisfy the user's main shopping intent.
Use Irrelevant when:
- The core product type does not match the query.
- The product belongs to a broadly related category, but not the specific product subtype requested, and shoppers would not consider them interchangeable.
- The core product type matches, but the product clearly contradicts an explicit and important requirement in the query.
Typical cases:
- Query: "pants", product: "shoes" → wrong product type.
- Query: "dress", product: "skirt" → different product type.
- Query: "fitted pants", product: "loose wide-leg pants" → explicit contradiction on fit.
- Query: "sleeveless dress", product: "long sleeve dress" → explicit contradiction on sleeve style.
This label emphasizes clarity of user intent. When the query specifies a concrete product type or an important attribute, products that conflict with that intent should be judged Irrelevant even if they are related at a higher category level.
## Decision Principles
1. Product type is the highest-priority factor.
   If the query clearly specifies a concrete product type, the result must match that product type to be Exact or Partial.
   A different product type is usually Irrelevant, not Partial.
2. Similar or related product types are not interchangeable when the query is specific.
   For example:
   - dress vs skirt vs jumpsuit
   - jeans vs pants
   - t-shirt vs blouse
   - cardigan vs sweater
   - boots vs shoes
   - bra vs top
   - backpack vs bag
   If the user explicitly searched for one of these, the others should usually be judged Irrelevant.
3. If the core product type matches, then evaluate attributes.
   - If all explicit attributes match → Exact
   - If some attributes are missing, uncertain, or partially mismatched, but the item is still an acceptable substitute → Partial
   - If an explicit and important attribute is clearly contradicted, and the item is not a reasonable substitute → Irrelevant
4. Distinguish carefully between "not mentioned" and "contradicted".
   - If an attribute is not mentioned or cannot be verified, prefer Partial.
   - If an attribute is explicitly opposite to the query, use Irrelevant unless the item is still reasonably acceptable as a substitute under the shopping context.
Query: {query}
Products:
{lines}
## Output Format
Strictly output {n} lines, each line containing exactly one of:
Exact
Partial
Irrelevant
The lines must correspond sequentially to the products above.
Do not output any other information.
"""
_CLASSIFY_BATCH_SIMPLE_TEMPLATE__zh = _CLASSIFY_BATCH_SIMPLE_TEMPLATE_ZH = """你是一个服饰电商搜索系统中的相关性判断助手。
给定用户查询词以及每个商品的信息，请为每个商品分配一个相关性标签。
## 相关性标签
### 完全相关
产品完全满足用户的搜索意图：核心产品类型匹配，且所有明确提及的关键属性均有产品信息支撑。
典型适用场景：
- 查询仅包含产品类型，产品即为该类型。
- 查询包含“产品类型 + 属性”，产品在类型及所有明确属性上均符合。
### 部分相关
产品满足用户的主要意图（核心产品类型匹配），但查询中明确的部分要求未体现，或存在偏差。虽然有不一致，但仍属于“非目标但可接受”的替代品。
在以下情况使用部分相关：
- 核心产品类型匹配，但部分请求的属性在商品信息中缺失、未提及或无法确认。
- 核心产品类型匹配，但材质、版型、风格等次要要求存在偏差或不一致。
- 商品不是用户最理想的目标，但从电商购物角度看，仍可能被用户视为可接受的替代品。
典型情况：
- 查询：“红色修身T恤”，产品：“女士T恤” → 颜色/版型无法确认。
- 查询：“红色修身T恤”，产品：“蓝色修身T恤” → 产品类型和版型匹配，但颜色不同。
详细案例：
- 查询：“棉质长袖衬衫”
- 商品：“J.VER男式亚麻衬衫休闲纽扣长袖衬衫纯色平领夏季沙滩衬衫带口袋”
分析：
- 材质不符：查询明确指定“棉质”，而商品为“亚麻”，因此不能判为完全相关。
- 但核心品类仍然匹配：两者都是“长袖衬衫”。
- 在电商搜索中，用户仍可能因为款式、穿着场景相近而点击该商品。
- 因此应判为部分相关，即“非目标但可接受”的替代品。
### 不相关
产品未满足用户的主要购物意图，主要表现为以下情形之一：
- 核心产品类型与查询不匹配。
- 产品虽属大致相关的大类，但与查询指定的具体子类不可互换。
- 核心产品类型匹配，但产品明显违背了查询中一个明确且重要的属性要求。
典型情况：
- 查询：“裤子”，产品：“鞋子” → 产品类型错误。
- 查询：“连衣裙”，产品：“半身裙” → 具体产品类型不同。
- 查询：“修身裤”，产品：“宽松阔腿裤” → 与版型要求明显冲突。
- 查询：“无袖连衣裙”，产品：“长袖连衣裙” → 与袖型要求明显冲突。
该标签强调用户意图的明确性。当查询指向具体类型或关键属性时，即使产品在更高层级类别上相关，也应按不相关处理。
## 判断原则
1. 产品类型是最高优先级因素。
   如果查询明确指定了具体产品类型，那么结果必须匹配该产品类型，才可能判为“完全相关”或“部分相关”。
   不同产品类型通常应判为“不相关”，而不是“部分相关”。
2. 相似或相关的产品类型，在查询明确时通常不可互换。
   例如：
   - 连衣裙 vs 半身裙 vs 连体裤
   - 牛仔裤 vs 裤子
   - T恤 vs 衬衫/上衣
   - 开衫 vs 毛衣
   - 靴子 vs 鞋子
   - 文胸 vs 上衣
   - 双肩包 vs 包
   如果用户明确搜索其中一种，其他类型通常应判为“不相关”。
3. 当核心产品类型匹配后，再评估属性。
   - 所有明确属性都匹配 → 完全相关
   - 部分属性缺失、无法确认，或存在一定偏差，但仍是可接受替代品 → 部分相关
   - 明确且重要的属性被明显违背，且不能作为合理替代品 → 不相关
4. 要严格区分“未提及/无法确认”和“明确冲突”。
   - 如果某属性没有提及，或无法验证，优先判为“部分相关”。
   - 如果某属性与查询要求明确相反，则判为“不相关”；除非在购物语境下它仍明显属于可接受替代品。
查询：{query}
商品：
{lines}
## 输出格式
严格输出 {n} 行，每行只能是以下三者之一：
完全相关
部分相关
不相关
输出行必须与上方商品顺序一一对应。
不要输出任何其他内容。
"""
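

def classify_batch_simple_prompt(query: str, numbered_doc_lines: Sequence[str]) -> str:
    """Build the simple batch prompt: one product per line, one label line expected per product."""
    lines = "\n".join(numbered_doc_lines)
    n = len(numbered_doc_lines)  # number of label lines the model must return
    return _CLASSIFY_BATCH_SIMPLE_TEMPLATE.format(query=query, lines=lines, n=n)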
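

# Illustrative sketch (not part of the original module): one way a caller might
# parse the model's reply to the simple batch prompt, whose Output Format asks
# for exactly n lines, each being Exact, Partial, or Irrelevant. The names
# `parse_simple_labels` and `_SIMPLE_LABELS_EXAMPLE`, and the choice to raise
# ValueError on malformed output, are assumptions; the real evaluation code may
# handle failures differently (for example by retrying the request).
from typing import List

_SIMPLE_LABELS_EXAMPLE = ("Exact", "Partial", "Irrelevant")  # labels named in the prompt


def parse_simple_labels(raw: str, n: int) -> List[str]:
    """Return one label per product, or raise ValueError if the reply is malformed."""
    # Drop blank lines and surrounding whitespace before validating.
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    if len(lines) != n:
        raise ValueError(f"expected {n} labels, got {len(lines)}")
    for line in lines:
        if line not in _SIMPLE_LABELS_EXAMPLE:
            raise ValueError(f"unexpected label: {line!r}")
    return lines

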
_EXTRACT_QUERY_PROFILE_TEMPLATE = """You are building a structured intent profile for e-commerce relevance judging.
Use the original user query as the source of truth. Parser hints may help, but if a hint conflicts with the original query, trust the original query.
Be conservative: only mark an attribute as required if the user explicitly asked for it.
Return JSON with this schema:
{{
  "normalized_query_en": string,
  "primary_category": string,
  "allowed_categories": [string],
  "required_attributes": [
    {{"name": string, "required_terms": [string], "conflicting_terms": [string], "match_mode": "explicit"}}
  ],
  "notes": [string]
}}
Guidelines:
- A later Exact judgment will require explicit evidence for every required attribute.
- allowed_categories should contain only near-synonyms of the same product type, not substitutes. For example, dress can allow midi dress/cocktail dress, but not skirt, top, jumpsuit, or outfit unless the query explicitly asks for them.
- If the query asks for dress/skirt/jeans/t-shirt, near but different product types are not Exact.
- If the query includes color, fit, silhouette, or length, include them as required_attributes.
- For fit words, include conflicting terms when obvious, e.g. fitted conflicts with oversized/loose; oversized conflicts with fitted/tight.
- For color, include conflicting colors only when clear from the query.
Original query: {query}
Parser hints JSON: {hints_json}
"""
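

def extract_query_profile_prompt(query: str, parser_hints: Dict[str, Any]) -> str:
    """Build the prompt that asks the model for a structured intent profile of the query."""
    # ensure_ascii=False keeps non-ASCII hint text readable in the prompt.
    hints_json = json.dumps(parser_hints, ensure_ascii=False)
    return _EXTRACT_QUERY_PROFILE_TEMPLATE.format(query=query, hints_json=hints_json)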
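

# Illustrative sketch (not part of the original module): a light structural check
# for the JSON that the profile-extraction prompt asks the model to return. The
# key names come from the schema in the template above; the names
# `check_query_profile` and `_PROFILE_KEY_TYPES_EXAMPLE`, and the choice to
# return a list of problems instead of raising, are assumptions.
import json
from typing import Any, List

_PROFILE_KEY_TYPES_EXAMPLE = {
    "normalized_query_en": str,
    "primary_category": str,
    "allowed_categories": list,
    "required_attributes": list,
    "notes": list,
}


def check_query_profile(raw: str) -> List[str]:
    """Return human-readable problems; an empty list means the shape looks right."""
    try:
        profile: Any = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(profile, dict):
        return ["top-level value is not an object"]
    problems: List[str] = []
    for key, expected in _PROFILE_KEY_TYPES_EXAMPLE.items():
        if key not in profile:
            problems.append(f"missing key: {key}")
        elif not isinstance(profile[key], expected):
            problems.append(f"wrong type for {key}")
    attrs = profile.get("required_attributes")
    if isinstance(attrs, list):
        for i, attr in enumerate(attrs):
            # Each attribute entry needs at least a name and its required terms.
            if not isinstance(attr, dict) or "name" not in attr or "required_terms" not in attr:
                problems.append(f"required_attributes[{i}] is malformed")
    return problems

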
_CLASSIFY_BATCH_COMPLEX_TEMPLATE = """You are an e-commerce search relevance judge.
Judge each product against the structured query profile below.
Relevance rules:
- Exact: product type matches the target intent, and every explicit required attribute is positively supported by the title/options/tags/category. If an attribute is missing or only guessed, it is NOT Exact.
- Partial: main product type/use case matches, but some required attribute is missing, weaker, uncertain, or only approximately matched.
- Irrelevant: the product type/use case does not match, or an explicit required attribute clearly conflicts.
- Be conservative with Exact.
- Graphic/holiday/message tees are not Exact for a plain color/style tee query unless that graphic/theme was requested.
- Jumpsuit/romper/set is not Exact for dress/skirt/jeans queries.
Original query: {query}
Structured query profile JSON: {profile_json}
Products:
{lines}
Return JSON only, with schema:
{{"labels":[{{"index":1,"label":"Exact","reason":"short phrase"}}]}}
"""
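

def classify_batch_complex_prompt(
    query: str,
    query_profile: Dict[str, Any],
    numbered_doc_lines: Sequence[str],
) -> str:
    """Build the profile-based batch prompt; the model is asked to reply with JSON only."""
    lines = "\n".join(numbered_doc_lines)
    # ensure_ascii=False keeps non-ASCII profile text readable in the prompt.
    profile_json = json.dumps(query_profile, ensure_ascii=False)
    return _CLASSIFY_BATCH_COMPLEX_TEMPLATE.format(
        query=query,
        profile_json=profile_json,
        lines=lines,
    )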
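

# Illustrative sketch (not part of the original module): one way a caller might
# parse the JSON reply requested by the complex batch prompt, i.e.
# {"labels": [{"index": 1, "label": "Exact", "reason": "short phrase"}]}.
# The name `parse_complex_labels`, the assumption that indexes are 1-based and
# cover every product exactly once, and raising ValueError on bad output are
# all assumptions; the real evaluation code may be more lenient.
import json
from typing import Dict


def parse_complex_labels(raw: str, n: int) -> Dict[int, str]:
    """Map each 1-based product index to its label; raise ValueError on bad output."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc.msg}") from exc
    items = payload.get("labels") if isinstance(payload, dict) else None
    if not isinstance(items, list):
        raise ValueError('reply has no "labels" list')
    labels: Dict[int, str] = {}
    for item in items:
        if not isinstance(item, dict):
            raise ValueError(f"label entry is not an object: {item!r}")
        index, label = item.get("index"), item.get("label")
        # bool is a subclass of int, so reject it explicitly.
        if not isinstance(index, int) or isinstance(index, bool) or not 1 <= index <= n:
            raise ValueError(f"bad index: {index!r}")
        if label not in ("Exact", "Partial", "Irrelevant"):
            raise ValueError(f"bad label: {label!r}")
        labels[index] = label
    if len(labels) != n:
        raise ValueError(f"expected {n} labels, got {len(labels)}")
    return labels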