scripts/evaluation/eval_framework/prompts.py 12.1 KB
  """LLM prompt templates for relevance judging (keep wording changes here)."""
  
  from __future__ import annotations
  
  import json
  from typing import Any, Dict, Sequence
  
  _CLASSIFY_BATCH_SIMPLE_TEMPLATE = """You are a relevance evaluation assistant for an apparel e-commerce search system.
  Given the user query and each product's information, assign one relevance label to each product.
  
  ## Relevance Labels
  
  ### Exact
  The product fully satisfies the user's search intent: the core product type matches, and all explicitly stated key attributes are supported by the product information.
  
  Typical use cases:
  - The query contains only a product type, and the product is exactly that type.
  - The query contains product type + attributes, and the product matches both the type and all explicitly stated attributes.
  
  ### Partial
  The product satisfies the user's primary intent because the core product type matches, but some explicit requirements in the query are missing, cannot be confirmed, or deviate from the query. Despite the mismatch, the product can still be considered a non-target but acceptable substitute.
  
  Use Partial when:
  - The core product type matches, but some requested attributes cannot be confirmed.
  - The core product type matches, but some secondary requirements deviate or are inconsistent.
  - The product is not the ideal target, but it is still a plausible and acceptable substitute for the shopper.
  
  Typical cases:
  - Query: "red fitted t-shirt", product: "Women's T-Shirt" → color/fit cannot be confirmed.
  - Query: "red fitted t-shirt", product: "Blue Fitted T-Shirt" → product type and fit match, but color differs.
  
  Detailed example:
  - Query: "cotton long sleeve shirt"
  - Product: "J.VER Men's Linen Shirts Casual Button Down Long Sleeve Shirt Solid Spread Collar Summer Beach Shirts with Pocket"
  
  Analysis:
  - Material mismatch: the query explicitly requires cotton, while the product is linen, so it cannot be Exact.
  - However, the core product type still matches: both are long sleeve shirts.
  - In an e-commerce setting, the shopper may still consider clicking this item because the style and use case are similar.
  - Therefore, it should be labeled Partial as a non-target but acceptable substitute.
  
  ### Irrelevant
  The product does not satisfy the user's main shopping intent.
  
  Use Irrelevant when:
  - The core product type does not match the query.
  - The product belongs to a broadly related category, but not the specific product subtype requested, and shoppers would not consider them interchangeable.
  - The core product type matches, but the product clearly contradicts an explicit and important requirement in the query.
  
  Typical cases:
  - Query: "pants", product: "shoes" → wrong product type.
  - Query: "dress", product: "skirt" → different product type.
  - Query: "fitted pants", product: "loose wide-leg pants" → explicit contradiction on fit.
  - Query: "sleeveless dress", product: "long sleeve dress" → explicit contradiction on sleeve style.
  
  This label emphasizes clarity of user intent. When the query specifies a concrete product type or an important attribute, products that conflict with that intent should be judged Irrelevant even if they are related at a higher category level.
  
  ## Decision Principles
  
  1. Product type is the highest-priority factor.
     If the query clearly specifies a concrete product type, the result must match that product type to be Exact or Partial.
     A different product type is usually Irrelevant, not Partial.
  
  2. Similar or related product types are not interchangeable when the query is specific.
     For example:
     - dress vs skirt vs jumpsuit
     - jeans vs pants
     - t-shirt vs blouse
     - cardigan vs sweater
     - boots vs shoes
     - bra vs top
     - backpack vs bag
     If the user explicitly searched for one of these, the others should usually be judged Irrelevant.
  
  3. If the core product type matches, then evaluate attributes.
     - If all explicit attributes match → Exact
     - If some attributes are missing, uncertain, or partially mismatched, but the item is still an acceptable substitute → Partial
     - If an explicit and important attribute is clearly contradicted, and the item is not a reasonable substitute → Irrelevant
  
  4. Distinguish carefully between "not mentioned" and "contradicted".
     - If an attribute is not mentioned or cannot be verified, prefer Partial.
     - If an attribute is explicitly opposite to the query, use Irrelevant unless the item is still reasonably acceptable as a substitute under the shopping context.
  
  Query: {query}
  
  Products:
  {lines}
  
  ## Output Format
  Strictly output {n} lines, each line containing exactly one of:
  Exact
  Partial
  Irrelevant
  
  The lines must correspond sequentially to the products above.
  Do not output any other information.
  """
  
  _CLASSIFY_BATCH_SIMPLE_TEMPLATE__zh = _CLASSIFY_BATCH_SIMPLE_TEMPLATE_ZH = """你是一个服饰电商搜索系统中的相关性判断助手。
  给定用户查询词以及每个商品的信息,请为每个商品分配一个相关性标签。
  
  ## 相关性标签
  
  ### 完全相关
  核心产品类型匹配,所有明确提及的关键属性均有产品信息支撑。
  
  典型适用场景:
  - 查询仅包含产品类型,产品即为该类型。
  - 查询包含“产品类型 + 属性”,产品在类型及所有明确属性上均符合。
  
  ### 部分相关
  产品满足用户的主要意图(核心产品类型匹配),但查询中明确的部分要求未体现,或存在偏差。虽然有不一致,但仍属于“非目标但可接受”的替代品。
  
  在以下情况使用部分相关:
  - 核心产品类型匹配,但部分请求的属性在商品信息中缺失、未提及或无法确认。
  - 核心产品类型匹配,但材质、版型、风格等次要要求存在偏差或不一致。
  - 商品不是用户最理想的目标,但从电商购物角度看,仍可能被用户视为可接受的替代品。
  
  典型情况:
  - 查询:“红色修身T恤”,产品:“女士T恤” → 颜色/版型无法确认。
  - 查询:“红色修身T恤”,产品:“蓝色修身T恤” → 产品类型和版型匹配,但颜色不同。
  
  详细案例:
  - 查询:“棉质长袖衬衫”
  - 商品:“J.VER男式亚麻衬衫休闲纽扣长袖衬衫纯色平领夏季沙滩衬衫带口袋”
  
  分析:
  - 材质不符:Query 明确指定“棉质”,而商品为“亚麻”,因此不能判为完全相关。
  - 但核心品类仍然匹配:两者都是“长袖衬衫”。
  - 在电商搜索中,用户仍可能因为款式、穿着场景相近而点击该商品。
  - 因此应判为部分相关,即“非目标但可接受”的替代品。
  
  ### 不相关
  产品未满足用户的主要购物意图,主要表现为以下情形之一:
  - 核心产品类型与查询不匹配。
  - 产品虽属大致相关的大类,但与查询指定的具体子类不可互换。
  - 核心产品类型匹配,但产品明显违背了查询中一个明确且重要的属性要求。
  
  典型情况:
  - 查询:“裤子”,产品:“鞋子” → 产品类型错误。
  - 查询:“连衣裙”,产品:“半身裙” → 具体产品类型不同。
  - 查询:“修身裤”,产品:“宽松阔腿裤” → 与版型要求明显冲突。
  - 查询:“无袖连衣裙”,产品:“长袖连衣裙” → 与袖型要求明显冲突。
  
  该标签强调用户意图的明确性。当查询指向具体类型或关键属性时,即使产品在更高层级类别上相关,也应按不相关处理。
  
  ## 判断原则
  
  1. 产品类型是最高优先级因素。
     如果查询明确指定了具体产品类型,那么结果必须匹配该产品类型,才可能判为“完全相关”或“部分相关”。
     不同产品类型通常应判为“不相关”,而不是“部分相关”。
  
  2. 相似或相关的产品类型,在查询明确时通常不可互换。
     例如:
     - 连衣裙 vs 半身裙 vs 连体裤
     - 牛仔裤 vs 裤子
     - T恤 vs 衬衫/上衣
     - 开衫 vs 毛衣
     - 靴子 vs 鞋子
     - 文胸 vs 上衣
     - 双肩包 vs 包
     如果用户明确搜索其中一种,其他类型通常应判为“不相关”。
  
  3. 当核心产品类型匹配后,再评估属性。
     - 所有明确属性都匹配 → 完全相关
     - 部分属性缺失、无法确认,或存在一定偏差,但仍是可接受替代品 → 部分相关
     - 明确且重要的属性被明显违背,且不能作为合理替代品 → 不相关
  
  4. 要严格区分“未提及/无法确认”和“明确冲突”。
     - 如果某属性没有提及,或无法验证,优先判为“部分相关”。
     - 如果某属性与查询要求明确相反,则判为“不相关”;除非在购物语境下它仍明显属于可接受替代品。
  
  查询:{query}
  
  商品:
  {lines}
  
  ## 输出格式
  严格输出 {n} 行,每行只能是以下三者之一:
  完全相关
  部分相关
  不相关
  
  输出行必须与上方商品顺序一一对应。
  不要输出任何其他内容。
  """
  
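# Both simple templates above ask the model for exactly {n} lines, one label per
# line, in English or Chinese. A minimal sketch of how a caller might parse such
# a reply is shown here. The helper name `parse_simple_labels` and the mapping
# `_ZH_TO_EN` are hypothetical and are not part of this module; the label
# strings themselves are taken from the templates.

```python
from typing import List

# Assumed mapping from the Chinese labels to the English ones (illustrative).
_ZH_TO_EN = {"完全相关": "Exact", "部分相关": "Partial", "不相关": "Irrelevant"}
_VALID_LABELS = {"Exact", "Partial", "Irrelevant"}


def parse_simple_labels(raw: str, n: int) -> List[str]:
    """Parse a one-label-per-line reply and check that it holds exactly n labels."""
    labels: List[str] = []
    for line in raw.splitlines():
        text = line.strip()
        if not text:
            continue  # tolerate stray blank lines in the reply
        label = _ZH_TO_EN.get(text, text)  # normalize Chinese labels to English
        if label not in _VALID_LABELS:
            raise ValueError(f"unexpected label: {text!r}")
        labels.append(label)
    if len(labels) != n:
        raise ValueError(f"expected {n} labels, got {len(labels)}")
    return labels
```

# Raising on a count mismatch, rather than padding or truncating, keeps labels
# from silently shifting onto the wrong product.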
  
  
  def classify_batch_simple_prompt(query: str, numbered_doc_lines: Sequence[str]) -> str:
      """Render the English simple-classification prompt for one query and its product lines."""
      lines = "\n".join(numbered_doc_lines)
      n = len(numbered_doc_lines)
      return _CLASSIFY_BATCH_SIMPLE_TEMPLATE.format(query=query, lines=lines, n=n)
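
# The prompt requires the output lines to correspond sequentially to the products,
# so each entry of `numbered_doc_lines` should stay on a single line. The exact
# line format used by callers is not shown in this file; the sketch below assumes
# a "1. title" style, and the helper name `number_doc_lines` is hypothetical.

```python
from typing import List, Sequence


def number_doc_lines(titles: Sequence[str]) -> List[str]:
    """Prefix each product title with a 1-based index and collapse whitespace."""
    # " ".join(t.split()) folds newlines and repeated spaces into single spaces,
    # so one product never spills across several prompt lines.
    return [f"{i}. {' '.join(t.split())}" for i, t in enumerate(titles, start=1)]
```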
  
  
  _EXTRACT_QUERY_PROFILE_TEMPLATE = """You are building a structured intent profile for e-commerce relevance judging.
  Use the original user query as the source of truth. Parser hints may help, but if a hint conflicts with the original query, trust the original query.
  Be conservative: only mark an attribute as required if the user explicitly asked for it.
  
  Return JSON with this schema:
  {{
    "normalized_query_en": string,
    "primary_category": string,
    "allowed_categories": [string],
    "required_attributes": [
      {{"name": string, "required_terms": [string], "conflicting_terms": [string], "match_mode": "explicit"}}
    ],
    "notes": [string]
  }}
  
  Guidelines:
  - The Exact label will later require explicit evidence for every required attribute.
  - allowed_categories should contain only near-synonyms of the same product type, not substitutes. For example, dress can allow midi dress/cocktail dress, but not skirt, top, jumpsuit, or outfit unless the query explicitly asks for them.
  - If the query asks for dress/skirt/jeans/t-shirt, near but different product types are not Exact.
  - If the query includes color, fit, silhouette, or length, include them as required_attributes.
  - For fit words, include conflicting terms when obvious, e.g. fitted conflicts with oversized/loose; oversized conflicts with fitted/tight.
  - For color, include conflicting colors only when clear from the query.
  
  Original query: {query}
  Parser hints JSON: {hints_json}
  """
  
  
  def extract_query_profile_prompt(query: str, parser_hints: Dict[str, Any]) -> str:
      """Render the prompt that asks the model for a structured intent profile of the query."""
      hints_json = json.dumps(parser_hints, ensure_ascii=False)
      return _EXTRACT_QUERY_PROFILE_TEMPLATE.format(query=query, hints_json=hints_json)
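
# The template above asks for a JSON object with five top-level keys. A minimal
# sketch of checking a model reply against that schema follows. The helper name
# `parse_query_profile` is hypothetical, and the checks cover only key presence
# and list types, not the full attribute structure.

```python
import json
from typing import Any, Dict

# Top-level keys named in the prompt's schema.
_PROFILE_KEYS = (
    "normalized_query_en",
    "primary_category",
    "allowed_categories",
    "required_attributes",
    "notes",
)


def parse_query_profile(raw: str) -> Dict[str, Any]:
    """Decode the reply and verify the top-level shape of the profile."""
    profile = json.loads(raw)
    if not isinstance(profile, dict):
        raise ValueError("profile must be a JSON object")
    missing = [key for key in _PROFILE_KEYS if key not in profile]
    if missing:
        raise ValueError(f"profile is missing keys: {missing}")
    for key in ("allowed_categories", "required_attributes", "notes"):
        if not isinstance(profile[key], list):
            raise ValueError(f"{key} must be a list")
    return profile
```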
  
  
  _CLASSIFY_BATCH_COMPLEX_TEMPLATE = """You are an e-commerce search relevance judge.
  Judge each product against the structured query profile below.
  
  Relevance rules:
  - Exact: product type matches the target intent, and every explicit required attribute is positively supported by the title/options/tags/category. If an attribute is missing or only guessed, it is NOT Exact.
  - Partial: main product type/use case matches, but some required attribute is missing, weaker, uncertain, or only approximately matched.
  - Irrelevant: product type/use case mismatched, or an explicit required attribute clearly conflicts.
  - Be conservative with Exact.
  - Graphic/holiday/message tees are not Exact for a plain color/style tee query unless that graphic/theme was requested.
  - Jumpsuit/romper/set is not Exact for dress/skirt/jeans queries.
  
  Original query: {query}
  Structured query profile JSON: {profile_json}
  
  Products:
  {lines}
  
  Return JSON only, with schema:
  {{"labels":[{{"index":1,"label":"Exact","reason":"short phrase"}}]}}
  """
  
  
  def classify_batch_complex_prompt(
      query: str,
      query_profile: Dict[str, Any],
      numbered_doc_lines: Sequence[str],
  ) -> str:
      """Render the profile-based classification prompt; the model is asked to reply with JSON only."""
      lines = "\n".join(numbered_doc_lines)
      profile_json = json.dumps(query_profile, ensure_ascii=False)
      return _CLASSIFY_BATCH_COMPLEX_TEMPLATE.format(
          query=query,
          profile_json=profile_json,
          lines=lines,
      )
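
# The complex template asks for JSON of the form
# {"labels":[{"index":1,"label":"Exact","reason":"short phrase"}]}. A minimal
# sketch of validating such a reply follows. The helper name
# `parse_complex_labels` is hypothetical, and it assumes that `index` is 1-based
# and matches the numbering of the product lines, which this file does not state.

```python
import json
from typing import Dict, List

_VALID_LABELS = {"Exact", "Partial", "Irrelevant"}


def parse_complex_labels(raw: str, n: int) -> List[Dict[str, str]]:
    """Return n label records ordered by product index, or raise ValueError."""
    data = json.loads(raw)
    by_index: Dict[int, Dict[str, str]] = {}
    for item in data["labels"]:
        index = int(item["index"])
        label = item["label"]
        if label not in _VALID_LABELS:
            raise ValueError(f"unexpected label at index {index}: {label!r}")
        by_index[index] = {"label": label, "reason": str(item.get("reason", ""))}
    # Every product 1..n must be labeled exactly once (assumed 1-based indices).
    if sorted(by_index) != list(range(1, n + 1)):
        raise ValueError(f"expected indices 1..{n}, got {sorted(by_index)}")
    return [by_index[i] for i in range(1, n + 1)]
```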