scripts/evaluation/eval_framework/prompts.py 20.7 KB
  """LLM prompt templates for relevance judging (keep wording changes here)."""
  
  from __future__ import annotations
  
  import json
  from typing import Any, Dict, Sequence
  
  _CLASSIFY_BATCH_SIMPLE_TEMPLATE = """You are a relevance judgment assistant for a fashion e-commerce search system.
  Given a user query and the information for each product, assign a relevance label to each product.
  
  Your goal is to judge relevance from the perspective of e-commerce search ranking.
  The key question is whether the user would view the product as the intended item, or as an acceptable substitute.
  
  ## Relevance Labels
  
  ### Exact Match
  The product satisfies the user's core shopping intent: the core product type matches, and all explicitly stated key attributes in the query are supported by the product information, with no obvious conflict.
  
  Typical use cases:
  - The query contains only a product type, and the product is exactly that type.
  - The query contains product type + attributes, and the product matches both the type and all explicitly stated attributes.
  
  ### High Relevant
  The product satisfies the user's main intent: the core product type matches, but some explicitly requested attributes are missing from the product information, cannot be confirmed, or show minor / non-critical deviations. The product is still a good substitute for the user's core need.
  
  Use High Relevant in the following cases:
  - The core product type matches, but some requested attributes are missing, not mentioned, or cannot be verified.
  - The core product type matches, but attributes such as color, material, style, fit, or length have minor deviations, as long as the deviation does not materially undermine the user's main shopping intent.
  - The product is not the user's ideal target, but in an e-commerce shopping context, it would still be considered an acceptable and strong substitute.
  
  Typical examples:
  - Query: red slim-fit T-shirt
    Product: women's T-shirt
    → Color and fit cannot be confirmed.
  - Query: red slim-fit T-shirt
    Product: blue slim-fit T-shirt
    → Product type and fit match, but the color is different.
  
  Detailed case:
  - Query: cotton long-sleeve shirt
  - Product: J.VER Men's Linen Shirt Casual Button Down Long Sleeve Solid Plain Collar Summer Beach Shirt with Pocket
  
  Analysis:
  - Material mismatch: the query explicitly requires cotton, while the product is linen, so it cannot be labeled as Exact Match.
  - However, the core category still matches: both are long-sleeve shirts.
  - In e-commerce search, users may still click this item because the style and wearing scenario are similar.
  - Therefore, it should be labeled as High Relevant: not the exact target, but a good substitute.
  
  Detailed case:
  - Query: black mid-length skirt
  - Product: New spring autumn loose slimming full long floral skirt pleated skirt
  
  Analysis:
  - Category match: the product is a skirt, so the category matches.
  - Color mismatch: the product description does not indicate black and explicitly mentions floral, which is substantially different from plain black.
  - Length deviation: the user asks for mid-length, while the product title emphasizes long skirt, which is somewhat longer.
  - However, the core category skirt still matches, and style features such as slimming and full skirt may still fit some preferences of users searching for a mid-length skirt. Also, long versus mid-length is a deviation, but not a severe contradiction.
  - Therefore, this should be labeled as High Relevant: the core type matches, but there are several non-fatal attribute deviations.
  
  ### Low Relevant
  The product has a noticeable gap from the user's core target, but still shares some similarity with the query in style, scenario, function, or broader category. A small portion of users may still view it as a barely acceptable substitute. It is not the intended item, but still has some relevance.
  
  Use Low Relevant in the following cases:
  - The core product type does not match, but the two types are still very close in style, wearing scenario, or function, so there is still some substitutability.
  - The core product type matches, but the product differs from the user's ideal target on multiple attributes; it still has some relevance, but is no longer a strong substitute.
  - An important query requirement is clearly violated, but the product still retains a limited reason to be clicked.
  
  Typical cases:
  - Query: black mid-length skirt
    Product: New high-waisted V-neck mid-length dress elegant printed black sexy dress
    → The core product type differs (skirt vs dress), but both belong to closely related apparel types and share a similar mid-length style, so it is Low Relevant.
  
  - Query: jeans
    Product: casual pants
    → The core product type is different, but both belong to the broader pants category, and the style / wearing scenario may still be close enough to be a weak substitute.
  
  ### Irrelevant
  The product does not satisfy the user's main shopping intent, and the likelihood of user engagement is very low.
  
  Typical situations:
  - The core product type does not match the query and is not a close substitute in style, scenario, or function.
  - The product belongs to a roughly related broader category, but not to an interchangeable subtype explicitly requested in the query, and the style or usage scenario differs significantly.
  - The core product type matches, but the product clearly violates an explicit and important requirement in the query, with little or no acceptable substitutability.
  
  Typical examples:
  - Query: pants
    Product: shoes
    → Wrong product type.
  - Query: slim-fit pants
    Product: loose wide-leg pants
    → Clear contradiction in fit, with extremely low substitutability.
  - Query: sleeveless dress
    Product: long-sleeve dress
    → Clear contradiction in sleeve type.
  - Query: jeans
    Product: sweatpants
    → Different core category, with significantly different style and wearing scenario.
  - Query: boots
    Product: sneakers
    → Different core category, different function, and different usage scenario.
  
  ## Judgment Principles
  
  1. **Product type is the highest-priority factor.**
     If the query explicitly specifies a concrete product type, the result must match that product type in order to be labeled as Exact Match or High Relevant.
     Different product types should usually be labeled as Low Relevant or Irrelevant.
  
     - **Low Relevant**: use only when the two product types are very close in style, scenario, or function, and the user may still treat one as a barely acceptable substitute for the other.
     - **Irrelevant**: all other product type mismatch cases.
  
  2. **Similar or related product types are usually not directly interchangeable when the query is explicit, but their closeness should determine whether the label is Low Relevant or Irrelevant.**
     For example:
     - **May be Low Relevant due to strong similarity in style / scenario**: dress vs skirt, long skirt vs mid-length skirt, jeans vs casual pants, sneakers vs skate shoes.
     - **Should be Irrelevant due to substantial difference in style / scenario**: pants vs shoes, T-shirt vs hat, boots vs sneakers, jeans vs suit pants, backpack vs handbag.
  
  3. **Once the core product type matches, evaluate attributes.**
     - All explicit attributes match → **Exact Match**
     - Some attributes are missing, not mentioned, cannot be verified, or show only minor deviations → **High Relevant**
     - There are multiple attribute deviations, or an important attribute is clearly violated, but the product still retains some substitutability → **Low Relevant**
     - There is a clear and important hard conflict, and substitutability is extremely low → **Irrelevant**
  
  4. **Strictly distinguish among "not mentioned / cannot confirm", "minor deviation", and "explicit contradiction".**
     - If an attribute is not mentioned or cannot be verified, prefer **High Relevant**.
     - If an attribute shows a minor deviation, such as different color, different material, or slightly different length, it should usually be labeled **High Relevant**.
     - If an attribute is explicitly opposite to the query requirement, such as sleeveless vs long-sleeve or slim-fit vs loose wide-leg, decide between **Low Relevant** and **Irrelevant** based on the severity of the conflict and practical substitutability.
     - If the conflict directly breaks the user's main shopping goal, it should usually be labeled **Irrelevant**.
  
  5. **Substitutability should be judged from real shopping intent, not just surface-level textual similarity.**
     The question is whether the user would realistically accept the product in a shopping scenario.
     - Good substitute → **High Relevant**
     - Barely acceptable substitute → **Low Relevant**
     - Hardly substitutable at all → **Irrelevant**
  
  6. **When product information is insufficient, do not treat "cannot confirm" as "conflict".**
     If a product does not mention an attribute, that does not mean the attribute is definitely violated.
     Therefore:
     - If the attribute is not mentioned or cannot be confirmed, prefer **High Relevant**;
     - Only treat it as a conflict when the product information clearly shows the opposite of the query requirement.
  
  Query: {query}
  
  Products:
  {lines}
  
  ## Output Format
  Output exactly {n} lines.
  Each line must be exactly one of the following:
  Exact Match
  High Relevant
  Low Relevant
  Irrelevant
  
  The output lines must correspond to the products above in the same order.
  Do not output anything else.
  """
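
# A minimal sketch of how a caller might check the reply requested by the
# Output Format section of the template above: exactly n lines, each one of
# the four labels. The names parse_simple_labels and ALLOWED_LABELS are
# hypothetical and not part of this module; how the real framework parses
# replies is not shown in this file.

```python
from typing import List

# The four labels the English template allows, as listed in its Output Format section.
ALLOWED_LABELS = ("Exact Match", "High Relevant", "Low Relevant", "Irrelevant")


def parse_simple_labels(raw_output: str, expected_count: int) -> List[str]:
    """Split a model reply into one label per product, or raise ValueError."""
    # Ignore blank lines and surrounding whitespace (e.g. a trailing newline).
    labels = [line.strip() for line in raw_output.splitlines() if line.strip()]
    if len(labels) != expected_count:
        raise ValueError(f"expected {expected_count} labels, got {len(labels)}")
    for position, label in enumerate(labels, start=1):
        if label not in ALLOWED_LABELS:
            raise ValueError(f"line {position}: unexpected label {label!r}")
    return labels
```

# Rejecting the whole batch on a count mismatch is one possible design choice:
# labels are matched to products purely by order, so a missing or extra line
# would otherwise shift every label after it.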
  
  _CLASSIFY_BATCH_SIMPLE_TEMPLATE_ZH = """你是一个服饰电商搜索系统中的相关性判断助手。
  给定用户查询词以及每个商品的信息,请为每个商品分配一个相关性标签。
  
  你的目标是从电商搜索排序的角度,判断商品是否满足用户的购物意图。
  判断时应优先考虑“用户是否会把该商品视为目标商品,或可接受的替代品”。
  
  ## 相关性标签
  
  ### 完全相关
  商品满足用户的核心购物意图:核心商品类型匹配,且查询中所有明确提及的关键属性均有商品信息支持。
  
  典型适用场景:
  - 查询仅包含商品类型,商品即为该类型。
  - 查询包含“商品类型 + 属性”,商品在类型及所有明确属性上均符合。
  
  ### 基本相关
  商品满足用户的主要意图:核心商品类型匹配,但查询中明确提出的部分要求未在商品信息中体现、无法确认,或存在轻微偏差 / 非关键偏差。该商品仍是满足用户核心需求的良好替代品。
  
  在以下情况使用“基本相关”:
  - 核心商品类型匹配,但部分属性缺失、未提及或无法确认。
  - 核心商品类型匹配,但颜色、材质、风格、版型、长度等属性存在轻微偏差,只要这种偏差不会明显破坏用户的主要购买意图。
  - 商品不是用户最理想的目标,但在电商购物场景下仍可能被视为可接受、且较优的替代品。
  
  典型情况:
  - 查询:“红色修身T恤”,商品:“女士T恤”
    → 颜色、版型无法确认。
  - 查询:“红色修身T恤”,商品:“蓝色修身T恤”
    → 商品类型和版型匹配,但颜色不同。
  
  详细案例:
  - 查询:“棉质长袖衬衫”
  - 商品:“J.VER男式亚麻衬衫休闲纽扣长袖衬衫纯色平领夏季沙滩衬衫带口袋”
  
  分析:
  - 材质不符:Query 明确指定“棉质”,而商品为“亚麻”,因此不能判为“完全相关”。
  - 但核心品类仍然匹配:两者都是“长袖衬衫”。
  - 在电商搜索中,用户仍可能因为款式、穿着场景相近而点击该商品。
  - 因此应判为“基本相关”,即“非精确目标,但属于良好替代品”。
  
  详细案例:
  - 查询:“黑色中长半身裙”
  - 商品:“春秋季新款宽松显瘦大摆长裙碎花半身裙褶皱设计裙”
  
  分析:
  - 品类匹配:商品是“半身裙”,品类符合。
  - 颜色不匹配:商品描述未提及黑色,且明确包含“碎花”,与纯黑差异较大。
  - 长度存在偏差:用户要求“中长”,而商品标题强调“长裙”,长度偏长。
  - 但核心品类“半身裙”匹配,“显瘦”“大摆”等风格特征仍可能符合部分搜索“中长半身裙”用户的潜在偏好;同时“长裙”和“中长”虽有偏差,但不构成严重对立。
  - 因此应判为“基本相关”:核心品类匹配,但存在若干非致命属性偏差。
  
  ### 弱相关
  商品与用户的核心目标存在明显差距,但仍与查询在风格、场景、功能或大类上具有一定相似性,可能被少量用户视为勉强可接受的替代品。属于“非目标,但仍有一定关联”。
  
  在以下情况使用“弱相关”:
  - 核心商品类型不一致,但两者在风格、穿着场景或功能上非常接近,仍具有一定替代性。
  - 核心商品类型匹配,但在多个属性上与用户理想目标差距较大,虽仍有一定关联性,但已不是高质量替代品。
  - 查询要求中的某个重要属性被明显违背,但商品仍保留少量被点击的理由。
  
  典型情况:
  - 查询:“黑色中长半身裙”,商品:“新款高腰V领中长款连衣裙 优雅印花黑色性感连衣裙”
    → 核心商品类型“半身裙”与“连衣裙”不同,但两者同属裙装,且款式上均为“中长款”,在穿搭场景上接近,因此属于“弱相关”。
  
  - 查询:“牛仔裤”,商品:“休闲裤”
    → 核心商品类型不同,但同属裤装大类,风格和穿着场景可能接近,可作为较弱替代品。
  
  ### 不相关
  商品未满足用户的主要购物意图,用户点击动机极低。
  
  主要表现为以下情形之一:
  - 核心商品类型与查询不匹配,且不属于风格 / 场景 / 功能接近的可替代品。
  - 商品虽属于大致相关的大类,但与查询明确指定的具体子类不可互换,且风格或场景差异大。
  - 核心商品类型匹配,但商品明显违背了查询中一个明确且重要的要求,且几乎不具备可接受的替代性。
  
  典型情况:
  - 查询:“裤子”,商品:“鞋子”
    → 商品类型错误。
  - 查询:“修身裤”,商品:“宽松阔腿裤”
    → 与版型要求明显冲突,替代性极低。
  - 查询:“无袖连衣裙”,商品:“长袖连衣裙”
    → 与袖型要求明显冲突。
  - 查询:“牛仔裤”,商品:“运动裤”
    → 核心品类不同(牛仔裤 vs 运动裤),风格和场景差异大。
  - 查询:“靴子”,商品:“运动鞋”
    → 核心品类不同,功能和适用场景差异大。
  
  ## 判断原则
  
  1. **商品类型是最高优先级因素。**
     如果查询明确指定了具体商品类型,那么结果必须匹配该商品类型,才可能判为“完全相关”或“基本相关”。
     不同商品类型通常应判为“弱相关”或“不相关”。
  
     - **弱相关**:仅当两种商品类型在风格、场景、功能上非常接近,用户有一定概率将其视为勉强可接受的替代品时使用。
     - **不相关**:其他所有商品类型不匹配的情况。
  
  2. **相似或相关的商品类型,在查询明确时通常不可直接互换,但要根据接近程度区分“弱相关”与“不相关”。**
     例如:
     - **风格 / 场景高度接近,可判为弱相关**:连衣裙 vs 半身裙、长裙 vs 中长裙、牛仔裤 vs 休闲裤、运动鞋 vs 板鞋。
     - **风格 / 场景差异大,应判为不相关**:裤子 vs 鞋子、T恤 vs 帽子、靴子 vs 运动鞋、牛仔裤 vs 西装裤、双肩包 vs 手提包。
  
  3. **当核心商品类型匹配后,再评估属性。**
     - 所有明确属性都匹配 → **完全相关**
     - 部分属性缺失、未提及、无法确认,或存在轻微偏差 → **基本相关**
     - 存在多个属性偏差,或某个重要属性被明显违背,但商品仍保留一定替代性 → **弱相关**
     - 存在明确且重要的强冲突,且替代性极低 → **不相关**
  
  4. **要严格区分“未提及 / 无法确认”“轻微偏差”“明确冲突”。**
     - 如果某属性没有提及,或无法验证,优先判为“基本相关”。
     - 如果某属性存在轻微偏差,如颜色不同、材质不同、长度略有差异,通常判为“基本相关”。
     - 如果某属性与查询要求明确相反,如无袖 vs 长袖、修身 vs 宽松阔腿,则要根据冲突严重性与替代性,在“弱相关”与“不相关”之间判断。
     - 若该冲突会直接破坏用户的主要购买目标,通常判为“不相关”。
  
  5. **“是否可替代”应从真实电商购物意图出发判断。**
     不是只看字面相似,而要看用户在购物场景下是否可能接受该商品。
     - 良好替代品 → **基本相关**
     - 勉强替代品 → **弱相关**
     - 几乎不可替代 → **不相关**
  
  6. **若商品信息不足,不要把“无法确认”误判为“冲突”。**
     商品未写明某属性,不等于该属性一定不符合。
     因此:
     - 未提及 / 无法确认,优先按“基本相关”处理;
     - 只有当商品信息明确显示与查询要求相反时,才视为属性冲突。
  
  查询:{query}
  
  商品:
  {lines}
  
  ## 输出格式
  严格输出 {n} 行,每行只能是以下四者之一:
  完全相关
  基本相关
  弱相关
  不相关
  
  输出行必须与上方商品顺序一一对应。
  不要输出任何其他内容。
  """
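
# The Chinese template asks for Chinese labels, while the English one asks for
# English labels. If replies from both templates feed the same downstream
# code, a small normalization step may help. The sketch below assumes the two
# label sets correspond one-to-one in the order the templates list them;
# ZH_TO_EN_LABEL and normalize_label are hypothetical names, not part of this
# module.

```python
from typing import Dict

# Assumed correspondence, based on the parallel structure of the two templates.
ZH_TO_EN_LABEL: Dict[str, str] = {
    "完全相关": "Exact Match",
    "基本相关": "High Relevant",
    "弱相关": "Low Relevant",
    "不相关": "Irrelevant",
}


def normalize_label(label: str) -> str:
    """Return the English label for either a Chinese or an English label."""
    cleaned = label.strip()
    if cleaned in ZH_TO_EN_LABEL:
        return ZH_TO_EN_LABEL[cleaned]
    if cleaned in ZH_TO_EN_LABEL.values():
        return cleaned
    raise ValueError(f"unknown relevance label: {label!r}")
```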
  
  
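  def classify_batch_simple_prompt(query: str, numbered_doc_lines: Sequence[str]) -> str:
      """Render the English simple-classification prompt for one query and its numbered product lines."""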
      lines = "\n".join(numbered_doc_lines)
      n = len(numbered_doc_lines)
      return _CLASSIFY_BATCH_SIMPLE_TEMPLATE.format(query=query, lines=lines, n=n)
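
# A short usage sketch for the function above. To stay self-contained it uses
# a shortened stand-in template (_DEMO_TEMPLATE) instead of the full one, and
# the "1. ", "2. " numbering of product lines is an assumption, since this
# file does not show how callers build numbered_doc_lines.

```python
from typing import Sequence

# Shortened stand-in for _CLASSIFY_BATCH_SIMPLE_TEMPLATE (hypothetical).
_DEMO_TEMPLATE = """Query: {query}

Products:
{lines}

Output exactly {n} lines.
"""


def demo_classify_prompt(query: str, numbered_doc_lines: Sequence[str]) -> str:
    # Same three steps as classify_batch_simple_prompt: join, count, format.
    lines = "\n".join(numbered_doc_lines)
    n = len(numbered_doc_lines)
    return _DEMO_TEMPLATE.format(query=query, lines=lines, n=n)


# Hypothetical product titles; the second one contains literal braces.
titles = ["Blue slim-fit T-shirt", "Tee with {logo} print"]
numbered = [f"{i}. {title}" for i, title in enumerate(titles, start=1)]
prompt = demo_classify_prompt("red slim-fit T-shirt", numbered)
print(prompt)
```

# str.format does not re-scan substituted values, so braces inside a query or
# a product title pass through unchanged. Only braces in the template text
# itself need to be doubled.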
  
  
  _EXTRACT_QUERY_PROFILE_TEMPLATE = """You are building a structured intent profile for e-commerce relevance judging.
  Use the original user query as the source of truth. Parser hints may help, but if a hint conflicts with the original query, trust the original query.
  Be conservative: only mark an attribute as required if the user explicitly asked for it.
  
  Return JSON with this schema:
  {{
    "normalized_query_en": string,
    "primary_category": string,
    "allowed_categories": [string],
    "required_attributes": [
      {{"name": string, "required_terms": [string], "conflicting_terms": [string], "match_mode": "explicit"}}
    ],
    "notes": [string]
  }}
  
  Guidelines:
  - Exact later will require explicit evidence for all required attributes.
  - allowed_categories should contain only near-synonyms of the same product type, not substitutes. For example dress can allow midi dress/cocktail dress, but not skirt, top, jumpsuit, or outfit unless the query explicitly asks for them.
  - If the query asks for dress/skirt/jeans/t-shirt, near but different product types are not Exact.
  - If the query includes color, fit, silhouette, or length, include them as required_attributes.
  - For fit words, include conflicting terms when obvious, e.g. fitted conflicts with oversized/loose; oversized conflicts with fitted/tight.
  - For color, include conflicting colors only when clear from the query.
  
  Original query: {query}
  Parser hints JSON: {hints_json}
  """
  
  
  def extract_query_profile_prompt(query: str, parser_hints: Dict[str, Any]) -> str:
      """Render the intent-profile extraction prompt; parser hints are embedded as JSON."""
      hints_json = json.dumps(parser_hints, ensure_ascii=False)
      return _EXTRACT_QUERY_PROFILE_TEMPLATE.format(query=query, hints_json=hints_json)
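
# The extraction prompt asks the model for JSON with a fixed schema. Below is
# a minimal sketch of how a caller might check a reply against that schema
# before using it. parse_query_profile and the two constants are hypothetical
# names; the real framework's validation, if any, is not shown in this file.

```python
import json
from typing import Any, Dict

# Top-level keys and JSON types taken from the schema in the prompt template.
_PROFILE_KEYS = {
    "normalized_query_en": str,
    "primary_category": str,
    "allowed_categories": list,
    "required_attributes": list,
    "notes": list,
}
# Keys each entry of "required_attributes" should carry, per the same schema.
_ATTRIBUTE_KEYS = ("name", "required_terms", "conflicting_terms", "match_mode")


def parse_query_profile(raw_json: str) -> Dict[str, Any]:
    """Parse a model reply and raise ValueError if it does not fit the schema."""
    profile = json.loads(raw_json)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(profile, dict):
        raise ValueError("profile must be a JSON object")
    for key, expected_type in _PROFILE_KEYS.items():
        if not isinstance(profile.get(key), expected_type):
            raise ValueError(f"{key!r} is missing or not a {expected_type.__name__}")
    for attribute in profile["required_attributes"]:
        if not isinstance(attribute, dict):
            raise ValueError("each required attribute must be a JSON object")
        missing = [key for key in _ATTRIBUTE_KEYS if key not in attribute]
        if missing:
            raise ValueError(f"required attribute is missing keys: {missing}")
    return profile
```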
  
  
  _CLASSIFY_BATCH_COMPLEX_TEMPLATE = """You are an e-commerce search relevance judge.
  Judge each product against the structured query profile below.
  
  Relevance rules:
  - Exact: product type matches the target intent, and every explicit required attribute is positively supported by the title/options/tags/category. If an attribute is missing or only guessed, it is NOT Exact.
  - Partial: main product type/use case matches, but some required attribute is missing, weaker, uncertain, or only approximately matched.
  - Irrelevant: product type/use case mismatched, or an explicit required attribute clearly conflicts.
  - Be conservative with Exact.
  - Graphic/holiday/message tees are not Exact for a plain color/style tee query unless that graphic/theme was requested.
  - Jumpsuit/romper/set is not Exact for dress/skirt/jeans queries.
  
  Original query: {query}
  Structured query profile JSON: {profile_json}
  
  Products:
  {lines}
  
  Return JSON only, with schema:
  {{"labels":[{{"index":1,"label":"Exact","reason":"short phrase"}}]}}
  """
  
  
  def classify_batch_complex_prompt(
      query: str,
      query_profile: Dict[str, Any],
      numbered_doc_lines: Sequence[str],
  ) -> str:
      """Render the profile-based classification prompt; the query profile is embedded as JSON."""
      lines = "\n".join(numbered_doc_lines)
      profile_json = json.dumps(query_profile, ensure_ascii=False)
      return _CLASSIFY_BATCH_COMPLEX_TEMPLATE.format(
          query=query,
          profile_json=profile_json,
          lines=lines,
      )
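
# The complex template asks for a JSON object of the form
# {"labels": [{"index": 1, "label": "Exact", "reason": "..."}]}. A minimal
# sketch of reading such a reply back into product order follows. It assumes
# "index" is 1-based and refers to the position of the product in the prompt,
# which the schema example suggests but this file does not state. The name
# parse_complex_labels is hypothetical.

```python
import json
from typing import Dict, List

# The three labels named in the relevance rules of the complex template.
COMPLEX_LABELS = ("Exact", "Partial", "Irrelevant")


def parse_complex_labels(raw_json: str, expected_count: int) -> List[Dict[str, str]]:
    """Return one {"label", "reason"} dict per product, ordered by index."""
    payload = json.loads(raw_json)
    if not isinstance(payload, dict) or not isinstance(payload.get("labels"), list):
        raise ValueError('reply must be a JSON object with a "labels" list')
    by_index: Dict[int, Dict[str, str]] = {}
    for item in payload["labels"]:
        if not isinstance(item, dict) or "index" not in item or "label" not in item:
            raise ValueError(f"malformed label entry: {item!r}")
        index = int(item["index"])
        if not 1 <= index <= expected_count:
            raise ValueError(f"index {index} is outside 1..{expected_count}")
        if item["label"] not in COMPLEX_LABELS:
            raise ValueError(f"unexpected label {item['label']!r}")
        by_index[index] = {"label": item["label"], "reason": str(item.get("reason", ""))}
    missing = [i for i in range(1, expected_count + 1) if i not in by_index]
    if missing:
        raise ValueError(f"no label returned for products: {missing}")
    return [by_index[i] for i in range(1, expected_count + 1)]
```

# Keying on "index" rather than list position is a deliberate choice in this
# sketch: it tolerates replies that list products out of order, while still
# failing loudly when a product is skipped.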