.cursor/plans/es-query-25a9f060.plan.检索表达式优化.ES_function表达式.md 8.38 KB
  <!-- 25a9f060-257b-486f-b598-bbb062d1adf9 c200c78c-4d12-4062-865a-fa2adf92bdd9 -->
  # ES Query Structure Refactoring and Ranking Optimization
  
  ## Core Changes
  
  ### 1. ES Query Structure Optimization (Option C)
  
  **Target structure**
  
  ```json
  {
    "query": {
      "function_score": {
        "query": {
          "bool": {
            "must": [
              {
                "bool": {
                  "should": [
                    { "multi_match": {...} },  // text query
                    { "knn": {...} }  // KNN query
                  ],
                  "minimum_should_match": 1  // at least one clause must match
                }
              }
            ],
            "filter": [...]  // filters apply to the whole query
          }
        },
        "functions": [
          {
            "filter": {"range": {"days_since_last_update": {"lte": 30}}},
            "weight": 1.1
          }
        ],
        "score_mode": "sum",
        "boost_mode": "multiply"
      }
    }
  }
  ```
  
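  **Key improvements**
  
  - The `filter` sits in the outer `bool`, so it applies to both the text query and the KNN query
  - The text and KNN queries sit in `should`; `minimum_should_match=1` ensures at least one of them matches
  - `function_score` wraps the whole query, which allows additional scoring factors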
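
To make the effect of `"score_mode": "sum"` and `"boost_mode": "multiply"` concrete, the sketch below models the arithmetic for filter-plus-weight functions only. The helper name `combined_score` is hypothetical, and the sketch assumes that a document matching none of the function filters receives a neutral function score of 1 (which is how Elasticsearch is understood to behave; worth confirming against the version in use).

```python
from typing import List


def combined_score(query_score: float, matched_weights: List[float],
                   score_mode: str = "sum", boost_mode: str = "multiply") -> float:
    """Model how function_score combines filter-weight functions (illustrative only)."""
    if not matched_weights:
        # Assumption: when no function filter matches, the factor is neutral (1).
        function_score = 1.0
    elif score_mode == "sum":
        function_score = sum(matched_weights)
    elif score_mode == "multiply":
        function_score = 1.0
        for weight in matched_weights:
            function_score *= weight
    else:
        raise ValueError(f"unsupported score_mode in this sketch: {score_mode}")

    if boost_mode == "multiply":
        return query_score * function_score
    if boost_mode == "sum":
        return query_score + function_score
    raise ValueError(f"unsupported boost_mode in this sketch: {boost_mode}")


# A document with query score 10.0 under the plan's settings:
no_match = combined_score(10.0, [])            # no function matches
recent = combined_score(10.0, [1.1])           # only the recency function matches
both = combined_score(10.0, [1.1, 1.05])       # two functions match, weights are summed
```

Under these assumptions, a document matching two functions is multiplied by 1.1 + 1.05 = 2.15 rather than 1.1 × 1.05, so if more factors are added later, `"score_mode": "multiply"` may be closer to the intended small boosts.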
  
  ### 2. 文件修改清单
  
  #### `/home/tw/SearchEngine/search/multilang_query_builder.py`
  
  **Modify the `build_multilang_query` method** (around lines 156-190):
  
  Current code (problematic):
  
  ```python
  es_query = {"size": size, "from": from_}
  
  if filters or range_filters:
      filter_clauses = self._build_filters(filters, range_filters)
      if filter_clauses:
          es_query["query"] = {
              "bool": {
                  "must": [query_clause],
                  "filter": filter_clauses
              }
          }
      else:
          es_query["query"] = query_clause
  else:
      es_query["query"] = query_clause
  
  # KNN sits at the top level, so the filter does not apply to it
  if enable_knn and query_vector is not None:
      es_query["knn"] = {...}
  ```
  
  Change to (Option C):
  
  ```python
  # Build the inner bool: at least one of text / KNN must match
  inner_bool_should = [query_clause]
  
  # If KNN is enabled, add it to should
  if enable_knn and query_vector is not None and self.text_embedding_field:
      knn_query = {
          "knn": {
              "field": self.text_embedding_field,
              "query_vector": query_vector.tolist(),
              "k": knn_k,
              "num_candidates": knn_num_candidates
          }
      }
      inner_bool_should.append(knn_query)
  
  # Assemble the inner bool structure
  inner_bool = {
      "bool": {
          "should": inner_bool_should,
          "minimum_should_match": 1
      }
  }
  
  # Build the outer bool, which carries the filter
  filter_clauses = self._build_filters(filters, range_filters) if (filters or range_filters) else []
  
  outer_bool = {
      "bool": {
          "must": [inner_bool]
      }
  }
  
  if filter_clauses:
      outer_bool["bool"]["filter"] = filter_clauses
  
  # Wrap everything in function_score
  function_score_query = {
      "function_score": {
          "query": outer_bool,
          "functions": self._build_score_functions(),
          "score_mode": "sum",
          "boost_mode": "multiply"
      }
  }
  
  es_query = {
      "size": size,
      "from": from_,
      "query": function_score_query
  }
  
  if min_score is not None:
      es_query["min_score"] = min_score
  ```
  
  **Add the `_build_score_functions` method**
  
  ```python
  def _build_score_functions(self) -> List[Dict[str, Any]]:
      """
      Build the list of scoring functions for function_score.
      
      Returns:
          List of scoring function clauses.
      """
      functions = []
      
      # Recency boost: recently updated products score higher
      functions.append({
          "filter": {
              "range": {
                  "days_since_last_update": {"lte": 30}
              }
          },
          "weight": 1.1
      })
      
      # More scoring factors can be added here
      # functions.append({
      #     "filter": {"term": {"is_video": True}},
      #     "weight": 1.05
      # })
      
      return functions
  ```
  
  #### `/home/tw/SearchEngine/search/ranking_engine.py` 
  
  **Rename to** `/home/tw/SearchEngine/search/rerank_engine.py`
  
  **Update the class name and docstrings**
  
  ```python
  """
  Reranking engine for post-processing search result scoring.
  
  Local rerank engine, used for a second-pass sort after ES returns results.
  Current status: disabled; ES function_score is preferred.
  """
  
  class RerankEngine:
      """
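      Local rerank engine (currently disabled)
      
      Function: re-score and re-sort the results returned by ES
      Use cases: complex custom ranking logic, real-time personalization, etc.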
      """
      
      def __init__(self, ranking_expression: str, enabled: bool = False):
          self.enabled = enabled
          self.ranking_expression = ranking_expression
          if enabled:
              self.parsed_terms = self._parse_expression(ranking_expression)
  ```
  
  #### `/home/tw/SearchEngine/search/__init__.py`
  
  Update the import:
  
  ```python
  from .rerank_engine import RerankEngine  # formerly RankingEngine
  ```
  
  #### `/home/tw/SearchEngine/search/searcher.py`
  
  **Update initialization** (around line 88):
  
  ```python
  # Switch to RerankEngine, disabled by default
  self.rerank_engine = RerankEngine(
      config.ranking.expression,
      enabled=False  # disabled for now
  )
  ```
  
  **Modify the rerank logic in the `search` method** (around lines 356-383):
  
  ```python
  # Apply local reranking (if enabled)
  if enable_rerank and self.rerank_engine.enabled:
      base_score = hit.get('_score') or 0.0
      knn_score = None
      
      # Check whether the query includes a KNN clause. `should` is a list of
      # dicts, so each clause is tested for a 'knn' key.
      should_clauses = (
          es_query.get('query', {}).get('function_score', {}).get('query', {})
          .get('bool', {}).get('must', [{}])[0].get('bool', {}).get('should', [])
      )
      if any('knn' in clause for clause in should_clauses):
          knn_score = base_score * 0.2
      
      custom_score = self.rerank_engine.calculate_score(
          hit,
          base_score,
          knn_score
      )
      result_doc['_custom_score'] = custom_score
      result_doc['_original_score'] = base_score
  
  hits.append(result_doc)
  
  # Re-sort (only when reranking is enabled)
  if enable_rerank and self.rerank_engine.enabled:
      hits.sort(key=lambda x: x.get('_custom_score', x['_score']), reverse=True)
      context.logger.info(
          "Local reranking finished | using RerankEngine",
          extra={'reqid': context.reqid, 'uid': context.uid}
      )
  ```
  
  #### `/home/tw/SearchEngine/config/schema/tenant1/config.yaml`
  
  **Add configuration entries** (after line 254):
  
  ```yaml
  # Ranking Configuration
  ranking:
    expression: "bm25() + 0.2*text_embedding_relevance()"
    description: "BM25 text relevance combined with semantic embedding similarity"
  
  # Reranking Configuration (local reranking)
  rerank:
    enabled: false
    expression: "bm25() + 0.2*text_embedding_relevance() + general_score*2"
    description: "Local reranking with custom scoring (currently disabled)"
  
  # Function Score Configuration (ES-level scoring)
  function_score:
    enabled: true
    functions:
      - name: "timeliness"
        type: "filter_weight"
        filter:
          range:
            days_since_last_update:
              lte: 30
        weight: 1.1
  ```
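
As one way the `function_score` section above could drive the query builder instead of the hard-coded list, here is a minimal sketch. The function name `build_score_functions` and the rule of skipping unknown `type` values are assumptions for illustration and are not part of the plan; the config is passed in as an already-parsed dict.

```python
from typing import Any, Dict, List


def build_score_functions(function_score_cfg: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Translate the `function_score` config section into ES function clauses."""
    if not function_score_cfg.get("enabled", False):
        return []

    functions: List[Dict[str, Any]] = []
    for entry in function_score_cfg.get("functions", []):
        # Only the `filter_weight` type from the config above is handled here;
        # anything else is skipped in this sketch.
        if entry.get("type") != "filter_weight":
            continue
        functions.append({"filter": entry["filter"], "weight": entry["weight"]})
    return functions


# Mirrors the YAML above after parsing.
example_cfg = {
    "enabled": True,
    "functions": [
        {
            "name": "timeliness",
            "type": "filter_weight",
            "filter": {"range": {"days_since_last_update": {"lte": 30}}},
            "weight": 1.1,
        }
    ],
}
example_functions = build_score_functions(example_cfg)
```

The `name` key is dropped on purpose, since the ES function clause only needs `filter` and `weight`.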
  
  #### `/home/tw/SearchEngine/config/tenant_config.py`
  
  **Update the configuration classes**
  
  ```python
  @dataclass
  class RerankConfig:
      """Local rerank configuration"""
      enabled: bool = False
      expression: str = ""
      description: str = ""
  
  @dataclass
  class FunctionScoreConfig:
      """ES function_score configuration"""
      enabled: bool = True
      functions: List[Dict[str, Any]] = field(default_factory=list)
  
  @dataclass
  class TenantConfig:
      # ... other fields ...
      ranking: RankingConfig  # kept for compatibility
      rerank: RerankConfig  # new
      function_score: FunctionScoreConfig  # new
  ```
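
A possible way to populate the two new config classes from the parsed YAML is sketched below. The loader name `load_scoring_config` is hypothetical, and the defaults simply mirror the dataclass defaults above, so a tenant config without the new sections still loads.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class RerankConfig:
    """Local rerank configuration"""
    enabled: bool = False
    expression: str = ""
    description: str = ""


@dataclass
class FunctionScoreConfig:
    """ES function_score configuration"""
    enabled: bool = True
    functions: List[Dict[str, Any]] = field(default_factory=list)


def load_scoring_config(raw: Dict[str, Any]) -> Tuple[RerankConfig, FunctionScoreConfig]:
    """Build both config objects from a parsed config dict (hypothetical helper)."""
    rerank_raw = raw.get("rerank") or {}
    fs_raw = raw.get("function_score") or {}

    rerank = RerankConfig(
        enabled=bool(rerank_raw.get("enabled", False)),
        expression=rerank_raw.get("expression", ""),
        description=rerank_raw.get("description", ""),
    )
    function_score = FunctionScoreConfig(
        enabled=bool(fs_raw.get("enabled", True)),
        functions=list(fs_raw.get("functions", [])),
    )
    return rerank, function_score


# An older config with neither section falls back to the defaults.
legacy_rerank, legacy_function_score = load_scoring_config({})
```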
  
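  ### 3. Testing and Verification
  
  **Test cases**
  
  1. Verify that filters apply to text-query results
  2. Verify that filters apply to KNN-recalled results
  3. Text match only
  4. KNN match only
  5. Both text and KNN match
  6. Verify that `function_score` scoring takes effect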
  
  **Verification command**
  
  ```bash
  curl -X POST http://localhost:6002/search/ \
    -H "Content-Type: application/json" \
    -d '{
      "query": "玩具",
      "filters": {"categoryName_keyword": "桌面休闲玩具"},
      "debug": true
    }'
  ```
  
  Check that the returned `debug_info.es_query` has the expected structure.
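
The structure check can also be automated. The sketch below takes the `es_query` dict from the debug output and reports deviations from the Option C layout; the function name `check_es_query_structure` and its two `expect_*` flags are assumptions for illustration.

```python
from typing import Any, Dict, List


def check_es_query_structure(es_query: Dict[str, Any],
                             expect_knn: bool,
                             expect_filter: bool) -> List[str]:
    """Return a list of problems; an empty list means the layout matches Option C."""
    problems: List[str] = []

    if "knn" in es_query:
        problems.append("top-level knn present; filters would not apply to it")

    function_score = es_query.get("query", {}).get("function_score")
    if function_score is None:
        return problems + ["query is not wrapped in function_score"]

    outer_bool = function_score.get("query", {}).get("bool", {})
    must = outer_bool.get("must", [])
    if len(must) != 1 or "bool" not in must[0]:
        return problems + ["outer bool.must should hold exactly one inner bool"]

    inner_bool = must[0]["bool"]
    if inner_bool.get("minimum_should_match") != 1:
        problems.append("inner bool should set minimum_should_match to 1")

    has_knn = any("knn" in clause for clause in inner_bool.get("should", []))
    if has_knn != expect_knn:
        problems.append(f"knn clause in should: expected {expect_knn}, found {has_knn}")

    if expect_filter and not outer_bool.get("filter"):
        problems.append("expected filter clauses in the outer bool")

    return problems
```

For the curl request above, which sends a filter, one would call it with `expect_filter=True` and set `expect_knn` according to whether KNN is enabled for the request.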
  
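  ### 4. Configuration Migration
  
  For the existing `ranking.expression` configuration, the recommendation is:
  
  - Keep the `ranking` section for documentation purposes
  - Add `rerank.enabled=false` to make the disabled state explicit
  - Add a `function_score` section for ES-level scoring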
  
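  ### 5. Future Optimization
  
  - Add more `function_score` factors as business needs arise
  - Enable `RerankEngine` later if complex personalized ranking is needed
  - Consider Elasticsearch's RRF (Reciprocal Rank Fusion) algorithm
  - Add an A/B testing framework to compare ranking strategies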
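
For reference on the RRF item, the fusion formula itself is small: each document scores the sum of `1 / (rank_constant + rank)` over the result lists it appears in. The sketch below is a client-side illustration only, not Elasticsearch's implementation; the function name `rrf_fuse` is hypothetical, and the constant 60 is the commonly used default.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], rank_constant: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit of each list contributes the most.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


text_hits = ["a", "b", "c"]   # ranked IDs from the text query
knn_hits = ["c", "a", "d"]    # ranked IDs from the KNN query
fused = rrf_fuse([text_hits, knn_hits])
```

Because RRF uses only ranks, it sidesteps the need to put BM25 and vector-similarity scores on a comparable scale, which is the main appeal over summing them inside one `bool` query.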
  
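  ## Implementation Steps
  
  1. Change the query-building logic in `multilang_query_builder.py`
  2. Rename `ranking_engine.py` to `rerank_engine.py`
  3. Update the call sites in `searcher.py`
  4. Update the configuration files
  5. Run the tests to verify
  6. Update the documentation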