# ES Query Structure Refactoring and Ranking Optimization

## Core Changes

### 1. ES Query Structure Optimization (Plan C)

**Target structure**:

```json
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            {
              "bool": {
                "should": [
                  { "multi_match": {...} },  // text query
                  { "knn": {...} }           // KNN query
                ],
                "minimum_should_match": 1    // at least one must match
              }
            }
          ],
          "filter": [...]  // filters apply to the whole query
        }
      },
      "functions": [
        {
          "filter": {"range": {"days_since_last_update": {"lte": 30}}},
          "weight": 1.1
        }
      ],
      "score_mode": "sum",
      "boost_mode": "multiply"
    }
  }
}
```

**Key improvements**:

- The filter sits in the outer `bool`, so it applies to both the text query and the KNN query
- The text and KNN clauses sit in `should`; `minimum_should_match: 1` ensures that at least one of them matches
- `function_score` wraps the whole query, which leaves room for additional scoring factors

### 2. File Modification List

#### `/home/tw/SearchEngine/search/multilang_query_builder.py`

**Modify the `build_multilang_query` method** (approximately lines 156-190):

Current code (problematic):

```python
es_query = {"size": size, "from": from_}

if filters or range_filters:
    filter_clauses = self._build_filters(filters, range_filters)
    if filter_clauses:
        es_query["query"] = {
            "bool": {
                "must": [query_clause],
                "filter": filter_clauses
            }
        }
    else:
        es_query["query"] = query_clause
else:
    es_query["query"] = query_clause

# KNN sits at the top level, so the filter does not apply to it
if enable_knn and query_vector is not None:
    es_query["knn"] = {...}
```

Modified to (Plan C):

```python
# Build the inner bool: text or KNN, at least one must match
inner_bool_should = [query_clause]

# If KNN is enabled, add it to the should clauses
if enable_knn and query_vector is not None and self.text_embedding_field:
    knn_query = {
        "knn": {
            "field": self.text_embedding_field,
            "query_vector": query_vector.tolist(),
            "k": knn_k,
            "num_candidates": knn_num_candidates
        }
    }
    inner_bool_should.append(knn_query)

# Build the inner bool structure
inner_bool = {
    "bool": {
        "should": inner_bool_should,
        "minimum_should_match": 1
    }
}

# Build the outer bool, which carries the filter
filter_clauses = self._build_filters(filters, range_filters) if (filters or range_filters) else []

outer_bool = {
    "bool": {
        "must": [inner_bool]
    }
}

if filter_clauses:
    outer_bool["bool"]["filter"] = filter_clauses

# Wrap everything in function_score
function_score_query = {
    "function_score": {
        "query": outer_bool,
        "functions": self._build_score_functions(),
        "score_mode": "sum",
        "boost_mode": "multiply"
    }
}

es_query = {
    "size": size,
    "from": from_,
    "query": function_score_query
}

if min_score is not None:
    es_query["min_score"] = min_score
```

**Add a new `_build_score_functions` method**:

```python
# Required at module level: from typing import Any, Dict, List

def _build_score_functions(self) -> List[Dict[str, Any]]:
    """
    Build the list of scoring functions for function_score.

    Returns:
        List of scoring functions
    """
    functions = []

    # Timeliness boost: recently updated products score higher
    functions.append({
        "filter": {
            "range": {
                "days_since_last_update": {"lte": 30}
            }
        },
        "weight": 1.1
    })

    # More scoring factors can be added here
    # functions.append({
    #     "filter": {"term": {"is_video": True}},
    #     "weight": 1.05
    # })

    return functions
```

#### `/home/tw/SearchEngine/search/ranking_engine.py`

**Rename to** `/home/tw/SearchEngine/search/rerank_engine.py`

**Update the class name and docstrings**:

```python
"""
Reranking engine for post-processing search result scoring.

Local reranking engine, used for a second sorting pass after ES returns results.
Current status: disabled; ES function_score is preferred.
"""

class RerankEngine:
    """
    Local reranking engine (currently disabled)

    Function: re-score and re-sort the results returned by ES
    Use cases: complex custom ranking logic, real-time personalization, etc.
    """

    def __init__(self, ranking_expression: str, enabled: bool = False):
        self.enabled = enabled
        self.ranking_expression = ranking_expression
        # Always define the attribute so a disabled engine does not raise AttributeError
        self.parsed_terms = None
        if enabled:
            self.parsed_terms = self._parse_expression(ranking_expression)
```

#### `/home/tw/SearchEngine/search/__init__.py`

Update the import:

```python
from .rerank_engine import RerankEngine  # formerly RankingEngine
```

#### `/home/tw/SearchEngine/search/searcher.py`

**Modify initialization** (around line 88):

```python
# Switch to RerankEngine, disabled by default
self.rerank_engine = RerankEngine(
    config.ranking.expression,
    enabled=False  # temporarily disabled
)
```

**Modify the rerank logic in the `search` method** (approximately lines 356-383):

```python
# Apply local reranking (if enabled)
if enable_rerank and self.rerank_engine.enabled:
    base_score = hit.get('_score') or 0.0
    knn_score = None

    # Check whether KNN was used. `should` is a list of clause dicts,
    # so each clause's keys must be tested; `'knn' in should` would always be False.
    should_clauses = (
        es_query.get('query', {})
        .get('function_score', {})
        .get('query', {})
        .get('bool', {})
        .get('must', [{}])[0]
        .get('bool', {})
        .get('should', [])
    )
    if any('knn' in clause for clause in should_clauses):
        knn_score = base_score * 0.2

    custom_score = self.rerank_engine.calculate_score(
        hit, base_score, knn_score
    )
    result_doc['_custom_score'] = custom_score
    result_doc['_original_score'] = base_score

hits.append(result_doc)

# Re-sort (only when enabled)
if enable_rerank and self.rerank_engine.enabled:
    hits.sort(key=lambda x: x.get('_custom_score', x['_score']), reverse=True)
    context.logger.info(
        "Local reranking finished | using RerankEngine",
        extra={'reqid': context.reqid, 'uid': context.uid}
    )
```

#### `/home/tw/SearchEngine/config/schema/tenant1/config.yaml`

**Add configuration items** (after line 254):

```yaml
# Ranking Configuration
ranking:
  expression: "bm25() + 0.2*text_embedding_relevance()"
  description: "BM25 text relevance combined with semantic embedding similarity"

# Reranking Configuration (local reranking)
rerank:
  enabled: false
  expression: "bm25() + 0.2*text_embedding_relevance() + general_score*2"
  description: "Local reranking with custom scoring (currently disabled)"

# Function Score Configuration (ES-level scoring)
function_score:
  enabled: true
  functions:
    - name: "timeliness"
      type: "filter_weight"
      filter:
        range:
          days_since_last_update:
            lte: 30
      weight: 1.1
```

#### `/home/tw/SearchEngine/config/tenant_config.py`

**Update the configuration classes**:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class RerankConfig:
    """Local reranking configuration"""
    enabled: bool = False
    expression: str = ""
    description: str = ""

@dataclass
class FunctionScoreConfig:
    """ES function_score configuration"""
    enabled: bool = True
    functions: List[Dict[str, Any]] = field(default_factory=list)

@dataclass
class TenantConfig:
    # ... other fields ...
    ranking: RankingConfig  # kept for compatibility
    rerank: RerankConfig  # new
    function_score: FunctionScoreConfig  # new
```

### 3. Test Verification

**Test cases**:

1. Verify that the filter applies to text query results
2. Verify that the filter applies to KNN recall results
3. Verify the case where only the text query matches
4. Verify the case where only KNN matches
5. Verify the case where both text and KNN match
6. Verify that `function_score` scoring takes effect

**Verification command** (the query `玩具` means "toys"; the category filter value is kept verbatim because it must match the indexed data):

```bash
curl -X POST http://localhost:6002/search/ \
  -H "Content-Type: application/json" \
  -d '{
    "query": "玩具",
    "filters": {"categoryName_keyword": "桌面休闲玩具"},
    "debug": true
  }'
```

Check that the returned `debug_info.es_query` has the target structure shown in section 1.

### 4. Configuration Migration

For the existing `ranking.expression` configuration, the recommendation is:

- Keep the `ranking` configuration for documentation purposes
- Add `rerank.enabled=false` to make the disabled state explicit
- Add the `function_score` configuration for ES-level scoring

### 5. Room for Future Optimization

- Add more `function_score` factors as business needs require
- Enable `RerankEngine` later if complex personalized ranking is needed
- Consider using ES's RRF (Reciprocal Rank Fusion) algorithm
- Add an A/B testing framework to compare ranking strategies

## Implementation Steps

1. Modify the query-building logic in `multilang_query_builder.py`
2. Rename `ranking_engine.py` to `rerank_engine.py`
3. Update the calls in `searcher.py`
4. Update the configuration files
5. Run the verification tests
6. Update the documentation
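
For the test-verification step, the shape of the generated query can also be checked offline, before sending anything to the cluster. The following is a minimal sketch, not the project's actual code: `build_plan_c_query` and `check_plan_c_structure` are hypothetical standalone helpers that mirror the Plan C assembly described above, and the field names `title` and `embedding` in the demo are placeholders. It also assumes an Elasticsearch version that accepts `knn` as a query clause inside `bool`.

```python
from typing import Any, Dict, List, Optional


def build_plan_c_query(
    text_clause: Dict[str, Any],
    score_functions: List[Dict[str, Any]],
    filter_clauses: Optional[List[Dict[str, Any]]] = None,
    knn_clause: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Assemble a Plan C query body (hypothetical helper, for illustration)."""
    # Inner bool: text clause plus optional KNN clause, at least one must match.
    should = [text_clause]
    if knn_clause is not None:
        should.append(knn_clause)
    inner_bool = {"bool": {"should": should, "minimum_should_match": 1}}

    # Outer bool: carries the filter so it constrains both text and KNN matches.
    outer_bool: Dict[str, Any] = {"bool": {"must": [inner_bool]}}
    if filter_clauses:
        outer_bool["bool"]["filter"] = filter_clauses

    return {
        "query": {
            "function_score": {
                "query": outer_bool,
                "functions": score_functions,
                "score_mode": "sum",
                "boost_mode": "multiply",
            }
        }
    }


def check_plan_c_structure(es_query: Dict[str, Any]) -> Dict[str, bool]:
    """Report which Plan C structural properties hold for a query body."""
    fs = es_query.get("query", {}).get("function_score", {})
    outer = fs.get("query", {}).get("bool", {})
    must = outer.get("must", [])
    inner = must[0].get("bool", {}) if must else {}
    should = inner.get("should", [])
    return {
        "wrapped_in_function_score": bool(fs),
        "filter_on_outer_bool": bool(outer.get("filter")),
        # `should` is a list of clause dicts, so inspect each clause's keys.
        "knn_in_should": any("knn" in clause for clause in should),
        "no_top_level_knn": "knn" not in es_query,
        "min_should_match_is_1": inner.get("minimum_should_match") == 1,
    }


if __name__ == "__main__":
    demo = build_plan_c_query(
        text_clause={"multi_match": {"query": "toy", "fields": ["title"]}},
        score_functions=[
            {
                "filter": {"range": {"days_since_last_update": {"lte": 30}}},
                "weight": 1.1,
            }
        ],
        filter_clauses=[{"term": {"categoryName_keyword": "example"}}],
        knn_clause={
            "knn": {
                "field": "embedding",
                "query_vector": [0.1, 0.2, 0.3],
                "num_candidates": 50,
            }
        },
    )
    print(check_plan_c_structure(demo))
```

The same checker could be pointed at the `debug_info.es_query` payload returned by the verification command. One design point worth confirming during testing: with `score_mode: "sum"`, the weights of all matching functions are added together before `boost_mode: "multiply"` applies them to the query score, so enabling a second factor such as the commented-out 1.05 weight would yield a combined multiplier of 2.15 for documents matching both filters.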