
<!-- 25a9f060-257b-486f-b598-bbb062d1adf9 c200c78c-4d12-4062-865a-fa2adf92bdd9 -->

ES Query Structure Refactoring and Ranking Optimization

Core Changes

1. ES Query Structure Optimization (Option C)

Target Structure

{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            {
              "bool": {
                "should": [
                  { "multi_match": {...} },  // text query
                  { "knn": {...} }  // KNN query
                ],
                "minimum_should_match": 1  // at least one clause must match
              }
            }
          ],
          "filter": [...]  // filters apply to the query as a whole
        }
      },
      "functions": [
        {
          "filter": {"range": {"days_since_last_update": {"lte": 30}}},
          "weight": 1.1
        }
      ],
      "score_mode": "sum",
      "boost_mode": "multiply"
    }
  }
}

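Key Improvements

  • The filter sits in the outer bool, so it applies to both the text query and the KNN query
  • The text and KNN queries sit in should, and minimum_should_match=1 ensures at least one of them matches
  • function_score wraps the whole query, which allows additional scoring factors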

2. File Change List

/home/tw/SearchEngine/search/multilang_query_builder.py

Modify the build_multilang_query method (around lines 156-190):

Current code (problematic):

es_query = {"size": size, "from": from_}

if filters or range_filters:
    filter_clauses = self._build_filters(filters, range_filters)
    if filter_clauses:
        es_query["query"] = {
            "bool": {
                "must": [query_clause],
                "filter": filter_clauses
            }
        }
    else:
        es_query["query"] = query_clause
else:
    es_query["query"] = query_clause

# KNN sits at the top level, so the filter does not apply to it
if enable_knn and query_vector is not None:
    es_query["knn"] = {...}

Change to (Option C):

# Build the inner bool: at least one of text or KNN must match
inner_bool_should = [query_clause]

# If KNN is enabled, add it to should
if enable_knn and query_vector is not None and self.text_embedding_field:
    knn_query = {
        "knn": {
            "field": self.text_embedding_field,
            "query_vector": query_vector.tolist(),
            "k": knn_k,
            "num_candidates": knn_num_candidates
        }
    }
    inner_bool_should.append(knn_query)

# Build the inner bool structure
inner_bool = {
    "bool": {
        "should": inner_bool_should,
        "minimum_should_match": 1
    }
}

# Build the outer bool, which carries the filter
filter_clauses = self._build_filters(filters, range_filters) if (filters or range_filters) else []

outer_bool = {
    "bool": {
        "must": [inner_bool]
    }
}

if filter_clauses:
    outer_bool["bool"]["filter"] = filter_clauses

# Wrap everything in function_score
function_score_query = {
    "function_score": {
        "query": outer_bool,
        "functions": self._build_score_functions(),
        "score_mode": "sum",
        "boost_mode": "multiply"
    }
}

es_query = {
    "size": size,
    "from": from_,
    "query": function_score_query
}

if min_score is not None:
    es_query["min_score"] = min_score

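Add a new _build_score_functions method

# Needs `from typing import Any, Dict, List` at the top of the module
def _build_score_functions(self) -> List[Dict[str, Any]]:
    """
    Build the list of scoring functions for function_score.

    Returns:
        List of scoring functions
    """
    functions = []

    # Timeliness boost: recently updated products score higher
    functions.append({
        "filter": {
            "range": {
                "days_since_last_update": {"lte": 30}
            }
        },
        "weight": 1.1
    })

    # More scoring factors can be added here
    # functions.append({
    #     "filter": {"term": {"is_video": True}},
    #     "weight": 1.05
    # })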

    return functions
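
To make the effect of "score_mode": "sum" and "boost_mode": "multiply" concrete, here is a minimal sketch that mimics the combination in plain Python for filter-plus-weight functions. The helper combined_score and its (matched, weight) tuples are hypothetical and exist only for illustration. The sketch assumes that a document matching none of the functions receives a neutral function score of 1, so its query score is left unchanged.

```python
from typing import List, Tuple


def combined_score(query_score: float, functions: List[Tuple[bool, float]]) -> float:
    """Mimic score_mode="sum" and boost_mode="multiply" for filter+weight functions.

    Each entry is (matched, weight): whether the function's filter matched
    the document, and the weight it contributes when it did.
    """
    matched_weights = [weight for matched, weight in functions if matched]
    # Assumption: when no function matches, the function score is neutral (1.0).
    function_score = sum(matched_weights) if matched_weights else 1.0
    # boost_mode="multiply": the query score is multiplied by the function score.
    return query_score * function_score


# Updated within 30 days: only the timeliness function matches.
print(combined_score(10.0, [(True, 1.1)]))
# Not updated recently: no function matches, so the score is unchanged.
print(combined_score(10.0, [(False, 1.1)]))
# Two matching functions are summed (1.1 + 1.05), not multiplied together.
print(combined_score(10.0, [(True, 1.1), (True, 1.05)]))
```

Under sum, a document matching both a 1.1 and a 1.05 function would be boosted by about 2.15, while a document matching only one is boosted by about 1.1. If the commented-out factors are enabled later, it may be worth checking whether "score_mode": "multiply" is closer to the intended behavior.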

/home/tw/SearchEngine/search/ranking_engine.py

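Rename to /home/tw/SearchEngine/search/rerank_engine.py

Update the class name and documentation

"""
Reranking engine for post-processing search result scoring.

Local reranking engine, used for second-pass ranking of results returned by ES.
Current status: disabled; ES function_score is preferred.
"""

class RerankEngine:
    """
    Local reranking engine (currently disabled).

    Function: second-pass scoring and sorting of results returned by ES
    Use cases: complex custom ranking logic, real-time personalization, etc.
    """

    def __init__(self, ranking_expression: str, enabled: bool = False):
        self.enabled = enabled
        self.ranking_expression = ranking_expression
        # Parse only when enabled, but always define the attribute so a disabled
        # engine never raises AttributeError (assumes parsed terms are a list)
        self.parsed_terms = self._parse_expression(ranking_expression) if enabled else []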

/home/tw/SearchEngine/search/__init__.py

Update the import:

from .rerank_engine import RerankEngine  # formerly RankingEngine

/home/tw/SearchEngine/search/searcher.py

Update the initialization (around line 88):

# Switch to RerankEngine, disabled by default
self.rerank_engine = RerankEngine(
    config.ranking.expression,
    enabled=False  # disabled for now
)

Update the rerank logic in the search method (around lines 356-383):

# Apply local reranking (if enabled)
if enable_rerank and self.rerank_engine.enabled:
    base_score = hit.get('_score') or 0.0
    knn_score = None

    # Check whether KNN was used: each should clause is a dict, so look for
    # a clause keyed by 'knn' rather than testing the list for the string
    must_clauses = (
        es_query.get('query', {}).get('function_score', {})
        .get('query', {}).get('bool', {}).get('must') or [{}]
    )
    should_clauses = must_clauses[0].get('bool', {}).get('should', [])
    if any('knn' in clause for clause in should_clauses):
        knn_score = base_score * 0.2

    custom_score = self.rerank_engine.calculate_score(
        hit,
        base_score,
        knn_score
    )
    result_doc['_custom_score'] = custom_score
    result_doc['_original_score'] = base_score

hits.append(result_doc)

# Re-sort (only when enabled)
if enable_rerank and self.rerank_engine.enabled:
    hits.sort(key=lambda x: x.get('_custom_score', x['_score']), reverse=True)
    context.logger.info(
        "Local reranking complete | using RerankEngine",
        extra={'reqid': context.reqid, 'uid': context.uid}
    )

/home/tw/SearchEngine/config/schema/customer1/config.yaml

Add the configuration entries (after line 254):

# Ranking Configuration
ranking:
  expression: "bm25() + 0.2*text_embedding_relevance()"
  description: "BM25 text relevance combined with semantic embedding similarity"

# Reranking Configuration (local reranking)
rerank:
  enabled: false
  expression: "bm25() + 0.2*text_embedding_relevance() + general_score*2"
  description: "Local reranking with custom scoring (currently disabled)"

# Function Score Configuration (ES-level scoring)
function_score:
  enabled: true
  functions:
    - name: "timeliness"
      type: "filter_weight"
      filter:
        range:
          days_since_last_update:
            lte: 30
      weight: 1.1

/home/tw/SearchEngine/config/customer_config.py

Update the configuration classes

from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class RerankConfig:
    """Local reranking configuration"""
    enabled: bool = False
    expression: str = ""
    description: str = ""

@dataclass
class FunctionScoreConfig:
    """ES function_score configuration"""
    enabled: bool = True
    functions: List[Dict[str, Any]] = field(default_factory=list)

@dataclass
class CustomerConfig:
    # ... other fields ...
    ranking: RankingConfig  # kept for compatibility
    rerank: RerankConfig  # new
    function_score: FunctionScoreConfig  # new
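
The function_score entries in config.yaml carry name and type keys, which are bookkeeping fields rather than part of an ES function clause. The sketch below shows one possible way to turn those entries into the functions list. The helper to_es_functions is hypothetical, and it assumes that "filter_weight" means a plain filter plus weight pair and that the YAML has already been parsed into a dict.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FunctionScoreConfig:
    """ES function_score configuration (mirrors the class in customer_config.py)."""
    enabled: bool = True
    functions: List[Dict[str, Any]] = field(default_factory=list)


def to_es_functions(cfg: FunctionScoreConfig) -> List[Dict[str, Any]]:
    """Hypothetical helper: convert config entries into ES function clauses."""
    if not cfg.enabled:
        return []
    es_functions = []
    for entry in cfg.functions:
        # Assumption: "filter_weight" is the only supported type for now.
        if entry.get("type") != "filter_weight":
            raise ValueError(f"unsupported function type: {entry.get('type')!r}")
        # Keep only the filter and weight; "name" and "type" are config-only keys.
        es_functions.append({"filter": entry["filter"], "weight": entry["weight"]})
    return es_functions


cfg = FunctionScoreConfig(functions=[{
    "name": "timeliness",
    "type": "filter_weight",
    "filter": {"range": {"days_since_last_update": {"lte": 30}}},
    "weight": 1.1,
}])
print(to_es_functions(cfg))
```

With a helper along these lines, _build_score_functions could read its factors from configuration instead of hard-coding them, and function_score.enabled=false would yield an empty functions list.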

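3. Testing and Verification

Test Cases

  1. Verify that the filter applies to text query results
  2. Verify that the filter applies to KNN recall results
  3. Test the case where only the text query matches
  4. Test the case where only KNN matches
  5. Test the case where both text and KNN match
  6. Verify that function_score scoring takes effect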
Verification Command

curl -X POST http://localhost:6002/search/ \
  -H "Content-Type: application/json" \
  -d '{
    "query": "玩具",
    "filters": {"categoryName_keyword": "桌面休闲玩具"},
    "debug": true
  }'

Check that the debug_info.es_query structure in the response is correct.
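
The structural check can also be automated. The following is a minimal sketch that inspects an es_query dict, for example the one taken from debug_info.es_query, and reports deviations from the target structure. The function check_es_query and the sample query are hypothetical and not part of the codebase.

```python
from typing import Any, Dict, List


def check_es_query(es_query: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in the query structure (empty if none)."""
    function_score = es_query.get("query", {}).get("function_score")
    if function_score is None:
        return ["query is not wrapped in function_score"]
    problems = []
    outer_bool = function_score.get("query", {}).get("bool", {})
    must = outer_bool.get("must") or []
    inner_bool = must[0].get("bool", {}) if must else {}
    if not inner_bool.get("should"):
        problems.append("inner bool has no should clauses")
    if inner_bool.get("minimum_should_match") != 1:
        problems.append("minimum_should_match is not 1")
    if "knn" in es_query:
        problems.append("knn is still at the top level, outside the filter")
    if function_score.get("boost_mode") != "multiply":
        problems.append("unexpected boost_mode")
    return problems


# Hypothetical sample shaped like the target structure.
sample = {
    "query": {"function_score": {
        "query": {"bool": {
            "must": [{"bool": {
                "should": [{"multi_match": {"query": "toy"}}],
                "minimum_should_match": 1,
            }}],
            "filter": [{"term": {"categoryName_keyword": "toy"}}],
        }},
        "functions": [],
        "score_mode": "sum",
        "boost_mode": "multiply",
    }}
}
print(check_es_query(sample))
```

Returning a list of problems rather than raising on the first one makes it easier to see every deviation in a single run.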

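4. Configuration Migration

For the existing ranking.expression configuration, the recommendation is:

  • Keep the ranking configuration for documentation purposes
  • Add rerank.enabled=false to make the disabled state explicit
  • Add the function_score configuration for ES-level scoring

5. Room for Future Optimization

  • Add more function_score factors as business needs require
  • Enable RerankEngine later if complex personalized ranking is needed
  • Consider using ES's RRF (Reciprocal Rank Fusion) algorithm
  • Add an A/B testing framework to compare ranking strategies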
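
For context on the RRF option, here is a minimal plain-Python sketch of Reciprocal Rank Fusion, which merges the text ranking and the KNN ranking by rank position instead of by raw score. The function rrf_fuse and the sample document IDs are hypothetical. The default rank_constant of 60 follows the commonly used value and is an assumption about what the plan would adopt.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], rank_constant: int = 60) -> List[str]:
    """Fuse ranked ID lists with Reciprocal Rank Fusion.

    Each document scores sum(1 / (rank_constant + rank)) over the lists
    it appears in, with rank starting at 1.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest fused score first; ties are broken by doc ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


text_ranking = ["a", "b", "c"]  # hypothetical BM25 order
knn_ranking = ["c", "a", "d"]   # hypothetical KNN order
print(rrf_fuse([text_ranking, knn_ranking]))
```

Because only rank positions are used, this approach avoids having to put BM25 and vector similarity scores on a common scale, which is the problem the fixed 0.2 weighting in ranking.expression tries to address.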

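Implementation Steps

  1. Modify the query-building logic in multilang_query_builder.py
  2. Rename ranking_engine.py to rerank_engine.py
  3. Update the calls in searcher.py
  4. Update the configuration files
  5. Run the tests to verify
  6. Update the documentation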