ES Query Structure Refactoring and Ranking Optimization
Core changes
1. ES query structure optimization (Option C)
Target structure:
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            {
              "bool": {
                "should": [
                  { "multi_match": {...} },  // text query
                  { "knn": {...} }           // KNN query
                ],
                "minimum_should_match": 1    // at least one clause must match
              }
            }
          ],
          "filter": [...]                    // filter applies to the whole query
        }
      },
      "functions": [
        {
          "filter": {"range": {"days_since_last_update": {"lte": 30}}},
          "weight": 1.1
        }
      ],
      "score_mode": "sum",
      "boost_mode": "multiply"
    }
  }
}
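Key improvements:
- The filter lives in the outer bool, so it applies to both the text and the KNN clause (for KNN it acts as a post-filter on the retrieved neighbors, so fewer than k KNN hits may survive)
- The text and KNN clauses sit in should, and minimum_should_match=1 ensures that at least one of them matches
- function_score wraps the whole query, which leaves room for additional scoring factors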
2. Files to modify
/home/tw/SearchEngine/search/multilang_query_builder.py
Modify the build_multilang_query method (around lines 156-190):
Current code (problematic):
es_query = {"size": size, "from": from_}
if filters or range_filters:
    filter_clauses = self._build_filters(filters, range_filters)
    if filter_clauses:
        es_query["query"] = {
            "bool": {
                "must": [query_clause],
                "filter": filter_clauses
            }
        }
    else:
        es_query["query"] = query_clause
else:
    es_query["query"] = query_clause
# KNN sits at the top level, so the filter does not apply to it
if enable_knn and query_vector is not None:
    es_query["knn"] = {...}
Change to (Option C):
# Build the inner bool: text and KNN, at least one must match
inner_bool_should = [query_clause]
# If KNN is enabled, add it to should
if enable_knn and query_vector is not None and self.text_embedding_field:
    # Check that the deployed ES version accepts `k` in a knn query clause
    knn_query = {
        "knn": {
            "field": self.text_embedding_field,
            "query_vector": query_vector.tolist(),
            "k": knn_k,
            "num_candidates": knn_num_candidates
        }
    }
    inner_bool_should.append(knn_query)
# Build the inner bool structure
inner_bool = {
    "bool": {
        "should": inner_bool_should,
        "minimum_should_match": 1
    }
}
# Build the outer bool, which carries the filter
filter_clauses = self._build_filters(filters, range_filters) if (filters or range_filters) else []
outer_bool = {
    "bool": {
        "must": [inner_bool]
    }
}
if filter_clauses:
    outer_bool["bool"]["filter"] = filter_clauses
# Wrap everything in function_score
function_score_query = {
    "function_score": {
        "query": outer_bool,
        "functions": self._build_score_functions(),
        "score_mode": "sum",
        "boost_mode": "multiply"
    }
}
es_query = {
    "size": size,
    "from": from_,
    "query": function_score_query
}
if min_score is not None:
    es_query["min_score"] = min_score
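Add a new _build_score_functions method:
# Assumes `from typing import Any, Dict, List` at the top of the module
def _build_score_functions(self) -> List[Dict[str, Any]]:
    """
    Build the list of scoring functions for function_score.
    Returns:
        List of scoring functions
    """
    functions = []
    # Timeliness boost: recently updated products score higher
    functions.append({
        "filter": {
            "range": {
                "days_since_last_update": {"lte": 30}
            }
        },
        "weight": 1.1
    })
    # More scoring factors can be added here
    # functions.append({
    #     "filter": {"term": {"is_video": True}},
    #     "weight": 1.05
    # })
    return functions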
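Before enabling a second factor such as the commented-out is_video weight, it helps to work through how score_mode "sum" and boost_mode "multiply" combine. The sketch below is a simplified local model, not ES code; combined_score is a hypothetical helper, and it assumes that a document matching none of the functions keeps a neutral factor of 1.0, which is worth verifying against the deployed ES version.

```python
from typing import List


def combined_score(query_score: float, matched_weights: List[float]) -> float:
    """Model score_mode="sum" with boost_mode="multiply" for filter/weight functions.

    Assumption: a document that matches no function keeps a neutral factor of 1.0.
    """
    factor = sum(matched_weights) if matched_weights else 1.0
    return query_score * factor


base = 10.0
print(combined_score(base, []))           # document not updated in the last 30 days
print(combined_score(base, [1.1]))        # recently updated document
print(combined_score(base, [1.1, 1.05]))  # matches both functions, factor is about 2.15
```

Under this model the weights add up rather than compound, so a document matching both functions is boosted by roughly 2.15x instead of about 1.16x. If small compounding boosts are the intent, score_mode "multiply" may be the closer fit.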
/home/tw/SearchEngine/search/ranking_engine.py
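Rename to /home/tw/SearchEngine/search/rerank_engine.py
Update the class name and docstrings:
"""
Reranking engine for post-processing search result scoring.
Local rerank engine that re-sorts results after ES returns them.
Current status: disabled; ES function_score is preferred.
"""
class RerankEngine:
    """
    Local rerank engine (currently disabled)
    Function: re-score and re-sort the results returned by ES
    Use cases: complex custom ranking logic, real-time personalization, etc.
    """
    def __init__(self, ranking_expression: str, enabled: bool = False):
        self.enabled = enabled
        self.ranking_expression = ranking_expression
        # The expression is parsed only when the engine is enabled
        if enabled:
            self.parsed_terms = self._parse_expression(ranking_expression)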
/home/tw/SearchEngine/search/__init__.py
Update the import:
from .rerank_engine import RerankEngine  # formerly RankingEngine
/home/tw/SearchEngine/search/searcher.py
Modify the initialization (around line 88):
# Switch to RerankEngine, disabled by default
self.rerank_engine = RerankEngine(
    config.ranking.expression,
    enabled=False  # disabled for now
)
Modify the rerank logic in the search method (around lines 356-383):
# Inside the per-hit loop: apply local reranking (if enabled)
if enable_rerank and self.rerank_engine.enabled:
    base_score = hit.get('_score') or 0.0
    knn_score = None
    # Check whether KNN was used: look for a clause keyed by 'knn' in the
    # inner should list (testing 'knn' against the list itself never matches)
    should_clauses = (
        es_query.get('query', {}).get('function_score', {}).get('query', {})
        .get('bool', {}).get('must', [{}])[0].get('bool', {}).get('should', [])
    )
    if any('knn' in clause for clause in should_clauses):
        knn_score = base_score * 0.2
    custom_score = self.rerank_engine.calculate_score(
        hit,
        base_score,
        knn_score
    )
    result_doc['_custom_score'] = custom_score
    result_doc['_original_score'] = base_score
hits.append(result_doc)
# After the loop: re-sort (only when enabled)
if enable_rerank and self.rerank_engine.enabled:
    hits.sort(key=lambda x: x.get('_custom_score', x['_score']), reverse=True)
    context.logger.info(
        "Local rerank complete | using RerankEngine",
        extra={'reqid': context.reqid, 'uid': context.uid}
    )
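The KNN check above walks a deeply nested structure, which is easy to get wrong inline. One option is to pull it into a small helper. The sketch below is only an illustration: query_uses_knn is a hypothetical name, and it assumes the Option C layout (function_score > bool.must > bool.should) described in section 1.

```python
from typing import Any, Dict


def query_uses_knn(es_query: Dict[str, Any]) -> bool:
    """Return True if an inner should list of the Option C query holds a knn clause."""
    outer_bool = (
        es_query.get("query", {})
        .get("function_score", {})
        .get("query", {})
        .get("bool", {})
    )
    # Iterating over must avoids indexing [0] on a possibly empty list.
    for must_clause in outer_bool.get("must", []):
        should_clauses = must_clause.get("bool", {}).get("should", [])
        if any("knn" in clause for clause in should_clauses):
            return True
    return False


hybrid_query = {
    "query": {
        "function_score": {
            "query": {
                "bool": {
                    "must": [
                        {"bool": {"should": [{"multi_match": {}}, {"knn": {}}]}}
                    ]
                }
            }
        }
    }
}
print(query_uses_knn(hybrid_query))
```

The helper returns False for a text-only query and for any query that does not follow the Option C layout, so the rerank path then simply leaves knn_score as None.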
/home/tw/SearchEngine/config/schema/tenant1/config.yaml
Add configuration entries (after line 254):
# Ranking Configuration
ranking:
  expression: "bm25() + 0.2*text_embedding_relevance()"
  description: "BM25 text relevance combined with semantic embedding similarity"
# Reranking Configuration (local rerank)
rerank:
  enabled: false
  expression: "bm25() + 0.2*text_embedding_relevance() + general_score*2"
  description: "Local reranking with custom scoring (currently disabled)"
# Function Score Configuration (ES-level scoring)
function_score:
  enabled: true
  functions:
    - name: "timeliness"
      type: "filter_weight"
      filter:
        range:
          days_since_last_update:
            lte: 30
      weight: 1.1
/home/tw/SearchEngine/config/tenant_config.py
Update the config classes:
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class RerankConfig:
    """Local rerank configuration"""
    enabled: bool = False
    expression: str = ""
    description: str = ""

@dataclass
class FunctionScoreConfig:
    """ES function_score configuration"""
    enabled: bool = True
    functions: List[Dict[str, Any]] = field(default_factory=list)

@dataclass
class TenantConfig:
    # ... other fields ...
    ranking: RankingConfig  # kept for backward compatibility
    rerank: RerankConfig  # new
    function_score: FunctionScoreConfig  # new
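The plan hardcodes the timeliness function in _build_score_functions while also declaring it under function_score.functions in the YAML. A minimal sketch of how the configured entries could be turned into ES function clauses is shown below. The function name build_score_functions and the rule of skipping unknown types are assumptions for illustration, not part of the plan.

```python
from typing import Any, Dict, List


def build_score_functions(function_configs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Convert configured scoring entries into function_score "functions" clauses."""
    functions = []
    for cfg in function_configs:
        # Only the "filter_weight" type from the YAML config is handled here;
        # skipping other types is an assumption of this sketch.
        if cfg.get("type") != "filter_weight":
            continue
        # ES does not need the descriptive "name" and "type" keys, so drop them.
        functions.append({"filter": cfg["filter"], "weight": cfg["weight"]})
    return functions


timeliness = {
    "name": "timeliness",
    "type": "filter_weight",
    "filter": {"range": {"days_since_last_update": {"lte": 30}}},
    "weight": 1.1,
}
print(build_score_functions([timeliness]))
```

Driving the list from FunctionScoreConfig.functions would let each tenant tune its boosts without code changes, at the cost of needing validation for malformed entries.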
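3. Testing and verification
Test cases:
- Verify that the filter applies to text query results
- Verify that the filter applies to KNN recall results
- Verify the case where only the text clause matches
- Verify the case where only the KNN clause matches
- Verify the case where both text and KNN match
- Verify that function_score scoring takes effect
Verification command:
curl -X POST http://localhost:6002/search/ \
  -H "Content-Type: application/json" \
  -d '{
    "query": "玩具",
    "filters": {"categoryName_keyword": "桌面休闲玩具"},
    "debug": true
  }'
Check that the structure of debug_info.es_query in the response matches the target structure.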
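The structural check on debug_info.es_query can also be automated. The following is a rough sketch, assuming the Option C layout from section 1; check_es_query_structure is a hypothetical helper and the problem messages are illustrative.

```python
from typing import Any, Dict, List


def check_es_query_structure(es_query: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the layout matches Option C."""
    problems: List[str] = []
    if "knn" in es_query:
        problems.append("top-level knn present; the filter would not apply to it")
    function_score = es_query.get("query", {}).get("function_score")
    if function_score is None:
        problems.append("query is not wrapped in function_score")
        return problems
    outer_bool = function_score.get("query", {}).get("bool", {})
    must = outer_bool.get("must", [])
    if len(must) != 1 or "bool" not in must[0]:
        problems.append("outer bool.must should hold exactly one inner bool")
        return problems
    inner_bool = must[0]["bool"]
    if not inner_bool.get("should"):
        problems.append("inner bool.should is empty")
    if inner_bool.get("minimum_should_match") != 1:
        problems.append("inner minimum_should_match is not 1")
    return problems


good_query = {
    "query": {
        "function_score": {
            "query": {
                "bool": {
                    "must": [
                        {
                            "bool": {
                                "should": [{"multi_match": {}}, {"knn": {}}],
                                "minimum_should_match": 1,
                            }
                        }
                    ],
                    "filter": [{"term": {"categoryName_keyword": "x"}}],
                }
            },
            "functions": [],
        }
    }
}
print(check_es_query_structure(good_query))
```

A test could call this on the es_query taken from the debug output and assert that the returned list is empty.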
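4. Configuration migration
For the existing ranking.expression configuration, the recommendations are:
- Keep the ranking configuration for documentation purposes
- Add rerank.enabled=false to make the disabled state explicit
- Add the function_score configuration for ES-level scoring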
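During migration, existing tenant configs will have a ranking section but no rerank or function_score sections yet. The sketch below shows one way to load them with the defaults from the dataclasses in section 2, so that old files keep working. parse_scoring_sections is a hypothetical helper, and the raw dict is assumed to be the already-parsed YAML.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class RerankConfig:
    enabled: bool = False
    expression: str = ""
    description: str = ""


@dataclass
class FunctionScoreConfig:
    enabled: bool = True
    functions: List[Dict[str, Any]] = field(default_factory=list)


def parse_scoring_sections(raw: Dict[str, Any]) -> Tuple[RerankConfig, FunctionScoreConfig]:
    """Read the rerank and function_score sections, falling back to defaults."""
    # `or {}` also covers a section that is present but empty (parsed as None).
    rerank_raw = raw.get("rerank") or {}
    fs_raw = raw.get("function_score") or {}
    rerank = RerankConfig(
        enabled=bool(rerank_raw.get("enabled", False)),
        expression=rerank_raw.get("expression", ""),
        description=rerank_raw.get("description", ""),
    )
    function_score = FunctionScoreConfig(
        enabled=bool(fs_raw.get("enabled", True)),
        functions=list(fs_raw.get("functions") or []),
    )
    return rerank, function_score


legacy_config = {"ranking": {"expression": "bm25() + 0.2*text_embedding_relevance()"}}
rerank, function_score = parse_scoring_sections(legacy_config)
print(rerank.enabled, function_score.enabled, function_score.functions)
```

With a legacy config, rerank stays disabled and function_score is enabled with an empty function list, which matches the defaults declared on the dataclasses.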
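5. Room for future optimization
- Add more function_score factors as business needs require
- Enable RerankEngine later if complex personalized ranking is needed
- Consider using ES's RRF (Reciprocal Rank Fusion) algorithm
- Add an A/B testing framework to compare ranking strategies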
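To make the RRF idea above concrete, here is a small local sketch of the fusion formula, where each document scores the sum of 1 / (rank_constant + rank) over the result lists it appears in. This illustrates the algorithm only and is not the ES API; rrf_fuse and the sample document ids are hypothetical, and 60 is the rank constant commonly used for RRF.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], rank_constant: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit of each list contributes 1 / (rank_constant + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


text_hits = ["a", "b", "c"]  # ranked by the text query
knn_hits = ["c", "a", "d"]   # ranked by the KNN query
print(rrf_fuse([text_hits, knn_hits]))
```

Because RRF uses only ranks, it sidesteps the problem of BM25 and vector similarity scores living on different scales, which is the main difficulty when both clauses are summed inside one bool query.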
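Implementation steps
- Modify the query-building logic in multilang_query_builder.py
- Rename ranking_engine.py to rerank_engine.py
- Update the calls in searcher.py
- Update the configuration files
- Run the tests to verify
- Update the documentation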