Commit c34254296248008e62b536e42625b4dee05e9953
1 parent
daa2690b
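Completed the fine-rank/fusion cleanup in [search/rerank_client.py](/data/saas-search/search/rerank_client.py), [search/searcher.py](/data/saas-search/search/searcher.py), [frontend/static/js/app.js](/data/saas-search/frontend/static/js/app.js), and [tests/test_rerank_client.py](/data/saas-search/tests/test_rerank_client.py).

Main fixes:
- Fine rank now sorts by the fused stage score instead of the raw `fine_score` alone.
- Final rerank no longer depends on a separate `fine_scores` array (which could fall out of sync after the fine-rank sort); it reads the `_fine_score` attached to each hit directly.
- Fine rank and final rerank both generate their fusion debug info through the same computation path that decides the actual ordering, so the recorded logic stays consistent with the production logic.
- The debug payload is clearer: both fine rank and final rerank expose the fusion inputs/factors and a canonical `fusion_summary`, and the frontend now renders that summary.

Main problem: stage logic was duplicated and ran over parallel data channels: one channel computed the ranking, another assembled the debug fields, and a third passed auxiliary arrays. That created a risk of divergence. This refactor lowers the risk by making the stage score the single source of truth and having debug/frontend consume its output directly instead of reconstructing it afterwards.

Verification:
- `./.venv/bin/python -m pytest -q tests/test_rerank_client.py tests/test_search_rerank_window.py`
- `./.venv/bin/python -m py_compile search/rerank_client.py search/searcher.py`

Result: `22 passed`.

Current main flow:
1. Query parsing
2. ES recall
3. Coarse rank: uses only ES-internal text/KNN signals
4. Style SKU selection + title suffix
5. Fine rank: lightweight reranker + text/KNN fusion
6. Final rerank: heavy reranker + fine score + text/KNN fusion
7. Pagination, field backfill, response formatting

The orchestration code is in [searcher.py](/data/saas-search/search/searcher.py), scoring and rerank details are in [rerank_client.py](/data/saas-search/search/rerank_client.py), and the config definitions are in [schema.py](/data/saas-search/config/schema.py) and [config.yaml](/data/saas-search/config/config.yaml).

**First, how the entry point picks a path**
Starting at [searcher.py:348](/data/saas-search/search/searcher.py#L348), `search()` first reads the tenant language, feature switches, and window sizes. The key decision is at [searcher.py:364](/data/saas-search/search/searcher.py#L364) through [searcher.py:372](/data/saas-search/search/searcher.py#L372):
- `rerank_window` is currently 80, see [config.yaml:256](/data/saas-search/config/config.yaml#L256)
- `coarse_rank.input_window` is 700 and `output_window` is 240, see [config.yaml:231](/data/saas-search/config/config.yaml#L231)
- `fine_rank.input_window` is 240 and `output_window` is 80, see [config.yaml:245](/data/saas-search/config/config.yaml#L245)

So a request that satisfies `from_ + size <= rerank_window` goes through the full funnel:
- ES actually fetches the top `700`
- coarse rank keeps `240`
- fine rank keeps `80`
- final rerank also processes only those `80`
- the page slice is taken last

If the requested page extends past 80, the later multi-stage funnel is skipped and the results come back in plain ES order.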
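
The routing rule and window sizes described in the commit message can be sketched as follows. This is a minimal illustration, not the code in `searcher.py`: `FunnelWindows` and `plan_funnel` are hypothetical names, and only the numbers (700, 240, 80) and the `from_ + size <= rerank_window` check come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunnelWindows:
    """Window sizes quoted in the commit message (hypothetical container)."""
    coarse_input: int = 700   # coarse_rank.input_window
    coarse_output: int = 240  # coarse_rank.output_window
    fine_output: int = 80     # fine_rank.output_window
    rerank_window: int = 80   # rerank_window


def plan_funnel(from_: int, size: int, w: FunnelWindows = FunnelWindows()) -> dict:
    """Decide whether a page request goes through the multi-stage funnel."""
    if from_ + size > w.rerank_window:
        # The page reaches past the head window: plain ES paging, no funnel.
        return {"use_funnel": False, "es_from": from_, "es_size": size}
    return {
        "use_funnel": True,
        "es_from": 0,
        "es_size": w.coarse_input,            # ES fetches the coarse-rank input
        "coarse_keep": w.coarse_output,
        "fine_keep": w.fine_output,
        "page_slice": (from_, from_ + size),  # applied after final rerank
    }


print(plan_funnel(0, 20))   # inside the head window: full funnel
print(plan_funnel(70, 20))  # 70 + 20 > 80: falls back to plain ES paging
```

Returning a plan rather than running the stages keeps the decision easy to test on its own; the real `search()` may well interleave the check with other setup.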
Showing 10 changed files with 1421 additions and 595 deletions
config/config.yaml
docs/issue-2026-03-27-添加粗排精排-第2轮-done-0327.txt
| ... | ... | @@ -20,7 +20,7 @@ config.yaml |
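| 20 | 20 | The pipeline is now: |
| 21 | 21 | ES fetches coarse_rank.input_window hits -> coarse rank trims to coarse_rank.output_window via text/knn fusion -> SKU selection and title suffix -> fine rank calls the lightweight reranker and trims to fine_rank.output_window -> final rerank calls the existing reranker and adds fine_score to the final fusion. The reranker client/provider was also changed to pick a different service_url per service_profile, so fine/final can share the same service code and only run as different instances. |
| 22 | 22 | |
| 23 | -You did use ./.venv/bin/pytest -q tests/test_rerank_client.py tests/test_search_rerank_window.py for verification, and the result was 18 passed. As requested, you also tried ./scripts/service_ctl.sh restart reranker backend for integration testing, but the current environment only has the existing reranker on 6007, the fine port 6009 in the config has no instance behind it yet, and process state was unstable after the restart, so the full online pipeline could not be verified. The key prerequisite before rollout is to bring up the lightweight bge instance and then restart and re-test with the same script. | |
| 23 | +You used ./.venv/bin/pytest -q tests/test_rerank_client.py tests/test_search_rerank_window.py for verification, and the result was 18 passed. As requested, you also tried ./scripts/service_ctl.sh restart reranker backend for integration testing, but the current environment only has the existing reranker on 6007, the fine port 6009 in the config has no instance behind it yet, and process state was unstable after the restart, so the full online pipeline could not be verified. The key prerequisite before rollout is to bring up the lightweight bge instance and then restart and re-test with the same script. | |
| 24 | 24 | |
| 25 | 25 | Therefore, the funnel code needs to be reviewed and optimized further: |
| 26 | 26 | 1. Every Fine Rank field shows N/A. The fine instance also needs to appear in config.yaml and be managed by server_ctl.sh; see how the reranker is managed today. Think carefully and decide which model fine and reranker should each call (both use the same code and only start different instances). This requires looking at the current code and the actual need (there used to be a single reranker; now a fine-rank stage also calls a rerank model), and the partial refactoring that follows from it. | ... | ... |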
| ... | ... | @@ -0,0 +1,98 @@ |
| 1 | +This is the requirement from the previous round of retrieval-quality optimization: | |
| 2 | +References: | |
| 3 | +searcher.py | |
| 4 | +rerank_client.py | |
| 5 | +schema.py | |
| 6 | +es_query_builder.py | |
| 7 | +config.yaml | |
| 8 | +相关性检索优化说明.md | |
| 9 | + | |
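| 10 | +Add one coarse-rank pass and one fine-rank pass between the ES response and rerank. | |
| 11 | +1. ES recall: 600 | |
| 12 | +2. Coarse rank: 600->240. Add coarse-rank settings to the config file, including the input size (set to 700; the number of hits fetched from ES becomes the coarse-rank input size), then add a coarse-rank fusion formula config, modeled on the existing reranker fusion formula but without the rerank-model term. | |
| 13 | +3. The existing SKU selection and reranker title-suffix generation go after coarse rank, because fine rank is also a reranker model (a lightweight one, bge-reranker) and needs that title suffix. | |
| 14 | +4. Fine rank: 240->80, using bge-reranker. Since the reranker can select only one backend, consider how to refactor. Fine rank is now a separate process serving on its own port. Because its logic matches rerank (and any difference should be isolated and made configurable), the code is shared and two instances are started as needed, avoiding duplicated code. | |
| 15 | +5. Rerank: 80, still using the current rerank code and calling its own instance (the one in use today). The returned scores go through the fusion formula and then pagination; the formula follows the current one but adds a fine-rank model score term. | |
| 16 | +For testing, use the same environment as mine and restart the relevant services with ./scripts/service_ctl.sh reranker backend | |
| 17 | + | |
| 18 | +You have completed a first round of changes and wired up the three-stage ranking pipeline; the main changes are in search/searcher.py, search/rerank_client.py, config/schema.py, config/loader.py, config/services_config.py, and config/config.yaml. | |
| 19 | + | |
| 20 | +The pipeline is now: | |
| 21 | +ES fetches coarse_rank.input_window hits -> coarse rank trims to coarse_rank.output_window via text/knn fusion -> SKU selection and title suffix -> fine rank calls the lightweight reranker and trims to fine_rank.output_window -> final rerank calls the existing reranker and adds fine_score to the final fusion. The reranker client/provider was also changed to pick a different service_url per service_profile, so fine/final can share the same service code and only run as different instances. | |
| 22 | + | |
| 23 | +You also refactored the debug display. The result cards and the global debug panel now read and render values by funnel stage, and app.js renders ES recall, coarse rank, fine rank, and final rerank separately. | |
| 24 | +Each result's debug info is now shown by stage: | |
| 25 | +ES recall: rank, ES score, norm score, matched queries. | |
| 26 | +Coarse rank: rank/rank_change, coarse_score, text/knn inputs, text_source/text_translation/text_primary/text_support, text_knn/image_knn, factor. | |
| 27 | +Fine rank: rank/rank_change, fine_score, fine input. | |
| 28 | +Final rerank: rank/rank_change, rerank_score, text/knn score, each factor, fused_score, and the full signals. | |
| 29 | + | |
| 30 | +Please read the funnel-stage code carefully, especially the parts on scoring, re-sorting, and debug-info recording. | |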
| 31 | + | |
| 32 | + | |
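| 33 | +Now, note what needs optimizing: | |
| 34 | +1. The Fine Rank stage does not seem to compute the fusion formula and then re-sort by it; please fix this. | |
| 35 | +2. Review the code from a software-engineering perspective: | |
| 36 | +With the multi-stage ranking funnel added, are data recording, data passing, and the interfaces between stages designed well enough, and what problems exist? | |
| 37 | +Examine this logic from a software-engineering angle and decide whether anything needs reorganizing, cleaning up, or rewriting. | |
| 38 | +3. Improve the information recorded for the Fine Rank and Final Rerank stages: | |
| 39 | +Both stages must show the fusion formula's inputs, key factors, and resulting score. To avoid code bloat, Fine Rank and Final Rerank | |
| 40 | +can each record this in a single string containing the name and value of every fusion term plus the final result. You may also keep the current recording approach; compare which one needs less code and is clearer. | |
| 41 | +Also think carefully about whether the real computation and the recorded information are separated in the current code, and whether there is redundancy or forking. That is not allowed: it hides risk, because a later change to the production logic without a matching change to the debug info leads to inconsistency. | |
| 42 | +Note that recording logic already exists; do not pile on patches. Modify it, or clean it up and rewrite it, rather than adding more, so the code becomes simpler and cleaner and the recorded information matches the real logic. | |
| 43 | + | |
| 44 | + | |
| 45 | +A lot of code is involved, so read patiently. These are all tasks that need deep thought; take your time and leave plenty of room for review and redesign. | |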
| 46 | + | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
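| 52 | +Two stages were added, which brings many more variables. | |
| 53 | +Use the queries below as the evaluation test set for tuning parameters. This round's tuning scope is each item in the fusion formulas | |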
| 54 | +falda negra oficina | |
| 55 | +red fitted tee | |
| 56 | +黒いミディ丈スカート | |
| 57 | +黑色中长半身裙 | |
| 58 | +чёрное летнее платье | |
| 59 | +修身牛仔裤 | |
| 60 | +date night dress | |
| 61 | +vacation outfit dress | |
| 62 | +minimalist top | |
| 63 | + | |
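| 64 | +Think carefully about how to present the important information in these funnel stages, and change the frontend code accordingly. | |
| 65 | +Cover both the overall funnel information and the information specific to each result. | |
| 66 | +I need this information to help tune the fusion formula at each stage. Based on that need, think hard about the design: what to present and how to present it. | |
| 67 | +The existing logic may be refactored and reorganized where appropriate. | |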
| 68 | + | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
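| 72 | +The fine instance also needs to appear in config.yaml and be managed by server_ctl.sh; see how the reranker is managed today. Think carefully and decide which model fine and reranker should each call (both use the same code and only start different instances). This requires looking at the current code and the actual need (there used to be a single reranker; now a fine-rank stage also calls a rerank model), and the partial refactoring that follows from it. | |
| 73 | + | |
| 74 | +1. Every Fine Rank field shows N/A; is it not configured? Fine rank uses bge-reranker and reuses the current reranker model code, but it needs its own service with its own loaded model. | |
| 75 | +2. Ranking Funnel, Fusion Factors, Signal Breakdown | |
| 76 | +Would it be better to merge these and collect, organize, and present the information by funnel stage? | |
| 77 | +ES recall stage: show each Matched Queries score, the ES total score, the normalized score, the rank position, and other key information. | |
| 78 | +Coarse rank: the inputs to the coarse fusion formula, important intermediate results and parameters, the final score, the rank position and how far it moved up or down, and other key information. | |
| 79 | +Fine rank: likewise list the key inputs, intermediate steps, outputs, rank, and position change. | |
| 80 | +reranker: similar | |
| 81 | + | |
| 82 | +Because many stages are involved, one point matters a great deal: do not patch on top of the existing code every time just to reach the goal. Look at how the affected code works today, clean out existing logic where appropriate, and decide how to change it to reach the goal, so the code stays lean and free of redundancy and forking. | |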
| 83 | + | |
| 84 | + | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
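| 88 | +1. The Fine Rank stage does not show the fusion formula's inputs, key factors, or resulting score. To avoid code bloat, Fine Rank and Final Rerank | |
| 89 | +can each record this in a single string containing the name and value of every fusion term plus the final result. | |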
| 90 | + | |
| 91 | + | |
| 92 | + | |
| 93 | + | |
| 94 | + | |
| 95 | + | |
| 96 | + | |
| 97 | + | |
| 98 | + | ... | ... |
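
One way to keep such a summary string from drifting away from the real computation is to build both in the same function, so the score that drives the sort and the text shown in debug share one code path. A minimal sketch, assuming a multiplicative fusion in which each term is `(score + bias) ** exponent`; `FusionTerm` and `fuse` are hypothetical names and the numbers are invented for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class FusionTerm:
    """One named input to the fusion formula (hypothetical structure)."""
    name: str
    score: float
    bias: float
    exponent: float

    def factor(self) -> float:
        base = self.score + self.bias
        if base < 0:
            # A negative base with a fractional exponent would turn complex.
            raise ValueError(f"{self.name}: negative base {base}")
        return base ** self.exponent


def fuse(terms: Iterable[FusionTerm], style_boost: float = 1.0) -> Tuple[float, str]:
    """Return the fused score and a summary built from the same factors."""
    fused = style_boost
    parts = []
    for term in terms:
        factor = term.factor()
        fused *= factor
        parts.append(
            f"{term.name}=({term.score:.4f}+{term.bias:g})^{term.exponent:g}={factor:.4f}"
        )
    parts.append(f"style_boost={style_boost:g}")
    return fused, " * ".join(parts) + f" => {fused:.4f}"


score, summary = fuse([
    FusionTerm("fine", 0.5, 0.5, 2.0),
    FusionTerm("text", 1.0, 1.0, 1.0),
])
print(summary)
```

Because the string is assembled from the very factors that are multiplied, changing the formula changes the record with it; a separately maintained debug builder would not give that guarantee.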
| ... | ... | @@ -0,0 +1,314 @@ |
| 1 | +This is the requirement statement from the previous round of retrieval-quality optimization: | |
| 2 | + | |
| 3 | +Reference files: | |
| 4 | +`searcher.py` | |
| 5 | +`rerank_client.py` | |
| 6 | +`schema.py` | |
| 7 | +`es_query_builder.py` | |
| 8 | +`config.yaml` | |
| 9 | +`相关性检索优化说明.md` | |
| 10 | + | |
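| 11 | +Add a coarse-rank stage and a fine-rank stage between the ES results and rerank. | |
| 12 | + | |
| 13 | +1. ES recall: 600 hits | |
| 14 | +2. Coarse-rank stage: 600 -> 240 hits. | |
| 15 | + Add coarse-rank settings to the config file, including the input size (set to 700, i.e. the number fetched from ES should become the coarse-rank input size). | |
| 16 | + Then add the coarse-rank fusion formula config. Follow the existing rerank fusion formula, but drop the rerank-model term. | |
| 17 | +3. The current rerank SKU selection and title-suffix generation should move after coarse rank, because fine rank is also a rerank model (a lightweight one, `bge-reranker`) and needs the title suffix too. | |
| 18 | +4. Fine-rank stage: 240 -> 80 hits. | |
| 19 | + Use `bge-reranker`; since rerank can currently select only one backend, consider how to refactor. | |
| 20 | + Fine rank should now also be a separate process and service with its own port. | |
| 21 | + Because its logic matches rerank (any differences should be extracted and made configurable), the code should be shared. Start two instances as needed to avoid duplicated code. | |
| 22 | +5. Final rerank: 80 hits. | |
| 23 | + Keep using the current rerank code and call a separate instance (the one in use today). | |
| 24 | + After it returns, apply the fusion formula, then paginate. | |
| 25 | + Follow the current fusion formula here as well, but add one term: the fine-rank model score. | |
| 26 | + | |
| 27 | +For testing, use the same environment as mine and restart the relevant services with: | |
| 28 | +`./scripts/service_ctl.sh reranker backend` | |
| 29 | + | |
| 30 | +You have completed the overall change and wired the three-stage ranking pipeline together. | |
| 31 | +The main changes are in: | |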
| 32 | +`search/searcher.py` | |
| 33 | +`search/rerank_client.py` | |
| 34 | +`config/schema.py` | |
| 35 | +`config/loader.py` | |
| 36 | +`config/services_config.py` | |
| 37 | +and `config/config.yaml`. | |
| 38 | + | |
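| 39 | +The flow is now: | |
| 40 | + | |
| 41 | +ES fetches `coarse_rank.input_window` hits -> | |
| 42 | +coarse rank trims to `coarse_rank.output_window` via text/KNN fusion -> | |
| 43 | +then SKU selection and title-suffix handling -> | |
| 44 | +fine rank calls the lightweight reranker and trims to `fine_rank.output_window` -> | |
| 45 | +final rerank calls the existing reranker, and the final fusion also includes `fine_score`. | |
| 46 | + | |
| 47 | +The rerank client/provider now also selects a different `service_url` by `service_profile`, so fine rank and final rerank can share the same service code and only run as different instances. | |
| 48 | + | |
| 49 | +You also refactored the debug display. | |
| 50 | +You changed the result cards and the global debug panel to read and render values by funnel stage; `app.js` now renders ES recall, coarse rank, fine rank, and final rerank separately. | |
| 51 | + | |
| 52 | +Each result's debug info is now shown by stage: | |
| 53 | + | |
| 54 | +* ES recall: `rank`, ES score, normalized score, matched queries | |
| 55 | +* Coarse rank: `rank` / `rank_change`, `coarse_score`, text/KNN inputs, `text_source` / `text_translation` / `text_primary` / `text_support`, `text_knn` / `image_knn`, `factor` | |
| 56 | +* Fine rank: `rank` / `rank_change`, `fine_score`, `fine input` | |
| 57 | +* Final rerank: `rank` / `rank_change`, `rerank_score`, text/KNN scores, each factor, `fused_score`, and the full signals | |
| 58 | + | |
| 59 | +Please read the code for these funnel stages carefully, especially the parts that handle scoring, re-sorting, and debug-info recording. | |
| 60 | + | |
| 61 | +Now, note what needs optimizing: | |
| 62 | + | |
| 63 | +1. The fine-rank stage does not seem to compute the fusion formula and re-sort by it. Please fix this. | |
| 64 | +2. Review the code from a software-engineering perspective: | |
| 65 | + Now that a multi-stage ranking funnel exists, are data recording, data passing, and the interaction interfaces designed well enough? What problems exist? | |
| 66 | + Examine this logic from a software-engineering angle and decide whether any part needs reorganizing, cleaning up, or rewriting. | |
| 67 | +3. Improve the information recorded for the fine-rank and final-rerank stages: | |
| 68 | + Both stages should show the fusion formula's inputs, key factors, and computed score. | |
| 69 | + To avoid code bloat, fine rank and final rerank can each record this in a single string containing the name and value of every fusion term plus the final result. | |
| 70 | + You may also keep the current recording approach; compare which one needs less code and is clearer. | |
| 71 | + Also think carefully about the current code: are the actual computation and the recorded information separated? Is there redundancy or divergence? | |
| 72 | + That is undesirable because it introduces latent risk: if the production logic is changed later without updating the debug info, the two become inconsistent. | |
| 73 | + | |
| 74 | +Pay special attention: recording logic already exists. Do not just stack patches on top of it. | |
| 75 | +Modify it where appropriate, or clean it up and rewrite it, rather than only adding code. | |
| 76 | +The goal is simpler, cleaner code whose recorded information always matches the actual logic. | |
| 77 | + | |
| 78 | +A lot of code is involved, so read patiently. | |
| 79 | +All of the tasks above need deep thought. Take your time and leave enough room for a thorough review and redesign. | |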
| 80 | + | |
| 81 | + | |
| 82 | + | |
| 83 | + | |
| 84 | + | |
| 85 | + | |
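| 86 | +**The big picture** | |
| 87 | +This pipeline can now be read as a funnel that recalls broadly first, then narrows layer by layer while adding more expensive signals at each layer: | |
| 88 | + | |
| 89 | +1. Query parsing | |
| 90 | +2. ES recall | |
| 91 | +3. Coarse rank: uses only ES-internal text/KNN signals | |
| 92 | +4. Style SKU selection + title suffix | |
| 93 | +5. Fine rank: lightweight reranker + text/KNN fusion | |
| 94 | +6. Final rerank: heavy reranker + fine score + text/KNN fusion | |
| 95 | +7. Pagination, field backfill, response formatting | |
| 96 | + | |
| 97 | +The orchestration code is in [searcher.py](/data/saas-search/search/searcher.py), scoring and rerank details are in [rerank_client.py](/data/saas-search/search/rerank_client.py), and the config definitions are in [schema.py](/data/saas-search/config/schema.py) and [config.yaml](/data/saas-search/config/config.yaml). | |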
| 98 | + | |
| 99 | +**First, how the entry point picks a path** | |
| 100 | +Starting at [searcher.py:348](/data/saas-search/search/searcher.py#L348), `search()` first reads the tenant language, feature switches, and window sizes. | |
| 101 | +The key decision is at [searcher.py:364](/data/saas-search/search/searcher.py#L364) through [searcher.py:372](/data/saas-search/search/searcher.py#L372): | |
| 102 | + | |
| 103 | +- `rerank_window` is currently 80, see [config.yaml:256](/data/saas-search/config/config.yaml#L256) | |
| 104 | +- `coarse_rank.input_window` is 700 and `output_window` is 240, see [config.yaml:231](/data/saas-search/config/config.yaml#L231) | |
| 105 | +- `fine_rank.input_window` is 240 and `output_window` is 80, see [config.yaml:245](/data/saas-search/config/config.yaml#L245) | |
| 106 | + | |
| 107 | +So a request that satisfies `from_ + size <= rerank_window` goes through the full funnel: | |
| 108 | +- ES actually fetches the top `700` | |
| 109 | +- coarse rank keeps `240` | |
| 110 | +- fine rank keeps `80` | |
| 111 | +- final rerank also processes only those `80` | |
| 112 | +- the page slice is taken last | |
| 113 | + | |
| 114 | +If the requested page extends past 80, the later multi-stage funnel is skipped and the results come back in plain ES order. | |
| 115 | + | |
| 116 | +This matters a great deal, because it is what ensures that the expensive models serve only the head of the result list. | |
| 117 | + | |
| 118 | +**Step 1: Query parsing** | |
| 119 | +At [searcher.py:432](/data/saas-search/search/searcher.py#L432) through [searcher.py:469](/data/saas-search/search/searcher.py#L469): | |
| 120 | +`query_parser.parse()` does several things: | |
| 121 | + | |
| 122 | +- normalizes the query | |
| 123 | +- detects the language | |
| 124 | +- may rewrite the query | |
| 125 | +- generates the text embedding | |
| 126 | +- for image search, also carries the image embedding | |
| 127 | +- produces translations | |
| 128 | +- identifies style intent | |
| 129 | + | |
| 130 | +The result of this step is stored in `parsed_query`; the ES query, style SKU selection, and fine/final rerank all depend on it. | |
| 131 | + | |
| 132 | +**Step 2: ES query construction** | |
| 133 | +The ES DSL is built starting at [searcher.py:471](/data/saas-search/search/searcher.py#L471), via `build_query()` at [es_query_builder.py:181](/data/saas-search/search/es_query_builder.py#L181). | |
| 134 | + | |
| 135 | +The core structure is: | |
| 136 | +- a text recall clause | |
| 137 | +- a text-vector KNN clause | |
| 138 | +- an image-vector KNN clause | |
| 139 | +- all placed together in `bool.should` | |
| 140 | +- filter conditions go in `filter` | |
| 141 | +- multi-select facet conditions go through `post_filter` | |
| 142 | + | |
| 143 | +The KNN part follows [es_query_builder.py:250](/data/saas-search/search/es_query_builder.py#L250): | |
| 144 | +- the text-vector clause is always named `knn_query` | |
| 145 | +- the image-vector clause is always named `image_knn_query` | |
| 146 | + | |
| 147 | +On the text recall side, the downstream fusion code by convention reads: | |
| 148 | +- the named query for the original query: `base_query` | |
| 149 | +- the named queries for translated queries: `base_query_trans_*` | |
| 150 | + | |
| 151 | +In other words, coarse rank, fine rank, and final rerank do not reinterpret the ES score; they pull these named sub-signals out of `matched_queries` and recompute on their own. | |
| 152 | + | |
| 153 | +**Step 3: ES recall** | |
| 154 | +At [searcher.py:579](/data/saas-search/search/searcher.py#L579) through [searcher.py:627](/data/saas-search/search/searcher.py#L627). | |
| 155 | + | |
| 156 | +There is a key engineering optimization here: | |
| 157 | +inside the rerank window, the first ES fetch turns `_source` off and retrieves only the signals needed for ranking, see [searcher.py:517](/data/saas-search/search/searcher.py#L517) through [searcher.py:523](/data/saas-search/search/searcher.py#L523). | |
| 158 | + | |
| 159 | +The reasons: | |
| 160 | +- coarse rank initially needs only `_score` and `matched_queries` | |
| 161 | +- there is no need to pull full product details for all 700 hits up front | |
| 162 | +- once coarse rank has narrowed the set, the fields needed by fine/final rerank are backfilled | |
| 163 | + | |
| 164 | +This is a central performance design point of the current pipeline. | |
| 165 | + | |
| 166 | +**Step 4: Coarse rank** | |
| 167 | +The coarse-rank entry point is at [searcher.py:638](/data/saas-search/search/searcher.py#L638); the actual scoring is in `coarse_resort_hits()` at [rerank_client.py:348](/data/saas-search/search/rerank_client.py#L348). | |
| 168 | + | |
| 169 | +Coarse rank looks at only two kinds of signal: | |
| 170 | +- `text_score` | |
| 171 | +- `knn_score` | |
| 172 | + | |
| 173 | +Both come from the shared helper `_build_hit_signal_bundle()`, see [rerank_client.py:246](/data/saas-search/search/rerank_client.py#L246). | |
| 174 | + | |
| 175 | +How the text score is derived, see [rerank_client.py:200](/data/saas-search/search/rerank_client.py#L200): | |
| 176 | +- `source_score = matched_queries["base_query"]` | |
| 177 | +- `translation_score = max(base_query_trans_*)` | |
| 178 | +- `weighted_translation = 0.8 * translation_score` | |
| 179 | +- `primary_text = max(source, weighted_translation)` | |
| 180 | +- `support_text = the other branch` | |
| 181 | +- `text_score = primary_text + 0.25 * support_text` | |
| 182 | + | |
| 183 | +This is a text dismax approach: | |
| 184 | +the original query usually leads (the translation is down-weighted to 0.8 first), the weaker branch only adds support, and the two are not simply summed. | |
| 185 | + | |
| 186 | +How the vector score is derived, see [rerank_client.py:156](/data/saas-search/search/rerank_client.py#L156): | |
| 187 | +- `text_knn_score` | |
| 188 | +- `image_knn_score` | |
| 189 | +- each is multiplied by its own weight | |
| 190 | +- the stronger branch becomes the primary | |
| 191 | +- the weaker branch contributes as support, scaled by `knn_tie_breaker` | |
| 192 | + | |
| 193 | +The coarse-rank fusion formula is then at [rerank_client.py:334](/data/saas-search/search/rerank_client.py#L334): | |
| 194 | +- `coarse_score = (text_score + text_bias)^text_exponent * (knn_score + knn_bias)^knn_exponent` | |
| 195 | + | |
| 196 | +The config is defined in [schema.py:124](/data/saas-search/config/schema.py#L124) and [config.yaml:231](/data/saas-search/config/config.yaml#L231). | |
| 197 | + | |
| 198 | +After scoring: | |
| 199 | +- `hit["_coarse_score"]` is written | |
| 200 | +- hits are sorted by `_coarse_score` | |
| 201 | +- the top 240 are kept, see [searcher.py:645](/data/saas-search/search/searcher.py#L645) | |
| 202 | + | |
| 203 | +**Step 5: Field backfill + SKU selection after coarse rank** | |
| 204 | +After coarse rank, `searcher` derives from the doc template which `_source` fields fine/final rerank need and backfills only those, see [searcher.py:669](/data/saas-search/search/searcher.py#L669). | |
| 205 | + | |
| 206 | +Style SKU selection happens only after that, see [searcher.py:696](/data/saas-search/search/searcher.py#L696). | |
| 207 | + | |
| 208 | +Why here? | |
| 209 | +Because fine rank is now also a reranker and consumes the title suffix too. | |
| 210 | +The suffix is written onto the hit as `_style_rerank_suffix` after SKU selection. | |
| 211 | +The suffix is actually appended to the doc text at [rerank_client.py:65](/data/saas-search/search/rerank_client.py#L65) through [rerank_client.py:74](/data/saas-search/search/rerank_client.py#L74). | |
| 212 | + | |
| 213 | +So the order has to be: | |
| 214 | +- coarse rank first | |
| 215 | +- then SKU selection | |
| 216 | +- then fine/final rerank on the title with the suffix | |
| 217 | + | |
| 218 | +**Step 6: Fine rank** | |
| 219 | +The entry point is at [searcher.py:711](/data/saas-search/search/searcher.py#L711); the implementation is `run_lightweight_rerank()` at [rerank_client.py:603](/data/saas-search/search/rerank_client.py#L603). | |
| 220 | + | |
| 221 | +It does three things: | |
| 222 | + | |
| 223 | +1. uses `build_docs_from_hits()` to turn each product into reranker input text | |
| 224 | +2. calls the lightweight service with `service_profile="fine"` | |
| 225 | +3. sorts by the fused `_fine_fused_score` rather than by `fine_score` alone | |
| 226 | + | |
| 227 | +The fine-rank fusion formula is now: | |
| 228 | +- `fine_stage_score = fine_factor * text_factor * knn_factor * style_boost` | |
| 229 | + | |
| 230 | +The shared computation is `_compute_multiplicative_fusion()` at [rerank_client.py:286](/data/saas-search/search/rerank_client.py#L286): | |
| 231 | +- `fine_factor = (fine_score + fine_bias)^fine_exponent` | |
| 232 | +- `text_factor = (text_score + text_bias)^text_exponent` | |
| 233 | +- `knn_factor = (knn_score + knn_bias)^knn_exponent` | |
| 234 | +- if the selected SKU is matched, the style boost is multiplied in as well | |
| 235 | + | |
| 236 | +The fields written back to the hit, see [rerank_client.py:655](/data/saas-search/search/rerank_client.py#L655): | |
| 237 | +- `_fine_score` | |
| 238 | +- `_fine_fused_score` | |
| 239 | +- `_text_score` | |
| 240 | +- `_knn_score` | |
| 241 | + | |
| 242 | +The sort logic is at [rerank_client.py:683](/data/saas-search/search/rerank_client.py#L683): | |
| 243 | +hits are sorted by `_fine_fused_score` in descending order and the top 80 are kept, see [searcher.py:727](/data/saas-search/search/searcher.py#L727). | |
| 244 | + | |
| 245 | +This is the point you cared about most this round: fine rank no longer sorts by the bare model score but by the model score fused with the ES text/KNN signals. | |
| 246 | + | |
| 247 | +**Step 7: Final rerank** | |
| 248 | +The entry point is at [searcher.py:767](/data/saas-search/search/searcher.py#L767); the implementation is `run_rerank()` at [rerank_client.py:538](/data/saas-search/search/rerank_client.py#L538). | |
| 249 | + | |
| 250 | +It closely resembles fine rank but adds a heavier model score, `rerank_score`. | |
| 251 | +The final formula is: | |
| 252 | + | |
| 253 | +- `final_score = rerank_factor * fine_factor * text_factor * knn_factor * style_boost` | |
| 254 | + | |
| 255 | +That is: | |
| 256 | +- the `fine_score` produced by fine rank is not discarded | |
| 257 | +- at final rerank it keeps participating in the final fusion as a multiplicative term | |
| 258 | + | |
| 259 | +This logic is at [rerank_client.py:468](/data/saas-search/search/rerank_client.py#L468) through [rerank_client.py:476](/data/saas-search/search/rerank_client.py#L476). | |
| 260 | + | |
| 261 | +After scoring, these are written: | |
| 262 | +- `_rerank_score` | |
| 263 | +- `_fused_score` | |
| 264 | + | |
| 265 | +Hits are then sorted by `_fused_score`, see [rerank_client.py:531](/data/saas-search/search/rerank_client.py#L531). | |
| 266 | + | |
| 267 | +You can think of it this way: | |
| 268 | +- fine rank does a fast lightweight pass that cuts 240 down to 80 | |
| 269 | +- final rerank makes the final call with the more expensive model | |
| 270 | +- the final call does not ignore the fine-rank result; the fine score is kept as a prior signal | |
| 271 | + | |
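| 272 | +**Step 8: Pagination and field backfill** | |
| 273 | +Multi-stage ranking is done only inside the head window. | |
| 274 | +Before results are returned to the user, two more things happen after [searcher.py:828](/data/saas-search/search/searcher.py#L828): | |
| 275 | + | |
| 276 | +- the final 80 are sliced by `from_:from_+size` | |
| 277 | +- the fields the page actually displays are backfilled according to the user's original `_source` request, see [searcher.py:859](/data/saas-search/search/searcher.py#L859) | |
| 278 | + | |
| 279 | +So the pipeline makes three data accesses, each with a different purpose: | |
| 280 | + | |
| 281 | +- first ES query: ranking signals only | |
| 282 | +- second, backfill by id: only the fields fine/final rerank need | |
| 283 | +- third, backfill by page ids: only the fields the final page displays | |
| 284 | + | |
| 285 | +This is why it performs better than pulling 700 full documents in a single fetch. | |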
| 286 | + | |
| 287 | +**Step 9: Result formatting and the debug funnel** | |
| 288 | +Result processing finally starts at [searcher.py:906](/data/saas-search/search/searcher.py#L906). | |
| 289 | +Here each product's per-stage information is assembled into `ranking_funnel`, see [searcher.py:1068](/data/saas-search/search/searcher.py#L1068): | |
| 290 | + | |
| 291 | +- `es_recall` | |
| 292 | +- `coarse_rank` | |
| 293 | +- `fine_rank` | |
| 294 | +- `rerank` | |
| 295 | +- `final_page` | |
| 296 | + | |
| 297 | +Where: | |
| 298 | +- the coarse stage mainly keeps the split text/translation/knn signals | |
| 299 | +- the fine/rerank stages now both keep `fusion_inputs`, `fusion_factors`, and `fusion_summary` | |
| 300 | +- `fusion_summary` comes from the real computation itself, see [rerank_client.py:265](/data/saas-search/search/rerank_client.py#L265) | |
| 301 | + | |
| 302 | +This matters because the actual ranking logic and the debug display logic now share one source instead of being written twice. | |
| 303 | + | |
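| 304 | +**One-line summary of the pipeline** | |
| 305 | +In essence: | |
| 306 | + | |
| 307 | +- ES handles cheap, wide-range recall | |
| 308 | +- coarse rank does a first structured filter using only ES built-in signals | |
| 309 | +- style SKU selection reshapes the product text into input that a reranker understands better | |
| 310 | +- fine rank uses the light model to compress the candidate set further | |
| 311 | +- final rerank uses the heavy model to make the final decision | |
| 312 | +- each layer reuses the previous layer's signals as far as possible instead of starting over | |
| 313 | + | |
| 314 | +If you like, my next step can be a walk-through of a concrete query's real flow: say the user searches `black dress`, and I work through `parsed_query`, the ES named queries, and how each coarse/fine/final score is produced, by hand from start to finish. | |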
| 0 | 315 | \ No newline at end of file | ... | ... |
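
The text-score and coarse-score steps from Step 4 can be sketched as below. The 0.8 and 0.25 weights, the named-query keys, and the shape of the coarse formula come from the description; the function names, the bias and exponent values in the example, and the treatment of a missing named query as 0.0 are assumptions, so this is not the code in `rerank_client.py`.

```python
from typing import Mapping


def text_score_from_matched(
    matched_queries: Mapping[str, float],
    translation_weight: float = 0.8,
    support_weight: float = 0.25,
) -> float:
    """Dismax-style text score built from ES named-query scores."""
    source = float(matched_queries.get("base_query", 0.0))
    translation = max(
        (float(v) for k, v in matched_queries.items()
         if k.startswith("base_query_trans_")),
        default=0.0,
    )
    weighted_translation = translation_weight * translation
    primary = max(source, weighted_translation)
    support = min(source, weighted_translation)  # "the other branch"
    return primary + support_weight * support


def coarse_score(
    text_score: float,
    knn_score: float,
    *,
    text_bias: float,
    text_exponent: float,
    knn_bias: float,
    knn_exponent: float,
) -> float:
    """(text + bias)^exp * (knn + bias)^exp, as in the coarse formula."""
    return (text_score + text_bias) ** text_exponent * (knn_score + knn_bias) ** knn_exponent


# Illustrative values only; real biases and exponents live in config.yaml.
matched = {"base_query": 2.0, "base_query_trans_en": 5.0}
text = text_score_from_matched(matched)
print(text)
print(coarse_score(text, 0.5, text_bias=0.5, text_exponent=1.0,
                   knn_bias=0.5, knn_exponent=2.0))
```

In this example the translated query outscores the original even after the 0.8 down-weighting, so it becomes the primary branch and the original query contributes only a quarter of its score.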
docs/常用查询 - ES.md
| 1 | 1 | |
| 2 | 2 | |
| 3 | -# List all tenant indices | |
| 4 | - curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/_cat/indices/search_products_tenant_*?v' | |
| 3 | +## Elasticsearch troubleshooting workflow | |
| 5 | 4 | |
| 6 | -# ====================================== | |
| 7 | -# Tenant-related | |
| 8 | -# ====================================== | |
| 9 | -# | |
| 10 | -# Note: indices are split per tenant as search_products_tenant_{tenant_id}, | |
| 11 | -# so queries normally do not need an extra tenant_id filter (it may be kept for troubleshooting). | |
| 5 | +### 1. Cluster health | |
| 6 | + | |
| 7 | +```bash | |
| 8 | +# Overall cluster health (green / yellow / red) | |
| 9 | +curl -s -u 'saas:4hOaLaf41y2VuI8y' 'http://127.0.0.1:9200/_cluster/health?pretty' | |
| 10 | +``` | |
| 11 | + | |
| 12 | +### 2. Index overview | |
| 13 | + | |
| 14 | +```bash | |
| 15 | +# Status and size of all tenant indices | |
| 16 | +curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/_cat/indices/search_products_tenant_*?v' | |
| 17 | + | |
| 18 | +# Or list every index | |
| 19 | +curl -s -u 'saas:4hOaLaf41y2VuI8y' 'http://127.0.0.1:9200/_cat/indices?v' | |
| 20 | +``` | |
| 21 | + | |
| 22 | +### 3. Shard distribution | |
| 23 | + | |
| 24 | +```bash | |
| 25 | +# How shards are distributed across nodes | |
| 26 | +curl -s -u 'saas:4hOaLaf41y2VuI8y' 'http://127.0.0.1:9200/_cat/shards?v' | |
| 27 | +``` | |
| 28 | + | |
| 29 | +### 4. Allocation diagnosis (if something is wrong) | |
| 30 | + | |
| 31 | +```bash | |
| 32 | +# When health is not green or shard state is abnormal, find the specific cause | |
| 33 | +curl -s -u 'saas:4hOaLaf41y2VuI8y' -X POST 'http://127.0.0.1:9200/_cluster/allocation/explain?pretty' \ | |
| 34 | + -H 'Content-Type: application/json' \ | |
| 35 | + -d '{"index":"search_products_tenant_163","shard":0,"primary":true}' | |
| 36 | +``` | |
| 37 | + | |
| 38 | +> Typical finding: `disk_threshold`: disk usage is above the high watermark, so new shards cannot be allocated. | |
| 39 | + | |
| 40 | +### 5. System-level checks | |
| 41 | + | |
| 42 | +```bash | |
| 43 | +# Service status | |
| 44 | +sudo systemctl status elasticsearch | |
| 45 | + | |
| 46 | +# Disk space | |
| 47 | +df -h | |
| 48 | + | |
| 49 | +# Size of the ES data directory | |
| 50 | +du -sh /var/lib/elasticsearch/ | |
| 51 | +``` | |
| 52 | + | |
| 53 | +### 6. Configuration and logs | |
| 54 | + | |
| 55 | +```bash | |
| 56 | +# Configuration file | |
| 57 | +cat /etc/elasticsearch/elasticsearch.yml | |
| 58 | + | |
| 59 | +# Live logs | |
| 60 | +journalctl -u elasticsearch -f | |
| 61 | +``` | |
| 62 | + | |
| 63 | +--- | |
| 64 | + | |
| 65 | +### Quick troubleshooting path | |
| 66 | + | |
| 67 | +``` | |
| 68 | +_cluster/health → confirm cluster status (green/yellow/red) | |
| 69 | + ↓ | |
| 70 | +_cat/indices → check index size and status | |
| 71 | + ↓ | |
| 72 | +_cat/shards → check shard distribution | |
| 73 | + ↓ | |
| 74 | +_cluster/allocation/explain → locate allocation problems (if needed) | |
| 75 | + ↓ | |
| 76 | +systemctl / df / logs → verify at the system level | |
| 77 | +``` | |
| 78 | + | |
| 79 | +--- | |
| 80 | +Below are the Elasticsearch queries you provided, organized as a Markdown document: | |
| 81 | + | |
| 82 | +--- | |
| 83 | + | |
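| 84 | +# Elasticsearch query collection | |
| 85 | + | |
| 86 | +## Tenant-related | |
| 87 | + | |
| 88 | +> **Note**: indices are split per tenant as `search_products_tenant_{tenant_id}`, so queries normally do not need an extra `tenant_id` filter (it may be kept for troubleshooting). | |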
| 89 | + | |
| 90 | +--- | |
| 12 | 91 | |
| 13 | 92 | ### 1. Query by tenant_id / spu_id |
| 14 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 93 | + | |
| 94 | +#### Query a product by spu_id (returns title) | |
| 95 | +```bash | |
| 96 | +curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 15 | 97 | "size": 11, |
| 16 | 98 | "_source": ["title"], |
| 17 | 99 | "query": { |
| 18 | - "bool": { | |
| 19 | - "filter": [ | |
| 20 | - { "term": {"spu_id" : 206150} } | |
| 21 | - ] | |
| 22 | - } | |
| 100 | + "bool": { | |
| 101 | + "filter": [ | |
| 102 | + { "term": {"spu_id" : 206150} } | |
| 103 | + ] | |
| 104 | + } | |
| 23 | 105 | } |
| 24 | - }' | |
| 25 | - | |
| 26 | - | |
| 27 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 28 | - "size": 100, | |
| 29 | - "_source": ["title"], | |
| 30 | - "query": { | |
| 31 | - "match_all": {} | |
| 32 | - } | |
| 33 | 106 | }' |
| 107 | +``` | |
| 34 | 108 | |
| 35 | - | |
| 36 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 37 | - "size": 5, | |
| 38 | - "_source": ["title", "keywords", "tags"], | |
| 39 | - "query": { | |
| 40 | - "bool": { | |
| 41 | - "filter": [ | |
| 42 | - { "term": { "spu_id": "223167" } } | |
| 43 | - ] | |
| 109 | +#### Query all products (returns title) | |
| 110 | +```bash | |
| 111 | +curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 112 | + "size": 100, | |
| 113 | + "_source": ["title"], | |
| 114 | + "query": { | |
| 115 | + "match_all": {} | |
| 44 | 116 | } |
| 45 | - } | |
| 46 | 117 | }' |
| 118 | +``` | |
| 47 | 119 | |
| 48 | - | |
| 49 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 50 | - "size": 1, | |
| 51 | - "_source": ["title", "keywords", "tags"], | |
| 52 | - "query": { | |
| 53 | - "bool": { | |
| 54 | - "must": [ | |
| 55 | - { | |
| 56 | - "match": { | |
| 57 | - "title.en": { | |
| 58 | - "query": "Floerns Women Gothic Graphic Ribbed Strapless Tube Top Asymmetrical Ruched Bandeau Tops" | |
| 59 | - } | |
| 60 | - } | |
| 120 | +#### Query a product by spu_id (returns title, keywords, tags) | |
| 121 | +```bash | |
| 122 | +curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 123 | + "size": 5, | |
| 124 | + "_source": ["title", "keywords", "tags"], | |
| 125 | + "query": { | |
| 126 | + "bool": { | |
| 127 | + "filter": [ | |
| 128 | + { "term": { "spu_id": "223167" } } | |
| 129 | + ] | |
| 61 | 130 | } |
| 62 | - ], | |
| 63 | - "filter": [ | |
| 64 | - { "terms": { "tags": ["女装", "派对"] } } | |
| 65 | - ] | |
| 66 | 131 | } |
| 67 | - } | |
| 68 | 132 | }' |
| 133 | +``` | |
| 69 | 134 | |
| 135 | +#### Combined query: match title + filter by tags | |
| 136 | +```bash | |
| 137 | +curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 138 | + "size": 1, | |
| 139 | + "_source": ["title", "keywords", "tags"], | |
| 140 | + "query": { | |
| 141 | + "bool": { | |
| 142 | + "must": [ | |
| 143 | + { | |
| 144 | + "match": { | |
| 145 | + "title.en": { | |
| 146 | + "query": "Floerns Women Gothic Graphic Ribbed Strapless Tube Top Asymmetrical Ruched Bandeau Tops" | |
| 147 | + } | |
| 148 | + } | |
| 149 | + } | |
| 150 | + ], | |
| 151 | + "filter": [ | |
| 152 | + { "terms": { "tags": ["女装", "派对"] } } | |
| 153 | + ] | |
| 154 | + } | |
| 155 | + } | |
| 156 | +}' | |
| 157 | +``` | |
| 70 | 158 | |
| 71 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 159 | +#### Combined query: match title + filter by tenant (redundant example) | |
| 160 | +```bash | |
| 161 | +curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 72 | 162 | "size": 1, |
| 73 | 163 | "_source": ["title"], |
| 74 | 164 | "query": { |
| 75 | - "bool": { | |
| 76 | - "must": [ | |
| 77 | - { | |
| 78 | - "match": { | |
| 79 | - "title.en": { | |
| 80 | - "query": "Floerns Women Gothic Graphic Ribbed Strapless Tube Top Asymmetrical Ruched Bandeau Tops" | |
| 81 | - } | |
| 82 | - } | |
| 83 | - } | |
| 84 | - ], | |
| 85 | - "filter": [ | |
| 86 | - { "term": { "tenant_id": "170" } } | |
| 87 | - ] | |
| 88 | - } | |
| 165 | + "bool": { | |
| 166 | + "must": [ | |
| 167 | + { | |
| 168 | + "match": { | |
| 169 | + "title.en": { | |
| 170 | + "query": "Floerns Women Gothic Graphic Ribbed Strapless Tube Top Asymmetrical Ruched Bandeau Tops" | |
| 171 | + } | |
| 172 | + } | |
| 173 | + } | |
| 174 | + ], | |
| 175 | + "filter": [ | |
| 176 | + { "term": { "tenant_id": "170" } } | |
| 177 | + ] | |
| 178 | + } | |
| 89 | 179 | } |
| 90 | 180 | }' |
| 181 | +``` | |
| 91 | 182 | |
| 92 | -Curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_analyze' -H 'Content-Type: application/json' -d '{ | |
| 93 | - "analyzer": "index_ik", | |
| 94 | - "text": "14寸第4代-眼珠实身冰雪公仔带手动大推车,搪胶雪宝宝" | |
| 95 | -}' | |
| 183 | +--- | |
| 184 | + | |
| 185 | +### 2. Analyzer tests | |
| 96 | 186 | |
| 97 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_analyze' -H 'Content-Type: application/json' -d '{ | |
| 98 | - "analyzer": "query_ik", | |
| 99 | - "text": "14寸第4代-眼珠实身冰雪公仔带手动大推车,搪胶雪宝宝" | |
| 187 | +#### Test the index_ik analyzer | |
| 188 | +```bash | |
| 189 | +curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_analyze' -H 'Content-Type: application/json' -d '{ | |
| 190 | + "analyzer": "index_ik", | |
| 191 | + "text": "14寸第4代-眼珠实身冰雪公仔带手动大推车,搪胶雪宝宝" | |
| 100 | 192 | }' |
| 193 | +``` | |
| 101 | 194 | |
| 102 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 103 | - "size": 100, | |
| 104 | - "from": 0, | |
| 105 | - "query": { | |
| 106 | - "bool": { | |
| 107 | - "must": [ | |
| 108 | - { | |
| 109 | - "multi_match": { | |
| 110 | - "_name": "base_query", | |
| 111 | - "fields": [ | |
| 112 | - "title.zh^3.0", | |
| 113 | - "brief.zh^1.5", | |
| 114 | - "description.zh", | |
| 115 | - "vendor.zh^1.5", | |
| 116 | - "tags", | |
| 117 | - "category_path.zh^1.5", | |
| 118 | - "category_name_text.zh^1.5", | |
| 119 | - "option1_values^0.5" | |
| 120 | - ], | |
| 121 | - "minimum_should_match": "75%", | |
| 122 | - "operator": "AND", | |
| 123 | - "query": "裙", | |
| 124 | - "tie_breaker": 0.9 | |
| 125 | - } | |
| 126 | - } | |
| 127 | - ], | |
| 128 | - "filter": [ | |
| 129 | - { | |
| 130 | - "match_all": {} | |
| 131 | - } | |
| 132 | - ] | |
| 133 | - } | |
| 134 | - } | |
| 195 | +#### Test the query_ik analyzer | |
| 196 | +```bash | |
| 197 | +curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_analyze' -H 'Content-Type: application/json' -d '{ | |
| 198 | + "analyzer": "query_ik", | |
| 199 | + "text": "14寸第4代-眼珠实身冰雪公仔带手动大推车,搪胶雪宝宝" | |
| 135 | 200 | }' |
| 201 | +``` | |
| 136 | 202 | |
| 137 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 138 | - "size": 1, | |
| 139 | - "from": 0, | |
| 140 | - "query": { | |
| 141 | - "bool": { | |
| 142 | - "must": [ | |
| 143 | - { | |
| 144 | - "multi_match": { | |
| 145 | - "_name": "base_query", | |
| 146 | - "fields": [ | |
| 147 | - "title.zh^3.0", | |
| 148 | - "brief.zh^1.5", | |
| 149 | - "description.zh", | |
| 150 | - "vendor.zh^1.5", | |
| 151 | - "tags", | |
| 152 | - "category_path.zh^1.5", | |
| 153 | - "category_name_text.zh^1.5", | |
| 154 | - "option1_values^0.5" | |
| 203 | +--- | |
| 204 | + | |
| 205 | +### 3. Multi-field search + aggregations (combined facet example) | |
| 206 | + | |
| 207 | +#### Multi-field match + aggregations (category1, color, size, material) | |
| 208 | +```bash | |
| 209 | +curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 210 | + "size": 1, | |
| 211 | + "from": 0, | |
| 212 | + "query": { | |
| 213 | + "bool": { | |
| 214 | + "must": [ | |
| 215 | + { | |
| 216 | + "multi_match": { | |
| 217 | + "_name": "base_query", | |
| 218 | + "fields": [ | |
| 219 | + "title.zh^3.0", | |
| 220 | + "brief.zh^1.5", | |
| 221 | + "description.zh", | |
| 222 | + "vendor.zh^1.5", | |
| 223 | + "tags", | |
| 224 | + "category_path.zh^1.5", | |
| 225 | + "category_name_text.zh^1.5", | |
| 226 | + "option1_values^0.5" | |
| 227 | + ], | |
| 228 | + "minimum_should_match": "75%", | |
| 229 | + "operator": "AND", | |
| 230 | + "query": "裙", | |
| 231 | + "tie_breaker": 0.9 | |
| 232 | + } | |
| 233 | + } | |
| 155 | 234 | ], |
| 156 | - "minimum_should_match": "75%", | |
| 157 | - "operator": "AND", | |
| 158 | - "query": "裙", | |
| 159 | - "tie_breaker": 0.9 | |
| 160 | - } | |
| 161 | - } | |
| 162 | - ], | |
| 163 | - "filter": [ | |
| 164 | - { "match_all": {} } | |
| 165 | - ] | |
| 166 | - } | |
| 167 | - }, | |
| 168 | - "aggs": { | |
| 169 | - "category1_name_facet": { | |
| 170 | - "terms": { | |
| 171 | - "field": "category1_name", | |
| 172 | - "size": 15, | |
| 173 | - "order": { | |
| 174 | - "_count": "desc" | |
| 235 | + "filter": [ | |
| 236 | + { "match_all": {} } | |
| 237 | + ] | |
| 175 | 238 | } |
| 176 | - } | |
| 177 | 239 | }, |
| 178 | - "specifications_color_facet": { | |
| 179 | - "nested": { | |
| 180 | - "path": "specifications" | |
| 181 | - }, | |
| 182 | - "aggs": { | |
| 183 | - "filter_by_name": { | |
| 184 | - "filter": { | |
| 185 | - "term": { | |
| 186 | - "specifications.name": "color" | |
| 240 | + "aggs": { | |
| 241 | + "category1_name_facet": { | |
| 242 | + "terms": { | |
| 243 | + "field": "category1_name", | |
| 244 | + "size": 15, | |
| 245 | + "order": { "_count": "desc" } | |
| 187 | 246 | } |
| 188 | - }, | |
| 189 | - "aggs": { | |
| 190 | - "value_counts": { | |
| 191 | - "terms": { | |
| 192 | - "field": "specifications.value", | |
| 193 | - "size": 20, | |
| 194 | - "order": { | |
| 195 | - "_count": "desc" | |
| 247 | + }, | |
| 248 | + "specifications_color_facet": { | |
| 249 | + "nested": { "path": "specifications" }, | |
| 250 | + "aggs": { | |
| 251 | + "filter_by_name": { | |
| 252 | + "filter": { "term": { "specifications.name": "color" } }, | |
| 253 | + "aggs": { | |
| 254 | + "value_counts": { | |
| 255 | + "terms": { | |
| 256 | + "field": "specifications.value", | |
| 257 | + "size": 20, | |
| 258 | + "order": { "_count": "desc" } | |
| 259 | + } | |
| 260 | + } | |
| 261 | + } | |
| 196 | 262 | } |
| 197 | - } | |
| 198 | 263 | } |
| 199 | - } | |
| 200 | - } | |
| 201 | - } | |
| 202 | - }, | |
| 203 | - "specifications_size_facet": { | |
| 204 | - "nested": { | |
| 205 | - "path": "specifications" | |
| 206 | - }, | |
| 207 | - "aggs": { | |
| 208 | - "filter_by_name": { | |
| 209 | - "filter": { | |
| 210 | - "term": { | |
| 211 | - "specifications.name": "size" | |
| 212 | - } | |
| 213 | - }, | |
| 214 | - "aggs": { | |
| 215 | - "value_counts": { | |
| 216 | - "terms": { | |
| 217 | - "field": "specifications.value", | |
| 218 | - "size": 15, | |
| 219 | - "order": { | |
| 220 | - "_count": "desc" | |
| 264 | + }, | |
| 265 | + "specifications_size_facet": { | |
| 266 | + "nested": { "path": "specifications" }, | |
| 267 | + "aggs": { | |
| 268 | + "filter_by_name": { | |
| 269 | + "filter": { "term": { "specifications.name": "size" } }, | |
| 270 | + "aggs": { | |
| 271 | + "value_counts": { | |
| 272 | + "terms": { | |
| 273 | + "field": "specifications.value", | |
| 274 | + "size": 15, | |
| 275 | + "order": { "_count": "desc" } | |
| 276 | + } | |
| 277 | + } | |
| 278 | + } | |
| 221 | 279 | } |
| 222 | - } | |
| 223 | - } | |
| 224 | - } | |
| 225 | - } | |
| 226 | - } | |
| 227 | - }, | |
| 228 | - "specifications_material_facet": { | |
| 229 | - "nested": { | |
| 230 | - "path": "specifications" | |
| 231 | - }, | |
| 232 | - "aggs": { | |
| 233 | - "filter_by_name": { | |
| 234 | - "filter": { | |
| 235 | - "term": { | |
| 236 | - "specifications.name": "material" | |
| 237 | 280 | } |
| 238 | - }, | |
| 239 | - "aggs": { | |
| 240 | - "value_counts": { | |
| 241 | - "terms": { | |
| 242 | - "field": "specifications.value", | |
| 243 | - "size": 10, | |
| 244 | - "order": { | |
| 245 | - "_count": "desc" | |
| 281 | + }, | |
| 282 | + "specifications_material_facet": { | |
| 283 | + "nested": { "path": "specifications" }, | |
| 284 | + "aggs": { | |
| 285 | + "filter_by_name": { | |
| 286 | + "filter": { "term": { "specifications.name": "material" } }, | |
| 287 | + "aggs": { | |
| 288 | + "value_counts": { | |
| 289 | + "terms": { | |
| 290 | + "field": "specifications.value", | |
| 291 | + "size": 10, | |
| 292 | + "order": { "_count": "desc" } | |
| 293 | + } | |
| 294 | + } | |
| 295 | + } | |
| 246 | 296 | } |
| 247 | - } | |
| 248 | 297 | } |
| 249 | - } | |
| 250 | 298 | } |
| 251 | - } | |
| 252 | 299 | } |
| 253 | - } | |
| 254 | 300 | }' |
| 301 | +``` | |
| 302 | + | |
| 303 | +--- | |
| 304 | + | |
| 305 | +### 4. General queries (generic index examples) | |
| 255 | 306 | | |
| 307 | +#### Match all documents | |
| 308 | +```bash | |
| 256 | 309 | GET /search_products_tenant_2/_search |
| 257 | 310 | { |
| 258 | - "query": { | |
| 259 | - "match_all": {} | |
| 260 | - } | |
| 311 | + "query": { | |
| 312 | + "match_all": {} | |
| 313 | + } | |
| 261 | 314 | } |
| 315 | +``` | |
| 262 | 316 | |
| 263 | - | |
| 264 | -curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 317 | +#### Query by spu_id (generic index) | |
| 318 | +```bash | |
| 319 | +curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products/_search?pretty' -H 'Content-Type: application/json' -d '{ | |
| 265 | 320 | "size": 5, |
| 266 | 321 | "query": { |
| 267 | - "bool": { | |
| 268 | - "filter": [ | |
| 269 | - { "term": { "spu_id": "74123" } } | |
| 270 | - ] | |
| 271 | - } | |
| 322 | + "bool": { | |
| 323 | + "filter": [ | |
| 324 | + { "term": { "spu_id": "74123" } } | |
| 325 | + ] | |
| 326 | + } | |
| 272 | 327 | } |
| 273 | - }' | |
| 328 | +}' | |
| 329 | +``` | |
| 330 | + | |
| 331 | +--- | |
| 274 | 332 | |
| 333 | +### 5. Count a tenant's total documents | |
| 275 | 334 | | |
| 276 | -### 2. Count the total documents of a tenant | |
| 335 | +```bash | |
| 277 | 336 | curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_170/_count?pretty' -H 'Content-Type: application/json' -d '{ |
| 278 | - "query": { | |
| 279 | - "match_all": {} | |
| 280 | - } | |
| 337 | + "query": { | |
| 338 | + "match_all": {} | |
| 339 | + } | |
| 281 | 340 | }' |
| 341 | +``` | |
| 282 | 342 | |
| 343 | +--- | |
| 283 | 344 | |
| 284 | -# ====================================== | |
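| 285 | -# Facet data diagnostic queries | |
| 286 | -# ====================================== | |
| 345 | +## Facet data diagnostic queries | |
| 287 | 346 | | |
| 288 | -## 1. Check facet field data in ES documents | |
| 347 | +### 1. Check facet field data in ES documents | |
| 289 | 348 | | |
| 290 | -### 1.1 Query a specific tenant's products and show facet-related fields | |
| 349 | +#### 1.1 Query a specific tenant's products and show facet-related fields | |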
| 350 | +```bash | |
| 291 | 351 | curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_162/_search?pretty' -H 'Content-Type: application/json' -d '{ |
| 292 | - "query": { | |
| 293 | - "term": { | |
| 294 | - "tenant_id": "162" | |
| 295 | - } | |
| 296 | - }, | |
| 297 | - "size": 1, | |
| 298 | - "_source": [ | |
| 299 | - "spu_id", | |
| 300 | - "title", | |
| 301 | - "category1_name", | |
| 302 | - "category2_name", | |
| 303 | - "category3_name", | |
| 304 | - "specifications", | |
| 305 | - "option1_name", | |
| 306 | - "option2_name", | |
| 307 | - "option3_name" | |
| 308 | - ] | |
| 352 | + "query": { | |
| 353 | + "term": { "tenant_id": "162" } | |
| 354 | + }, | |
| 355 | + "size": 1, | |
| 356 | + "_source": [ | |
| 357 | + "spu_id", "title", "category1_name", "category2_name", | |
| 358 | + "category3_name", "specifications", "option1_name", | |
| 359 | + "option2_name", "option3_name" | |
| 360 | + ] | |
| 309 | 361 | }' |
| 362 | +``` | |
| 310 | 363 | |
| 311 | -### 1.2 Verify that the category1_name field has data | |
| 364 | +#### 1.2 Verify that the category1_name field has data | |
| 365 | +```bash | |
| 312 | 366 | curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_162/_search?pretty' -H 'Content-Type: application/json' -d '{ |
| 313 | - "query": { | |
| 314 | - "bool": { | |
| 315 | - "filter": [ | |
| 316 | - { "term": { "tenant_id": "162" } }, | |
| 317 | - { "exists": { "field": "category1_name" } } | |
| 318 | - ] | |
| 319 | - } | |
| 320 | - }, | |
| 321 | - "size": 0 | |
| 367 | + "query": { | |
| 368 | + "bool": { | |
| 369 | + "filter": [ | |
| 370 | + { "term": { "tenant_id": "162" } }, | |
| 371 | + { "exists": { "field": "category1_name" } } | |
| 372 | + ] | |
| 373 | + } | |
| 374 | + }, | |
| 375 | + "size": 0 | |
| 322 | 376 | }' |
| 377 | +``` | |
| 323 | 378 | |
| 324 | -### 1.3 Verify that the specifications field has data | |
| 379 | +#### 1.3 Verify that the specifications field has data | |
| 380 | +```bash | |
| 325 | 381 | curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_162/_search?pretty' -H 'Content-Type: application/json' -d '{ |
| 326 | - "query": { | |
| 327 | - "bool": { | |
| 328 | - "filter": [ | |
| 329 | - { "term": { "tenant_id": "162" } }, | |
| 330 | - { "exists": { "field": "specifications" } } | |
| 331 | - ] | |
| 332 | - } | |
| 333 | - }, | |
| 334 | - "size": 0 | |
| 382 | + "query": { | |
| 383 | + "bool": { | |
| 384 | + "filter": [ | |
| 385 | + { "term": { "tenant_id": "162" } }, | |
| 386 | + { "exists": { "field": "specifications" } } | |
| 387 | + ] | |
| 388 | + } | |
| 389 | + }, | |
| 390 | + "size": 0 | |
| 335 | 391 | }' |
| 392 | +``` | |
| 336 | 393 | |
| 337 | -## 2. Facet aggregation queries | |
| 394 | +--- | |
| 338 | 395 | | |
| 339 | -### 2.1 category1_name facet aggregation | |
| 396 | +### 2. Facet aggregation queries | |
| 397 | + | |
| 398 | +#### 2.1 category1_name facet aggregation | |
| 399 | +```bash | |
| 340 | 400 | curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_162/_search?pretty' -H 'Content-Type: application/json' -d '{ |
| 341 | - "query": { | |
| 342 | - "match_all": {} | |
| 343 | - }, | |
| 344 | - "size": 0, | |
| 345 | - "aggs": { | |
| 346 | - "category1_name_facet": { | |
| 347 | - "terms": { | |
| 348 | - "field": "category1_name", | |
| 349 | - "size": 50 | |
| 350 | - } | |
| 401 | + "query": { "match_all": {} }, | |
| 402 | + "size": 0, | |
| 403 | + "aggs": { | |
| 404 | + "category1_name_facet": { | |
| 405 | + "terms": { "field": "category1_name", "size": 50 } | |
| 406 | + } | |
| 351 | 407 | } |
| 352 | - } | |
| 353 | 408 | }' |
| 409 | +``` | |
| 354 | 410 | |
| 355 | -### 2.2 specifications.color facet aggregation | |
| 411 | +#### 2.2 specifications.color facet aggregation | |
| 412 | +```bash | |
| 356 | 413 | curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_162/_search?pretty' -H 'Content-Type: application/json' -d '{ |
| 357 | - "query": { | |
| 358 | - "match_all": {} | |
| 359 | - }, | |
| 360 | - "size": 0, | |
| 361 | - "aggs": { | |
| 362 | - "specifications_color_facet": { | |
| 363 | - "nested": { | |
| 364 | - "path": "specifications" | |
| 365 | - }, | |
| 366 | - "aggs": { | |
| 367 | - "filtered": { | |
| 368 | - "filter": { | |
| 369 | - "term": { | |
| 370 | - "specifications.name": "color" | |
| 371 | - } | |
| 372 | - }, | |
| 373 | - "aggs": { | |
| 374 | - "values": { | |
| 375 | - "terms": { | |
| 376 | - "field": "specifications.value", | |
| 377 | - "size": 50 | |
| 378 | - } | |
| 414 | + "query": { "match_all": {} }, | |
| 415 | + "size": 0, | |
| 416 | + "aggs": { | |
| 417 | + "specifications_color_facet": { | |
| 418 | + "nested": { "path": "specifications" }, | |
| 419 | + "aggs": { | |
| 420 | + "filtered": { | |
| 421 | + "filter": { "term": { "specifications.name": "color" } }, | |
| 422 | + "aggs": { | |
| 423 | + "values": { "terms": { "field": "specifications.value", "size": 50 } } | |
| 424 | + } | |
| 425 | + } | |
| 379 | 426 | } |
| 380 | - } | |
| 381 | 427 | } |
| 382 | - } | |
| 383 | 428 | } |
| 384 | - } | |
| 385 | 429 | }' |
| 430 | +``` | |
| 386 | 431 | |
| 387 | -### 2.3 specifications.size facet aggregation | |
| 432 | +#### 2.3 specifications.size facet aggregation | |
| 433 | +```bash | |
| 388 | 434 | curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_162/_search?pretty' -H 'Content-Type: application/json' -d '{ |
| 389 | - "query": { | |
| 390 | - "match_all": {} | |
| 391 | - }, | |
| 392 | - "size": 0, | |
| 393 | - "aggs": { | |
| 394 | - "specifications_size_facet": { | |
| 395 | - "nested": { | |
| 396 | - "path": "specifications" | |
| 397 | - }, | |
| 398 | - "aggs": { | |
| 399 | - "filtered": { | |
| 400 | - "filter": { | |
| 401 | - "term": { | |
| 402 | - "specifications.name": "size" | |
| 403 | - } | |
| 404 | - }, | |
| 405 | - "aggs": { | |
| 406 | - "values": { | |
| 407 | - "terms": { | |
| 408 | - "field": "specifications.value", | |
| 409 | - "size": 50 | |
| 410 | - } | |
| 435 | + "query": { "match_all": {} }, | |
| 436 | + "size": 0, | |
| 437 | + "aggs": { | |
| 438 | + "specifications_size_facet": { | |
| 439 | + "nested": { "path": "specifications" }, | |
| 440 | + "aggs": { | |
| 441 | + "filtered": { | |
| 442 | + "filter": { "term": { "specifications.name": "size" } }, | |
| 443 | + "aggs": { | |
| 444 | + "values": { "terms": { "field": "specifications.value", "size": 50 } } | |
| 445 | + } | |
| 446 | + } | |
| 411 | 447 | } |
| 412 | - } | |
| 413 | 448 | } |
| 414 | - } | |
| 415 | 449 | } |
| 416 | - } | |
| 417 | 450 | }' |
| 451 | +``` | |
| 418 | 452 | |
| 419 | -### 2.4 specifications.material facet aggregation | |
| 453 | +#### 2.4 specifications.material facet aggregation | |
| 454 | +```bash | |
| 420 | 455 | curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_162/_search?pretty' -H 'Content-Type: application/json' -d '{ |
| 421 | - "query": { | |
| 422 | - "match_all": {} | |
| 423 | - }, | |
| 424 | - "size": 0, | |
| 425 | - "aggs": { | |
| 426 | - "specifications_material_facet": { | |
| 427 | - "nested": { | |
| 428 | - "path": "specifications" | |
| 429 | - }, | |
| 430 | - "aggs": { | |
| 431 | - "filtered": { | |
| 432 | - "filter": { | |
| 433 | - "term": { | |
| 434 | - "specifications.name": "material" | |
| 435 | - } | |
| 436 | - }, | |
| 437 | - "aggs": { | |
| 438 | - "values": { | |
| 439 | - "terms": { | |
| 440 | - "field": "specifications.value", | |
| 441 | - "size": 50 | |
| 442 | - } | |
| 456 | + "query": { "match_all": {} }, | |
| 457 | + "size": 0, | |
| 458 | + "aggs": { | |
| 459 | + "specifications_material_facet": { | |
| 460 | + "nested": { "path": "specifications" }, | |
| 461 | + "aggs": { | |
| 462 | + "filtered": { | |
| 463 | + "filter": { "term": { "specifications.name": "material" } }, | |
| 464 | + "aggs": { | |
| 465 | + "values": { "terms": { "field": "specifications.value", "size": 50 } } | |
| 466 | + } | |
| 467 | + } | |
| 443 | 468 | } |
| 444 | - } | |
| 445 | 469 | } |
| 446 | - } | |
| 447 | 470 | } |
| 448 | - } | |
| 449 | 471 | }' |
| 472 | +``` | |
| 450 | 473 | |
| 451 | -### 2.5 Combined facet aggregation (category + color + size + material) | |
| 474 | +#### 2.5 Combined facet aggregation (category + color + size + material) | |
| 475 | +```bash | |
| 452 | 476 | curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_162/_search?pretty' -H 'Content-Type: application/json' -d '{ |
| 453 | - "query": { | |
| 454 | - "match_all": {} | |
| 455 | - }, | |
| 456 | - "size": 0, | |
| 457 | - "aggs": { | |
| 458 | - "category1_name_facet": { | |
| 459 | - "terms": { | |
| 460 | - "field": "category1_name", | |
| 461 | - "size": 50 | |
| 462 | - } | |
| 463 | - }, | |
| 464 | - "specifications_color_facet": { | |
| 465 | - "nested": { | |
| 466 | - "path": "specifications" | |
| 467 | - }, | |
| 468 | - "aggs": { | |
| 469 | - "filtered": { | |
| 470 | - "filter": { | |
| 471 | - "term": { | |
| 472 | - "specifications.name": "color" | |
| 473 | - } | |
| 474 | - }, | |
| 475 | - "aggs": { | |
| 476 | - "values": { | |
| 477 | - "terms": { | |
| 478 | - "field": "specifications.value", | |
| 479 | - "size": 50 | |
| 480 | - } | |
| 481 | - } | |
| 482 | - } | |
| 483 | - } | |
| 484 | - } | |
| 485 | - }, | |
| 486 | - "specifications_size_facet": { | |
| 487 | - "nested": { | |
| 488 | - "path": "specifications" | |
| 489 | - }, | |
| 490 | - "aggs": { | |
| 491 | - "filtered": { | |
| 492 | - "filter": { | |
| 493 | - "term": { | |
| 494 | - "specifications.name": "size" | |
| 495 | - } | |
| 496 | - }, | |
| 497 | - "aggs": { | |
| 498 | - "values": { | |
| 499 | - "terms": { | |
| 500 | - "field": "specifications.value", | |
| 501 | - "size": 50 | |
| 502 | - } | |
| 477 | + "query": { "match_all": {} }, | |
| 478 | + "size": 0, | |
| 479 | + "aggs": { | |
| 480 | + "category1_name_facet": { "terms": { "field": "category1_name", "size": 50 } }, | |
| 481 | + "specifications_color_facet": { | |
| 482 | + "nested": { "path": "specifications" }, | |
| 483 | + "aggs": { | |
| 484 | + "filtered": { | |
| 485 | + "filter": { "term": { "specifications.name": "color" } }, | |
| 486 | + "aggs": { "values": { "terms": { "field": "specifications.value", "size": 50 } } } | |
| 487 | + } | |
| 503 | 488 | } |
| 504 | - } | |
| 505 | - } | |
| 506 | - } | |
| 507 | - }, | |
| 508 | - "specifications_material_facet": { | |
| 509 | - "nested": { | |
| 510 | - "path": "specifications" | |
| 511 | - }, | |
| 512 | - "aggs": { | |
| 513 | - "filtered": { | |
| 514 | - "filter": { | |
| 515 | - "term": { | |
| 516 | - "specifications.name": "material" | |
| 489 | + }, | |
| 490 | + "specifications_size_facet": { | |
| 491 | + "nested": { "path": "specifications" }, | |
| 492 | + "aggs": { | |
| 493 | + "filtered": { | |
| 494 | + "filter": { "term": { "specifications.name": "size" } }, | |
| 495 | + "aggs": { "values": { "terms": { "field": "specifications.value", "size": 50 } } } | |
| 496 | + } | |
| 517 | 497 | } |
| 518 | - }, | |
| 519 | - "aggs": { | |
| 520 | - "values": { | |
| 521 | - "terms": { | |
| 522 | - "field": "specifications.value", | |
| 523 | - "size": 50 | |
| 524 | - } | |
| 498 | + }, | |
| 499 | + "specifications_material_facet": { | |
| 500 | + "nested": { "path": "specifications" }, | |
| 501 | + "aggs": { | |
| 502 | + "filtered": { | |
| 503 | + "filter": { "term": { "specifications.name": "material" } }, | |
| 504 | + "aggs": { "values": { "terms": { "field": "specifications.value", "size": 50 } } } | |
| 505 | + } | |
| 525 | 506 | } |
| 526 | - } | |
| 527 | 507 | } |
| 528 | - } | |
| 529 | 508 | } |
| 530 | - } | |
| 531 | 509 | }' |
| 510 | +``` | |
| 511 | + | |
| 512 | +--- | |
| 532 | 513 | |
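| 533 | -## 3. Inspect the detailed structure of the nested specifications field | |
| 514 | +### 3. Inspect the detailed structure of the nested specifications field | |
| 534 | 515 | | |
| 535 | -### 3.1 List the values of the specifications name field | |
| 516 | +#### 3.1 List the values of the specifications name field | |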
| 517 | +```bash | |
| 536 | 518 | curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products/_search?pretty' -H 'Content-Type: application/json' -d '{ |
| 537 | - "query": { | |
| 538 | - "term": { | |
| 539 | - "tenant_id": "162" | |
| 540 | - } | |
| 541 | - }, | |
| 542 | - "size": 0, | |
| 543 | - "aggs": { | |
| 544 | - "specifications_names": { | |
| 545 | - "nested": { | |
| 546 | - "path": "specifications" | |
| 547 | - }, | |
| 548 | - "aggs": { | |
| 549 | - "name_values": { | |
| 550 | - "terms": { | |
| 551 | - "field": "specifications.name", | |
| 552 | - "size": 20 | |
| 553 | - } | |
| 519 | + "query": { "term": { "tenant_id": "162" } }, | |
| 520 | + "size": 0, | |
| 521 | + "aggs": { | |
| 522 | + "specifications_names": { | |
| 523 | + "nested": { "path": "specifications" }, | |
| 524 | + "aggs": { | |
| 525 | + "name_values": { "terms": { "field": "specifications.name", "size": 20 } } | |
| 526 | + } | |
| 554 | 527 | } |
| 555 | - } | |
| 556 | 528 | } |
| 557 | - } | |
| 558 | 529 | }' |
| 530 | +``` | |
| 559 | 531 | |
| 560 | -### 3.2 View the full specifications data for one product | |
| 532 | +#### 3.2 View the full specifications data for one product | |
| 533 | +```bash | |
| 561 | 534 | curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products/_search?pretty' -H 'Content-Type: application/json' -d '{ |
| 562 | - "query": { | |
| 563 | - "bool": { | |
| 564 | - "filter": [ | |
| 565 | - { "term": { "tenant_id": "162" } }, | |
| 566 | - { "exists": { "field": "specifications" } } | |
| 567 | - ] | |
| 568 | - } | |
| 569 | - }, | |
| 570 | - "size": 1, | |
| 571 | - "_source": ["spu_id", "title", "specifications"] | |
| 535 | + "query": { | |
| 536 | + "bool": { | |
| 537 | + "filter": [ | |
| 538 | + { "term": { "tenant_id": "162" } }, | |
| 539 | + { "exists": { "field": "specifications" } } | |
| 540 | + ] | |
| 541 | + } | |
| 542 | + }, | |
| 543 | + "size": 1, | |
| 544 | + "_source": ["spu_id", "title", "specifications"] | |
| 572 | 545 | }' |
| 546 | +``` | |
| 573 | 547 | |
| 574 | -## 4. Count queries | |
| 548 | +--- | |
| 575 | 549 | | |
| 576 | -### 4.1 Count documents that have category1_name | |
| 550 | +### 4. Count queries | |
| 551 | + | |
| 552 | +#### 4.1 Count documents that have category1_name | |
| 553 | +```bash | |
| 577 | 554 | curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_162/_count?pretty' -H 'Content-Type: application/json' -d '{ |
| 578 | - "query": { | |
| 579 | - "bool": { | |
| 580 | - "filter": [ | |
| 581 | - { "exists": { "field": "category1_name" } } | |
| 582 | - ] | |
| 555 | + "query": { | |
| 556 | + "bool": { | |
| 557 | + "filter": [ | |
| 558 | + { "exists": { "field": "category1_name" } } | |
| 559 | + ] | |
| 560 | + } | |
| 583 | 561 | } |
| 584 | - } | |
| 585 | 562 | }' |
| 563 | +``` | |
| 586 | 564 | |
| 587 | -### 4.2 Count documents that have specifications | |
| 565 | +#### 4.2 Count documents that have specifications | |
| 566 | +```bash | |
| 588 | 567 | curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_162/_count?pretty' -H 'Content-Type: application/json' -d '{ |
| 589 | - "query": { | |
| 590 | - "bool": { | |
| 591 | - "filter": [ | |
| 592 | - { "exists": { "field": "specifications" } } | |
| 593 | - ] | |
| 568 | + "query": { | |
| 569 | + "bool": { | |
| 570 | + "filter": [ | |
| 571 | + { "exists": { "field": "specifications" } } | |
| 572 | + ] | |
| 573 | + } | |
| 594 | 574 | } |
| 595 | - } | |
| 596 | 575 | }' |
| 576 | +``` | |
| 597 | 577 | |
| 578 | +--- | |
| 598 | 579 | |
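| 599 | -## 5. Diagnostic scenarios | |
| 580 | +### 5. Diagnostic scenarios | |
| 600 | 581 | | |
| 601 | -### 5.1 Find documents that lack category1_name but have a category (data present in MySQL but missing in ES) | |
| 582 | +#### 5.1 Find documents that lack category1_name but have a category (data present in MySQL but missing in ES) | |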
| 583 | +```bash | |
| 602 | 584 | curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_162/_search?pretty' -H 'Content-Type: application/json' -d '{ |
| 603 | - "query": { | |
| 604 | - "bool": { | |
| 605 | - "filter": [ | |
| 606 | - { "term": { "tenant_id": "162" } } | |
| 607 | - ], | |
| 608 | - "must_not": [ | |
| 609 | - { "exists": { "field": "category1_name" } } | |
| 610 | - ] | |
| 611 | - } | |
| 612 | - }, | |
| 613 | - "size": 10, | |
| 614 | - "_source": ["spu_id", "title", "category_name_text", "category_path"] | |
| 585 | + "query": { | |
| 586 | + "bool": { | |
| 587 | + "filter": [ | |
| 588 | + { "term": { "tenant_id": "162" } } | |
| 589 | + ], | |
| 590 | + "must_not": [ | |
| 591 | + { "exists": { "field": "category1_name" } } | |
| 592 | + ] | |
| 593 | + } | |
| 594 | + }, | |
| 595 | + "size": 10, | |
| 596 | + "_source": ["spu_id", "title", "category_name_text", "category_path"] | |
| 615 | 597 | }' |
| 598 | +``` | |
| 616 | 599 | |
| 617 | -### 5.2 Find documents that have option but lack specifications (data conversion issue) | |
| 600 | +#### 5.2 Find documents that have option but lack specifications (data conversion issue) | |
| 601 | +```bash | |
| 618 | 602 | curl -u 'saas:4hOaLaf41y2VuI8y' -X GET 'http://localhost:9200/search_products_tenant_162/_search?pretty' -H 'Content-Type: application/json' -d '{ |
| 619 | - "query": { | |
| 620 | - "bool": { | |
| 621 | - "filter": [ | |
| 622 | - { "term": { "tenant_id": "162" } }, | |
| 623 | - { "exists": { "field": "option1_name" } } | |
| 624 | - ], | |
| 625 | - "must_not": [ | |
| 626 | - { "exists": { "field": "specifications" } } | |
| 627 | - ] | |
| 628 | - } | |
| 629 | - }, | |
| 630 | - "size": 10, | |
| 631 | - "_source": ["spu_id", "title", "option1_name", "option2_name", "option3_name", "specifications"] | |
| 603 | + "query": { | |
| 604 | + "bool": { | |
| 605 | + "filter": [ | |
| 606 | + { "term": { "tenant_id": "162" } }, | |
| 607 | + { "exists": { "field": "option1_name" } } | |
| 608 | + ], | |
| 609 | + "must_not": [ | |
| 610 | + { "exists": { "field": "specifications" } } | |
| 611 | + ] | |
| 612 | + } | |
| 613 | + }, | |
| 614 | + "size": 10, | |
| 615 | + "_source": ["spu_id", "title", "option1_name", "option2_name", "option3_name", "specifications"] | |
| 632 | 616 | }' |
| 617 | +``` | |
| 618 | + | |
| 619 | +--- | |
| 633 | 620 | |
| 621 | +## Re-ranking example | |
| 634 | 622 | | |
| 635 | -Re-ranking: | |
| 623 | +```bash | |
| 636 | 624 | GET /search_products_tenant_170/_search |
| 637 | 625 | { |
| 638 | 626 | "query": { |
| 639 | - "match": { | |
| 627 | + "match": { | |
| 640 | 628 | "title.en": { |
| 641 | 629 | "query": "quick brown fox", |
| 642 | 630 | "minimum_should_match": "90%" |
| ... | ... | @@ -644,31 +632,36 @@ GET /search_products_tenant_170/_search |
| 644 | 632 | } |
| 645 | 633 | }, |
| 646 | 634 | "rescore": { |
| 647 | - "window_size": 50, | |
| 648 | - "query": { | |
| 635 | + "window_size": 50, | |
| 636 | + "query": { | |
| 649 | 637 | "rescore_query": { |
| 650 | 638 | "match_phrase": { |
| 651 | 639 | "title.en": { |
| 652 | 640 | "query": "quick brown fox", |
| 653 | - "slop": 50 | |
| 641 | + "slop": 50 | |
| 654 | 642 | } |
| 655 | 643 | } |
| 656 | 644 | } |
| 657 | 645 | } |
| 658 | 646 | } |
| 659 | 647 | } |
| 648 | +``` | |
| 660 | 649 | |
| 650 | +--- | |
| 661 | 651 | |
| 662 | -Check whether a given field exists | |
| 652 | +## Check whether a field exists | |
| 653 | + | |
| 654 | +```bash | |
| 663 | 655 | curl -u 'saas:4hOaLaf41y2VuI8y' -X POST \ |
| 664 | - 'http://localhost:9200/search_products_tenant_163/_count' \ | |
| 665 | - -H 'Content-Type: application/json' \ | |
| 666 | - -d '{ | |
| 656 | +'http://localhost:9200/search_products_tenant_163/_count' \ | |
| 657 | +-H 'Content-Type: application/json' \ | |
| 658 | +-d '{ | |
| 667 | 659 | "query": { |
| 668 | - "bool": { | |
| 669 | - "filter": [ | |
| 670 | - { "exists": { "field": "title_embedding" } } | |
| 671 | - ] | |
| 672 | - } | |
| 660 | + "bool": { | |
| 661 | + "filter": [ | |
| 662 | + { "exists": { "field": "image_embedding" } } | |
| 663 | + ] | |
| 664 | + } | |
| 673 | 665 | } |
| 674 | - }' | |
| 675 | 666 | \ No newline at end of file |
| 667 | +}' | |
| 668 | +``` | |
| 676 | 669 | \ No newline at end of file | ... | ... |
docs/相关性检索优化说明.md
| ... | ... | @@ -260,6 +260,238 @@ python ./scripts/eval_search_quality.py |
| 260 | 260 | 4. Dynamic assembly of fields for languages other than `zh/en` (e.g. `de/fr/es`) | |
| 261 | 261 | |
| 262 | 262 | |
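| 263 | +# Search pipeline | |
| 264 | +**Overview** | |
| 265 | +The pipeline is best understood as a funnel: recall broadly first, then narrow the candidate set stage by stage while adding progressively more expensive signals: | |
| 266 | + | |
| 267 | +1. Query parsing | |
| 268 | +2. ES recall | |
| 269 | +3. Coarse ranking: uses only ES-internal text/KNN signals | |
| 270 | +4. Style SKU selection + title suffix | |
| 271 | +5. Fine ranking: lightweight reranker + text/KNN fusion | |
| 272 | +6. Final rerank: heavy reranker + fine score + text/KNN fusion | |
| 273 | +7. Pagination, field backfill, and response formatting | |
| 274 | + | |
| 275 | +The orchestration code is in [searcher.py](/data/saas-search/search/searcher.py), scoring and rerank details are in [rerank_client.py](/data/saas-search/search/rerank_client.py), and configuration is defined in [schema.py](/data/saas-search/config/schema.py) and [config.yaml](/data/saas-search/config/config.yaml). | |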
| 276 | + | |
| 277 | +**How the entry point chooses a path** | |
| 278 | +Starting at [searcher.py:348](/data/saas-search/search/searcher.py#L348), `search()` first reads the tenant language, feature switches, and window sizes. | |
| 279 | +The key decision is at [searcher.py:364](/data/saas-search/search/searcher.py#L364) through [searcher.py:372](/data/saas-search/search/searcher.py#L372): | |
| 280 | + | |
| 281 | +- `rerank_window` is currently 80; see [config.yaml:256](/data/saas-search/config/config.yaml#L256) | |
| 282 | +- `coarse_rank.input_window` is 700 and `output_window` is 240; see [config.yaml:231](/data/saas-search/config/config.yaml#L231) | |
| 283 | +- `fine_rank.input_window` is 240 and `output_window` is 80; see [config.yaml:245](/data/saas-search/config/config.yaml#L245) | |
| 284 | + | |
| 285 | +So if a request satisfies `from_ + size <= rerank_window`, it enters the full funnel: | |
| 286 | +- ES actually fetches the top `700` | |
| 287 | +- Coarse ranking keeps `240` | |
| 288 | +- Fine ranking keeps `80` | |
| 289 | +- The final rerank processes only those `80` | |
| 290 | +- Pagination slicing happens last | |
| 291 | + | |
| 292 | +If the requested page extends beyond the first 80 results, the multi-stage funnel is skipped and results are returned using the plain ES logic. | |
| 293 | + | |
| 294 | +This matters because it ensures that the expensive models serve only the head of the result list. | |
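The routing rule described above can be sketched as follows. This is an illustration only: `FunnelWindows` and `plan_request` are hypothetical names rather than functions from `searcher.py`, and the idea that the funnel path always fetches from offset 0 is an assumption inferred from "ES actually fetches the top 700".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunnelWindows:
    """Window sizes quoted in the text; the class itself is hypothetical."""
    es_fetch: int = 700       # coarse_rank.input_window
    coarse_keep: int = 240    # coarse_rank.output_window == fine_rank.input_window
    fine_keep: int = 80       # fine_rank.output_window
    rerank_window: int = 80   # rerank_window


def plan_request(from_: int, size: int, w: FunnelWindows = FunnelWindows()) -> dict:
    """Decide whether a page request enters the multi-stage funnel."""
    if from_ + size <= w.rerank_window:
        # Head of the result list: fetch wide, rank in stages, slice the page last.
        return {
            "funnel": True,
            "es_from": 0,              # assumption: always fetch from the top
            "es_size": w.es_fetch,
            "coarse_keep": w.coarse_keep,
            "fine_keep": w.fine_keep,
            "page_slice": (from_, from_ + size),
        }
    # Deep page: skip the funnel and let ES paginate directly.
    return {"funnel": False, "es_from": from_, "es_size": size}
```

For example, `plan_request(70, 10)` still enters the funnel because `70 + 10 <= 80`, while `plan_request(80, 20)` falls back to plain ES pagination.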
| 295 | + | |
| 296 | +**Step 1: Query parsing** | |
| 297 | +At [searcher.py:432](/data/saas-search/search/searcher.py#L432) through [searcher.py:469](/data/saas-search/search/searcher.py#L469): | |
| 298 | +`query_parser.parse()` does several things: | |
| 299 | + | |
| 300 | +- Normalizes the query | |
| 301 | +- Detects the language | |
| 302 | +- Possibly rewrites the query | |
| 303 | +- Generates the text embedding | |
| 304 | +- For image search, also carries the image embedding | |
| 305 | +- Produces translations | |
| 306 | +- Identifies style intent | |
| 307 | + | |
| 308 | +The result is stored in `parsed_query`; the ES query, style SKU selection, and fine/final rerank all depend on it. | |
| 309 | + | |
| 310 | +**Step 2: Building the ES query** | |
| 311 | +The ES DSL is built starting at [searcher.py:471](/data/saas-search/search/searcher.py#L471), via `build_query()` at [es_query_builder.py:181](/data/saas-search/search/es_query_builder.py#L181). | |
| 312 | + | |
| 313 | +The core structure is: | |
| 314 | +- A text recall clause | |
| 315 | +- A text-vector KNN clause | |
| 316 | +- An image-vector KNN clause | |
| 317 | +- All three go into `bool.should` | |
| 318 | +- Filter conditions go into `filter` | |
| 319 | +- Multi-select facet conditions go through `post_filter` | |
| 320 | + | |
| 321 | +The KNN part follows [es_query_builder.py:250](/data/saas-search/search/es_query_builder.py#L250): | |
| 322 | +- The text-vector clause is always named `knn_query` | |
| 323 | +- The image-vector clause is always named `image_knn_query` | |
| 324 | + | |
| 325 | +On the text recall side, the downstream fusion code reads, by convention: | |
| 326 | +- The named query for the original query: `base_query` | |
| 327 | +- The named queries for translated queries: `base_query_trans_*` | |
| 328 | + | |
| 329 | +In other words, coarse ranking, fine ranking, and the final rerank do not reinterpret the overall ES score; they pull these named sub-signals out of `matched_queries` and recompute their own scores. | |
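A minimal sketch of how the named sub-signals could be pulled apart, assuming `matched_queries` arrives as a name-to-score mapping (as the `matched_queries["base_query"]` notation used later in this document suggests). `split_named_scores` is a hypothetical helper, not the repository's `_build_hit_signal_bundle()`, and treating a missing name as `0.0` is an assumption.

```python
from typing import Dict, Mapping

TRANSLATION_PREFIX = "base_query_trans_"


def split_named_scores(matched_queries: Mapping[str, float]) -> Dict[str, float]:
    """Extract the per-clause scores that the later ranking stages consume."""
    # All translated-query clauses share a common prefix; keep the best one.
    translation_scores = [
        float(score)
        for name, score in matched_queries.items()
        if name.startswith(TRANSLATION_PREFIX)
    ]
    return {
        "source_score": float(matched_queries.get("base_query", 0.0)),
        "translation_score": max(translation_scores, default=0.0),
        "text_knn_score": float(matched_queries.get("knn_query", 0.0)),
        "image_knn_score": float(matched_queries.get("image_knn_query", 0.0)),
    }
```

A hit that matched only the text clauses simply gets zero for both KNN entries, so downstream formulas do not need to special-case missing signals.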
| 330 | + | |
| 331 | +**Step 3: ES recall** | |
| 332 | +At [searcher.py:579](/data/saas-search/search/searcher.py#L579) through [searcher.py:627](/data/saas-search/search/searcher.py#L627). | |
| 333 | + | |
| 334 | +There is a key engineering optimization here: | |
| 335 | +Within the rerank window, the first ES fetch disables `_source` and retrieves only the signals needed for ranking; see [searcher.py:517](/data/saas-search/search/searcher.py#L517) through [searcher.py:523](/data/saas-search/search/searcher.py#L523). | |
| 336 | + | |
| 337 | +The reasons: | |
| 338 | +- Coarse ranking initially needs only `_score` and `matched_queries` | |
| 339 | +- There is no need to pull full product details for all 700 hits up front | |
| 340 | +- Once coarse ranking has narrowed the set, the fields needed by fine/final rerank are backfilled | |
| 341 | + | |
| 342 | +This is a central performance design point of the current pipeline. | |
| 343 | + | |
| 344 | +**Step 4: Coarse ranking** | |
| 345 | +The coarse-rank entry point is at [searcher.py:638](/data/saas-search/search/searcher.py#L638); the actual scoring happens in `coarse_resort_hits()` at [rerank_client.py:348](/data/saas-search/search/rerank_client.py#L348). | |
| 346 | + | |
| 347 | +Coarse ranking looks at only two kinds of signals: | |
| 348 | +- `text_score` | |
| 349 | +- `knn_score` | |
| 350 | + | |
| 351 | +Both come from the shared helper `_build_hit_signal_bundle()`; see [rerank_client.py:246](/data/saas-search/search/rerank_client.py#L246). | |
| 352 | + | |
| 353 | +How the text score is derived, see [rerank_client.py:200](/data/saas-search/search/rerank_client.py#L200): | |
| 354 | +- `source_score = matched_queries["base_query"]` | |
| 355 | +- `translation_score = max(base_query_trans_*)` | |
| 356 | +- `weighted_translation = 0.8 * translation_score` | |
| 357 | +- `primary_text = max(source, weighted_translation)` | |
| 358 | +- `support_text = the other path` | |
| 359 | +- `text_score = primary_text + 0.25 * support_text` | |
| 360 | + | |
| 361 | +This is a text dismax approach: | |
| 362 | +The stronger of the original query and the down-weighted translation acts as the primary path and the other as the support path, rather than simply adding the two. | |
| 363 | + | |
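The text-score steps listed above can be written as a small standalone function. This is a sketch under stated assumptions, not the code in `rerank_client.py`: the function and parameter names are hypothetical, the 0.8 and 0.25 constants are the ones quoted above, and a missing translation match is assumed to score 0.

```python
from typing import Iterable


def dismax_text_score(
    source_score: float,
    translation_scores: Iterable[float],
    translation_weight: float = 0.8,  # down-weighting quoted in the list above
    tie_breaker: float = 0.25,        # support-path multiplier quoted above
) -> float:
    """Combine original-query and translated-query scores dismax-style."""
    # Best translated named query; assumed to be 0.0 when none matched.
    translation_score = max(translation_scores, default=0.0)
    weighted_translation = translation_weight * translation_score
    # The stronger path leads; the other path only contributes a fraction.
    primary = max(source_score, weighted_translation)
    support = min(source_score, weighted_translation)
    return primary + tie_breaker * support
```

For example, a source score of 2.0 with translation scores 1.0 and 1.5 gives a weighted translation of 1.2, so the result is 2.0 + 0.25 * 1.2 = 2.3.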
| 364 | +How the vector score is derived, see [rerank_client.py:156](/data/saas-search/search/rerank_client.py#L156): | |
| 365 | +- `text_knn_score` | |
| 366 | +- `image_knn_score` | |
| 367 | +- Each is multiplied by its own weight | |
| 368 | +- The stronger path becomes the primary | |
| 369 | +- The weaker path contributes as support, scaled by `knn_tie_breaker` | |
| 370 | + | |
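The same dismax idea applies to the two KNN paths. A minimal sketch, assuming the tie breaker multiplies the weaker weighted score; `text_weight` and `image_weight` are hypothetical parameter names, and no real config values are implied.

```python
def dismax_knn_score(
    text_knn_score: float,
    image_knn_score: float,
    text_weight: float,       # hypothetical name for the text-KNN weight
    image_weight: float,      # hypothetical name for the image-KNN weight
    knn_tie_breaker: float,
) -> float:
    """Weight each KNN path, then let the stronger one lead."""
    weighted_text = text_weight * text_knn_score
    weighted_image = image_weight * image_knn_score
    primary = max(weighted_text, weighted_image)
    support = min(weighted_text, weighted_image)
    return primary + knn_tie_breaker * support
```

With illustrative inputs `text_knn_score=0.8`, `image_knn_score=0.5`, `text_weight=1.0`, `image_weight=2.0`, and `knn_tie_breaker=0.25`, the image path becomes primary (1.0) and the text path adds 0.25 * 0.8, for a total of 1.2.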
| 371 | +The coarse-rank fusion formula is then at [rerank_client.py:334](/data/saas-search/search/rerank_client.py#L334): | |
| 372 | +- `coarse_score = (text_score + text_bias)^text_exponent * (knn_score + knn_bias)^knn_exponent` | |
| 373 | + | |
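As a rough sketch of the coarse stage as a whole (score, sort, truncate): the helper below is hypothetical, it assumes each hit already carries `text_score` and `knn_score` keys, and it assumes negative raw scores are clamped to 0 in the same way as the rerank fusion helper in this commit.

```python
from typing import Any, Dict, List


def coarse_rank(
    hits: List[Dict[str, Any]],
    *,
    text_bias: float,
    text_exponent: float,
    knn_bias: float,
    knn_exponent: float,
    keep: int,
) -> List[Dict[str, Any]]:
    """Score every hit with the multiplicative formula and keep the top `keep`."""
    for hit in hits:
        # Clamping at 0 is an assumption carried over from the rerank fusion code.
        text_factor = (max(hit["text_score"], 0.0) + text_bias) ** text_exponent
        knn_factor = (max(hit["knn_score"], 0.0) + knn_bias) ** knn_exponent
        hit["_coarse_score"] = text_factor * knn_factor
    return sorted(hits, key=lambda h: h["_coarse_score"], reverse=True)[:keep]
```

Because the terms are multiplied, a hit with a zero text factor scores 0 no matter how strong its KNN signal is, which is one reason a bias on each term is useful.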
| 374 | +The configuration is defined in [schema.py:124](/data/saas-search/config/schema.py#L124) and [config.yaml:231](/data/saas-search/config/config.yaml#L231). | |
| 375 | + | |
| 376 | +After scoring: | |
| 377 | +- Write `hit["_coarse_score"]` | |
| 378 | +- Sort by `_coarse_score` | |
| 379 | +- Keep the top 240; see [searcher.py:645](/data/saas-search/search/searcher.py#L645) | |
| 380 | + | |
| 381 | +**Step 5: Field backfill after coarse ranking + SKU selection** | |
| 382 | +After coarse ranking, `searcher` derives from the doc template which `_source` fields fine/final rerank need and backfills only those; see [searcher.py:669](/data/saas-search/search/searcher.py#L669). | |
| 383 | + | |
| 384 | +Only then does style SKU selection run; see [searcher.py:696](/data/saas-search/search/searcher.py#L696). | |
| 385 | + | |
| 386 | +Why here? | |
| 387 | +Because fine rank is now also a reranker, so it consumes the title suffix too. | |
| 388 | +The suffix is written onto the hit as `_style_rerank_suffix` after SKU selection. | |
| 389 | +The suffix is actually appended to the doc text in [rerank_client.py:65](/data/saas-search/search/rerank_client.py#L65) through [rerank_client.py:74](/data/saas-search/search/rerank_client.py#L74). | |
| 390 | + | |
| 391 | +So the order has to be: | |
| 392 | +- Coarse ranking first | |
| 393 | +- Then SKU selection | |
| 394 | +- Then fine/final rerank on the title with the suffix | |
| 395 | + | |
| 396 | +**Step 6: Fine ranking** | |
| 397 | +The entry point is at [searcher.py:711](/data/saas-search/search/searcher.py#L711); the implementation is `run_lightweight_rerank()` at [rerank_client.py:603](/data/saas-search/search/rerank_client.py#L603). | |
| 398 | + | |
| 399 | +It does three things: | |
| 400 | + | |
| 401 | +1. Uses `build_docs_from_hits()` to turn each product into reranker input text | |
| 402 | +2. Calls the lightweight service with `service_profile="fine"` | |
| 403 | +3. No longer sorts by `fine_score` alone, but by the fused `_fine_fused_score` | |
| 404 | + | |
| 405 | +The fine-rank fusion formula is now: | |
| 406 | +- `fine_stage_score = fine_factor * text_factor * knn_factor * style_boost` | |
| 407 | + | |
| 408 | +The shared computation lives in `_compute_multiplicative_fusion()` at [rerank_client.py:286](/data/saas-search/search/rerank_client.py#L286); each raw score is clamped to be non-negative before its bias is added: | |
| 409 | +- `fine_factor = (fine_score + fine_bias)^fine_exponent` | |
| 410 | +- `text_factor = (text_score + text_bias)^text_exponent` | |
| 411 | +- `knn_factor = (knn_score + knn_bias)^knn_exponent` | |
| 412 | +- If the hit has a selected SKU, multiply by the style boost as well | |
| 413 | + | |
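A worked example may make the multiplicative fusion concrete. The function below is a stripped-down standalone stand-in for the shared helper, not the helper itself; `fuse` is a hypothetical name, and every bias, exponent, and score value is made up for illustration.

```python
from typing import Dict, Optional, Tuple

# name -> (raw_score, bias, exponent); a raw_score of None drops the term.
Terms = Dict[str, Tuple[Optional[float], float, float]]


def fuse(terms: Terms, style_boost: float = 1.0) -> float:
    """Multiply (max(raw, 0) + bias) ** exponent over all present terms."""
    fused = 1.0
    for raw_score, bias, exponent in terms.values():
        if raw_score is None:
            continue  # e.g. no rerank_score exists yet at the fine stage
        fused *= (max(raw_score, 0.0) + bias) ** exponent
    return fused * style_boost


fine_stage_score = fuse(
    {
        "rerank_score": (None, 0.0, 1.0),  # skipped at the fine stage
        "fine_score": (0.5, 0.5, 2.0),     # (0.5 + 0.5)^2   = 1.0
        "text_score": (3.0, 1.0, 0.5),     # (3.0 + 1.0)^0.5 = 2.0
        "knn_score": (0.0, 1.0, 1.0),      # (0.0 + 1.0)^1   = 1.0
    },
    style_boost=1.2,  # illustrative boost for a hit with a selected SKU
)
```

Here the factors multiply to 1.0 * 2.0 * 1.0, and the style boost lifts the stage score to about 2.4. The same function covers the final stage by supplying a real `rerank_score` term.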
| 414 | +Fields written back onto the hit, see [rerank_client.py:655](/data/saas-search/search/rerank_client.py#L655): | |
| 415 | +- `_fine_score` | |
| 416 | +- `_fine_fused_score` | |
| 417 | +- `_text_score` | |
| 418 | +- `_knn_score` | |
| 419 | + | |
| 420 | +The sorting logic is at [rerank_client.py:683](/data/saas-search/search/rerank_client.py#L683): | |
| 421 | +Sort by `_fine_fused_score` in descending order, then keep the top 80; see [searcher.py:727](/data/saas-search/search/searcher.py#L727). | |
| 422 | + | |
| 423 | +This is the point you cared about most this time: fine rank no longer sorts by the raw model score, but by the model score fused with the ES text/KNN signals. | |
| 424 | + | |
| 425 | +**Step 7: Final rerank** | |
| 426 | +The entry point is at [searcher.py:767](/data/saas-search/search/searcher.py#L767); the implementation is `run_rerank()` at [rerank_client.py:538](/data/saas-search/search/rerank_client.py#L538). | |
| 427 | + | |
| 428 | +It closely resembles fine rank, but adds a score from a heavier model, `rerank_score`. | |
| 429 | +The final formula is: | |
| 430 | + | |
| 431 | +- `final_score = rerank_factor * fine_factor * text_factor * knn_factor * style_boost` | |
| 432 | + | |
| 433 | +That is: | |
| 434 | +- The `fine_score` produced by fine rank is not discarded | |
| 435 | +- At the final rerank, it continues to participate in the final fusion as a multiplicative term | |
| 436 | + | |
| 437 | +This logic is in [rerank_client.py:468](/data/saas-search/search/rerank_client.py#L468) through [rerank_client.py:476](/data/saas-search/search/rerank_client.py#L476). | |
| 438 | + | |
| 439 | +After scoring, these fields are written: | |
| 440 | +- `_rerank_score` | |
| 441 | +- `_fused_score` | |
| 442 | + | |
| 443 | +Hits are then sorted by `_fused_score`; see [rerank_client.py:531](/data/saas-search/search/rerank_client.py#L531). | |
| 444 | + | |
| 445 | +You can think of it this way: | |
| 446 | +- Fine rank does a fast, lightweight pass that shrinks 240 to 80 | |
| 447 | +- The final rerank uses a more expensive model to make the final call | |
| 448 | +- But the final call does not ignore the fine-rank result; it keeps the fine score as a prior signal | |
| 449 | + | |
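One consequence of carrying the fine score into the final stage: once fine rank reorders the hits, a score list captured in pre-sort order no longer lines up with them by index, which is why reading `_fine_score` off each hit is the safer channel. A small sketch with made-up ids and scores:

```python
hits = [
    {"_id": "a", "_fine_score": 0.7, "_fine_fused_score": 0.4},
    {"_id": "b", "_fine_score": 0.2, "_fine_fused_score": 0.9},
]
# Parallel list captured before sorting, aligned with the original hit order.
fine_scores = [hit["_fine_score"] for hit in hits]

# Fine rank sorts by the fused score, so "b" now comes first.
hits.sort(key=lambda h: h["_fine_fused_score"], reverse=True)

# Index-based lookup pairs each hit with the wrong score after the sort...
by_index = {hit["_id"]: fine_scores[i] for i, hit in enumerate(hits)}
# ...while the value stored on the hit itself travels with it.
by_hit = {hit["_id"]: hit["_fine_score"] for hit in hits}
```

After the sort, `by_index` assigns 0.7 to `b` and 0.2 to `a`, the reverse of the true fine scores held in `by_hit`.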
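| 450 | +**Step 8: Pagination and field backfill** | |
| 451 | +Multi-stage ranking is completed only within the head window. | |
| 452 | +Before results are actually returned to the user, two more things happen after [searcher.py:828](/data/saas-search/search/searcher.py#L828): | |
| 453 | + | |
| 454 | +- First slice the final 80 with `from_:from_+size` | |
| 455 | +- Then backfill the fields the page actually displays, according to the user's original `_source` request; see [searcher.py:859](/data/saas-search/search/searcher.py#L859) | |
| 456 | + | |
| 457 | +So this chain involves three data accesses, each with a different purpose: | |
| 458 | + | |
| 459 | +- First ES query: ranking signals only | |
| 460 | +- Second, backfill by id: only the fields fine/final rerank need | |
| 461 | +- Third, backfill by page ids: only the fields the final page displays | |
| 462 | + | |
| 463 | +This is why it performs more sensibly than pulling 700 complete documents in one shot. | |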
| 464 | + | |
| 465 | +**Step 9: Result formatting and debug funnel** | |
| 466 | +Finally, result processing begins at [searcher.py:906](/data/saas-search/search/searcher.py#L906). | |
| 467 | +Here each product's per-stage information is assembled into `ranking_funnel`; see [searcher.py:1068](/data/saas-search/search/searcher.py#L1068): | |
| 468 | + | |
| 469 | +- `es_recall` | |
| 470 | +- `coarse_rank` | |
| 471 | +- `fine_rank` | |
| 472 | +- `rerank` | |
| 473 | +- `final_page` | |
| 474 | + | |
| 475 | +Of these: | |
| 476 | +- The coarse stage mainly keeps the text/translation/knn signal breakdown | |
| 477 | +- The fine/rerank stages now both keep `fusion_inputs`, `fusion_factors`, and `fusion_summary` | |
| 478 | +- `fusion_summary` comes from the actual computation itself; see [rerank_client.py:265](/data/saas-search/search/rerank_client.py#L265) | |
| 479 | + | |
| 480 | +This matters because the actual ranking logic and the debug display logic now share one source, instead of being two separately written copies. | |
| 481 | + | |
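| 482 | +**One-sentence summary of this pipeline** | |
| 483 | +The essence of this pipeline is: | |
| 484 | + | |
| 485 | +- ES handles cheap, wide-range recall | |
| 486 | +- Coarse ranking does a first structured filter using only ES built-in signals | |
| 487 | +- Style SKU selection reshapes the product text into input a reranker can understand better | |
| 488 | +- Fine rank uses a light model to compress the candidates further | |
| 489 | +- Final rerank uses a heavy model to make the final decision | |
| 490 | +- Each layer reuses the previous layer's signals as much as possible instead of starting over | |
| 491 | + | |
| 492 | +If you like, next I can walk through a real end-to-end example for one concrete query: say the user searches `black dress`, and I derive by hand how `parsed_query`, the ES named queries, and each coarse/fine/final score come out. | |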
| 493 | + | |
| 494 | + | |
| 263 | 495 | |
| 264 | 496 | ## Reranker notes: |
| 265 | 497 | A severe bad case with BAAI/bge-reranker-v2-m3: | ... | ... |
frontend/static/js/app.js
| ... | ... | @@ -546,22 +546,25 @@ function buildProductDebugHtml({ debug, result, spuId, tenantId }) { |
| 546 | 546 | ${buildStageCard('Fine Rank', 'Lightweight reranker output', [ |
| 547 | 547 | { label: 'rank', value: fineStage.rank ?? 'N/A' }, |
| 548 | 548 | { label: 'rank_change', value: fineStage.rank_change ?? 'N/A' }, |
| 549 | - { label: 'fine_score', value: formatDebugNumber(fineStage.score ?? debug.fine_score) }, | |
| 550 | - ], renderJsonDetails('Fine Input', fineStage.rerank_input ?? debug.rerank_input, false))} | |
| 549 | + { label: 'stage_score', value: formatDebugNumber(fineStage.score ?? debug.score) }, | |
| 550 | + { label: 'fine_score', value: formatDebugNumber(fineStage.fine_score ?? debug.fine_score) }, | |
| 551 | + { label: 'text_score', value: formatDebugNumber(fineStage.text_score ?? debug.text_score) }, | |
| 552 | + { label: 'knn_score', value: formatDebugNumber(fineStage.knn_score ?? debug.knn_score) }, | |
| 553 | + ], `${renderJsonDetails('Fine Fusion', fineStage.fusion_summary || debug.fusion_summary || fineStage.fusion_factors, false)}${renderJsonDetails('Fine Input', fineStage.rerank_input ?? debug.rerank_input, false)}`)} | |
| 551 | 554 | ${buildStageCard('Final Rerank', 'Heavy reranker + final fusion', [ |
| 552 | 555 | { label: 'rank', value: rerankStage.rank ?? finalPageStage.rank ?? debug.final_rank ?? 'N/A' }, |
| 553 | 556 | { label: 'rank_change', value: rerankStage.rank_change ?? finalPageStage.rank_change ?? 'N/A' }, |
| 557 | + { label: 'stage_score', value: formatDebugNumber(rerankStage.score ?? rerankStage.fused_score ?? debug.score) }, | |
| 554 | 558 | { label: 'rerank_score', value: formatDebugNumber(rerankStage.rerank_score ?? debug.rerank_score) }, |
| 559 | + { label: 'fine_score', value: formatDebugNumber(rerankStage.fine_score ?? debug.fine_score) }, | |
| 555 | 560 | { label: 'text_score', value: formatDebugNumber(rerankStage.text_score ?? debug.text_score) }, |
| 556 | 561 | { label: 'knn_score', value: formatDebugNumber(rerankStage.knn_score ?? debug.knn_score) }, |
| 557 | - { label: 'text_source', value: formatDebugNumber(rerankStage.signals?.text_source_score ?? debug.text_source_score) }, | |
| 558 | - { label: 'text_translation', value: formatDebugNumber(rerankStage.signals?.text_translation_score ?? debug.text_translation_score) }, | |
| 559 | 562 | { label: 'fine_factor', value: formatDebugNumber(rerankStage.fine_factor ?? debug.fine_factor) }, |
| 560 | 563 | { label: 'rerank_factor', value: formatDebugNumber(rerankStage.rerank_factor ?? debug.rerank_factor) }, |
| 561 | 564 | { label: 'text_factor', value: formatDebugNumber(rerankStage.text_factor ?? debug.text_factor) }, |
| 562 | 565 | { label: 'knn_factor', value: formatDebugNumber(rerankStage.knn_factor ?? debug.knn_factor) }, |
| 563 | 566 | { label: 'fused_score', value: formatDebugNumber(rerankStage.fused_score ?? debug.fused_score) }, |
| 564 | - ], renderJsonDetails('Rerank Signals', rerankStage.signals, false))} | |
| 567 | + ], `${renderJsonDetails('Final Fusion', rerankStage.fusion_summary || debug.fusion_summary || rerankStage.fusion_factors, false)}${renderJsonDetails('Rerank Signals', rerankStage.signals, false)}`)} | |
| 565 | 568 | </div> |
| 566 | 569 | `; |
| 567 | 570 | ... | ... |
search/rerank_client.py
| ... | ... | @@ -239,22 +239,96 @@ def _collect_text_score_components(matched_queries: Any, fallback_es_score: floa |
| 239 | 239 | } |
| 240 | 240 | |
| 241 | 241 | |
| 242 | -def _multiply_fusion_factors( | |
| 243 | - rerank_score: float, | |
| 244 | - fine_score: Optional[float], | |
| 242 | +def _format_debug_float(value: float) -> str: | |
| 243 | + return f"{float(value):.6g}" | |
| 244 | + | |
| 245 | + | |
| 246 | +def _build_hit_signal_bundle( | |
| 247 | + hit: Dict[str, Any], | |
| 248 | + fusion: CoarseRankFusionConfig | RerankFusionConfig, | |
| 249 | +) -> Dict[str, Any]: | |
| 250 | + es_score = _to_score(hit.get("_score")) | |
| 251 | + matched_queries = hit.get("matched_queries") | |
| 252 | + text_components = _collect_text_score_components(matched_queries, es_score) | |
| 253 | + knn_components = _collect_knn_score_components(matched_queries, fusion) | |
| 254 | + return { | |
| 255 | + "doc_id": hit.get("_id"), | |
| 256 | + "es_score": es_score, | |
| 257 | + "matched_queries": matched_queries, | |
| 258 | + "text_components": text_components, | |
| 259 | + "knn_components": knn_components, | |
| 260 | + "text_score": text_components["text_score"], | |
| 261 | + "knn_score": knn_components["knn_score"], | |
| 262 | + } | |
| 263 | + | |
| 264 | + | |
| 265 | +def _build_formula_summary( | |
| 266 | + term_rows: List[Dict[str, Any]], | |
| 267 | + style_boost: float, | |
| 268 | + final_score: float, | |
| 269 | +) -> str: | |
| 270 | + segments = [ | |
| 271 | + ( | |
| 272 | + f"{row['name']}=(" | |
| 273 | + f"{_format_debug_float(row['raw_score'])}" | |
| 274 | + f"+{_format_debug_float(row['bias'])})" | |
| 275 | + f"^{_format_debug_float(row['exponent'])}" | |
| 276 | + f"={_format_debug_float(row['factor'])}" | |
| 277 | + ) | |
| 278 | + for row in term_rows | |
| 279 | + ] | |
| 280 | + if style_boost != 1.0: | |
| 281 | + segments.append(f"style_boost={_format_debug_float(style_boost)}") | |
| 282 | + segments.append(f"final={_format_debug_float(final_score)}") | |
| 283 | + return " | ".join(segments) | |
| 284 | + | |
| 285 | + | |
| 286 | +def _compute_multiplicative_fusion( | |
| 287 | + *, | |
| 245 | 288 | text_score: float, |
| 246 | 289 | knn_score: float, |
| 247 | 290 | fusion: RerankFusionConfig, |
| 248 | -) -> Tuple[float, float, float, float, float]: | |
| 249 | - """(rerank_factor, fine_factor, text_factor, knn_factor, fused_without_style_boost).""" | |
| 250 | - r = (max(rerank_score, 0.0) + fusion.rerank_bias) ** fusion.rerank_exponent | |
| 251 | - if fine_score is None: | |
| 252 | - f = 1.0 | |
| 253 | - else: | |
| 254 | - f = (max(fine_score, 0.0) + fusion.fine_bias) ** fusion.fine_exponent | |
| 255 | - t = (max(text_score, 0.0) + fusion.text_bias) ** fusion.text_exponent | |
| 256 | - k = (max(knn_score, 0.0) + fusion.knn_bias) ** fusion.knn_exponent | |
| 257 | - return r, f, t, k, r * f * t * k | |
| 291 | + rerank_score: Optional[float] = None, | |
| 292 | + fine_score: Optional[float] = None, | |
| 293 | + style_boost: float = 1.0, | |
| 294 | +) -> Dict[str, Any]: | |
| 295 | + term_rows: List[Dict[str, Any]] = [] | |
| 296 | + | |
| 297 | + def _add_term(name: str, raw_score: Optional[float], bias: float, exponent: float) -> None: | |
| 298 | + if raw_score is None: | |
| 299 | + return | |
| 300 | + factor = (max(float(raw_score), 0.0) + bias) ** exponent | |
| 301 | + term_rows.append( | |
| 302 | + { | |
| 303 | + "name": name, | |
| 304 | + "raw_score": float(raw_score), | |
| 305 | + "bias": float(bias), | |
| 306 | + "exponent": float(exponent), | |
| 307 | + "factor": factor, | |
| 308 | + } | |
| 309 | + ) | |
| 310 | + | |
| 311 | + _add_term("rerank_score", rerank_score, fusion.rerank_bias, fusion.rerank_exponent) | |
| 312 | + _add_term("fine_score", fine_score, fusion.fine_bias, fusion.fine_exponent) | |
| 313 | + _add_term("text_score", text_score, fusion.text_bias, fusion.text_exponent) | |
| 314 | + _add_term("knn_score", knn_score, fusion.knn_bias, fusion.knn_exponent) | |
| 315 | + | |
| 316 | + fused = 1.0 | |
| 317 | + factors: Dict[str, float] = {} | |
| 318 | + inputs: Dict[str, float] = {} | |
| 319 | + for row in term_rows: | |
| 320 | + fused *= row["factor"] | |
| 321 | + factors[row["name"]] = row["factor"] | |
| 322 | + inputs[row["name"]] = row["raw_score"] | |
| 323 | + fused *= style_boost | |
| 324 | + factors["style_boost"] = style_boost | |
| 325 | + | |
| 326 | + return { | |
| 327 | + "inputs": inputs, | |
| 328 | + "factors": factors, | |
| 329 | + "score": fused, | |
| 330 | + "summary": _build_formula_summary(term_rows, style_boost, fused), | |
| 331 | + } | |
| 258 | 332 | |
| 259 | 333 | |
| 260 | 334 | def _multiply_coarse_fusion_factors( |
| ... | ... | @@ -283,12 +357,13 @@ def coarse_resort_hits( |
| 283 | 357 | f = fusion or CoarseRankFusionConfig() |
| 284 | 358 | coarse_debug: List[Dict[str, Any]] = [] if debug else [] |
| 285 | 359 | for hit in es_hits: |
| 286 | - es_score = _to_score(hit.get("_score")) | |
| 287 | - matched_queries = hit.get("matched_queries") | |
| 288 | - knn_components = _collect_knn_score_components(matched_queries, f) | |
| 289 | - text_components = _collect_text_score_components(matched_queries, es_score) | |
| 290 | - text_score = text_components["text_score"] | |
| 291 | - knn_score = knn_components["knn_score"] | |
| 360 | + signal_bundle = _build_hit_signal_bundle(hit, f) | |
| 361 | + es_score = signal_bundle["es_score"] | |
| 362 | + matched_queries = signal_bundle["matched_queries"] | |
| 363 | + text_components = signal_bundle["text_components"] | |
| 364 | + knn_components = signal_bundle["knn_components"] | |
| 365 | + text_score = signal_bundle["text_score"] | |
| 366 | + knn_score = signal_bundle["knn_score"] | |
| 292 | 367 | text_factor, knn_factor, coarse_score = _multiply_coarse_fusion_factors( |
| 293 | 368 | text_score=text_score, |
| 294 | 369 | knn_score=knn_score, |
| ... | ... | @@ -372,77 +447,81 @@ def fuse_scores_and_resort( |
| 372 | 447 | n = len(es_hits) |
| 373 | 448 | if n == 0 or len(rerank_scores) != n: |
| 374 | 449 | return [] |
| 375 | - if fine_scores is not None and len(fine_scores) != n: | |
| 376 | - fine_scores = None | |
| 377 | - | |
| 378 | 450 | f = fusion or RerankFusionConfig() |
| 379 | 451 | fused_debug: List[Dict[str, Any]] = [] if debug else [] |
| 380 | 452 | |
| 381 | 453 | for idx, hit in enumerate(es_hits): |
| 382 | - es_score = _to_score(hit.get("_score")) | |
| 454 | + signal_bundle = _build_hit_signal_bundle(hit, f) | |
| 455 | + text_components = signal_bundle["text_components"] | |
| 456 | + knn_components = signal_bundle["knn_components"] | |
| 457 | + text_score = signal_bundle["text_score"] | |
| 458 | + knn_score = signal_bundle["knn_score"] | |
| 383 | 459 | rerank_score = _to_score(rerank_scores[idx]) |
| 384 | - fine_score = _to_score(fine_scores[idx]) if fine_scores is not None else _to_score(hit.get("_fine_score")) | |
| 385 | - matched_queries = hit.get("matched_queries") | |
| 386 | - knn_components = _collect_knn_score_components(matched_queries, f) | |
| 387 | - knn_score = knn_components["knn_score"] | |
| 388 | - text_components = _collect_text_score_components(matched_queries, es_score) | |
| 389 | - text_score = text_components["text_score"] | |
| 390 | - rerank_factor, fine_factor, text_factor, knn_factor, fused = _multiply_fusion_factors( | |
| 391 | - rerank_score, fine_score if fine_scores is not None or "_fine_score" in hit else None, text_score, knn_score, f | |
| 460 | + fine_score_raw = ( | |
| 461 | + _to_score(fine_scores[idx]) | |
| 462 | + if fine_scores is not None and len(fine_scores) == n | |
| 463 | + else _to_score(hit.get("_fine_score")) | |
| 392 | 464 | ) |
| 465 | + fine_score = fine_score_raw if (fine_scores is not None and len(fine_scores) == n) or "_fine_score" in hit else None | |
| 393 | 466 | sku_selected = _has_selected_sku(hit) |
| 394 | 467 | style_boost = style_intent_selected_sku_boost if sku_selected else 1.0 |
| 395 | - fused *= style_boost | |
| 468 | + fusion_result = _compute_multiplicative_fusion( | |
| 469 | + rerank_score=rerank_score, | |
| 470 | + fine_score=fine_score, | |
| 471 | + text_score=text_score, | |
| 472 | + knn_score=knn_score, | |
| 473 | + fusion=f, | |
| 474 | + style_boost=style_boost, | |
| 475 | + ) | |
| 476 | + fused = fusion_result["score"] | |
| 396 | 477 | |
| 397 | 478 | hit["_original_score"] = hit.get("_score") |
| 398 | 479 | hit["_rerank_score"] = rerank_score |
| 399 | - hit["_fine_score"] = fine_score | |
| 480 | + if fine_score is not None: | |
| 481 | + hit["_fine_score"] = fine_score | |
| 400 | 482 | hit["_text_score"] = text_score |
| 401 | 483 | hit["_knn_score"] = knn_score |
| 402 | 484 | hit["_text_knn_score"] = knn_components["text_knn_score"] |
| 403 | 485 | hit["_image_knn_score"] = knn_components["image_knn_score"] |
| 404 | 486 | hit["_fused_score"] = fused |
| 405 | 487 | hit["_style_intent_selected_sku_boost"] = style_boost |
| 406 | - if debug: | |
| 407 | - hit["_text_source_score"] = text_components["source_score"] | |
| 408 | - hit["_text_translation_score"] = text_components["translation_score"] | |
| 409 | - hit["_text_primary_score"] = text_components["primary_text_score"] | |
| 410 | - hit["_text_support_score"] = text_components["support_text_score"] | |
| 411 | - hit["_knn_primary_score"] = knn_components["primary_knn_score"] | |
| 412 | - hit["_knn_support_score"] = knn_components["support_knn_score"] | |
| 413 | 488 | |
| 414 | 489 | if debug: |
| 415 | 490 | debug_entry = { |
| 416 | 491 | "doc_id": hit.get("_id"), |
| 417 | - "es_score": es_score, | |
| 492 | + "score": fused, | |
| 493 | + "es_score": signal_bundle["es_score"], | |
| 418 | 494 | "rerank_score": rerank_score, |
| 419 | 495 | "fine_score": fine_score, |
| 420 | 496 | "text_score": text_score, |
| 497 | + "knn_score": knn_score, | |
| 498 | + "fusion_inputs": fusion_result["inputs"], | |
| 499 | + "fusion_factors": fusion_result["factors"], | |
| 500 | + "fusion_summary": fusion_result["summary"], | |
| 421 | 501 | "text_source_score": text_components["source_score"], |
| 422 | 502 | "text_translation_score": text_components["translation_score"], |
| 423 | 503 | "text_weighted_source_score": text_components["weighted_source_score"], |
| 424 | 504 | "text_weighted_translation_score": text_components["weighted_translation_score"], |
| 425 | 505 | "text_primary_score": text_components["primary_text_score"], |
| 426 | 506 | "text_support_score": text_components["support_text_score"], |
| 427 | - "text_score_fallback_to_es": ( | |
| 428 | - text_score == es_score | |
| 429 | - and text_components["source_score"] <= 0.0 | |
| 430 | - and text_components["translation_score"] <= 0.0 | |
| 431 | - ), | |
| 432 | 507 | "text_knn_score": knn_components["text_knn_score"], |
| 433 | 508 | "image_knn_score": knn_components["image_knn_score"], |
| 434 | 509 | "weighted_text_knn_score": knn_components["weighted_text_knn_score"], |
| 435 | 510 | "weighted_image_knn_score": knn_components["weighted_image_knn_score"], |
| 436 | 511 | "knn_primary_score": knn_components["primary_knn_score"], |
| 437 | 512 | "knn_support_score": knn_components["support_knn_score"], |
| 438 | - "knn_score": knn_score, | |
| 439 | - "rerank_factor": rerank_factor, | |
| 440 | - "fine_factor": fine_factor, | |
| 441 | - "text_factor": text_factor, | |
| 442 | - "knn_factor": knn_factor, | |
| 513 | + "text_score_fallback_to_es": ( | |
| 514 | + text_score == signal_bundle["es_score"] | |
| 515 | + and text_components["source_score"] <= 0.0 | |
| 516 | + and text_components["translation_score"] <= 0.0 | |
| 517 | + ), | |
| 518 | + "rerank_factor": fusion_result["factors"].get("rerank_score"), | |
| 519 | + "fine_factor": fusion_result["factors"].get("fine_score"), | |
| 520 | + "text_factor": fusion_result["factors"].get("text_score"), | |
| 521 | + "knn_factor": fusion_result["factors"].get("knn_score"), | |
| 443 | 522 | "style_intent_selected_sku": sku_selected, |
| 444 | 523 | "style_intent_selected_sku_boost": style_boost, |
| 445 | - "matched_queries": matched_queries, | |
| 524 | + "matched_queries": signal_bundle["matched_queries"], | |
| 446 | 525 | "fused_score": fused, |
| 447 | 526 | } |
| 448 | 527 | if rerank_debug_rows is not None and idx < len(rerank_debug_rows): |
| ... | ... | @@ -530,9 +609,11 @@ def run_lightweight_rerank( |
| 530 | 609 | rerank_doc_template: str = "{title}", |
| 531 | 610 | top_n: Optional[int] = None, |
| 532 | 611 | debug: bool = False, |
| 612 | + fusion: Optional[RerankFusionConfig] = None, | |
| 613 | + style_intent_selected_sku_boost: float = 1.2, | |
| 533 | 614 | service_profile: Optional[str] = "fine", |
| 534 | 615 | ) -> Tuple[Optional[List[float]], Optional[Dict[str, Any]], List[Dict[str, Any]]]: |
| 535 | - """Call lightweight reranker and attach scores to hits without final fusion.""" | |
| 616 | + """Call lightweight reranker and rank by lightweight-model fusion.""" | |
| 536 | 617 | if not es_hits: |
| 537 | 618 | return [], {}, [] |
| 538 | 619 | |
| ... | ... | @@ -554,18 +635,50 @@ def run_lightweight_rerank( |
| 554 | 635 | if scores is None or len(scores) != len(es_hits): |
| 555 | 636 | return None, None, [] |
| 556 | 637 | |
| 638 | + f = fusion or RerankFusionConfig() | |
| 557 | 639 | debug_rows: List[Dict[str, Any]] = [] if debug else [] |
| 558 | 640 | for idx, hit in enumerate(es_hits): |
| 641 | + signal_bundle = _build_hit_signal_bundle(hit, f) | |
| 642 | + text_score = signal_bundle["text_score"] | |
| 643 | + knn_score = signal_bundle["knn_score"] | |
| 559 | 644 | fine_score = _to_score(scores[idx]) |
| 645 | + sku_selected = _has_selected_sku(hit) | |
| 646 | + style_boost = style_intent_selected_sku_boost if sku_selected else 1.0 | |
| 647 | + fusion_result = _compute_multiplicative_fusion( | |
| 648 | + fine_score=fine_score, | |
| 649 | + text_score=text_score, | |
| 650 | + knn_score=knn_score, | |
| 651 | + fusion=f, | |
| 652 | + style_boost=style_boost, | |
| 653 | + ) | |
| 654 | + | |
| 560 | 655 | hit["_fine_score"] = fine_score |
| 656 | + hit["_fine_fused_score"] = fusion_result["score"] | |
| 657 | + hit["_text_score"] = text_score | |
| 658 | + hit["_knn_score"] = knn_score | |
| 659 | + hit["_text_knn_score"] = signal_bundle["knn_components"]["text_knn_score"] | |
| 660 | + hit["_image_knn_score"] = signal_bundle["knn_components"]["image_knn_score"] | |
| 661 | + hit["_style_intent_selected_sku_boost"] = style_boost | |
| 662 | + | |
| 561 | 663 | if debug: |
| 562 | 664 | row: Dict[str, Any] = { |
| 563 | 665 | "doc_id": hit.get("_id"), |
| 666 | + "score": fusion_result["score"], | |
| 564 | 667 | "fine_score": fine_score, |
| 668 | + "text_score": text_score, | |
| 669 | + "knn_score": knn_score, | |
| 670 | + "fusion_inputs": fusion_result["inputs"], | |
| 671 | + "fusion_factors": fusion_result["factors"], | |
| 672 | + "fusion_summary": fusion_result["summary"], | |
| 673 | + "fine_factor": fusion_result["factors"].get("fine_score"), | |
| 674 | + "text_factor": fusion_result["factors"].get("text_score"), | |
| 675 | + "knn_factor": fusion_result["factors"].get("knn_score"), | |
| 676 | + "style_intent_selected_sku": sku_selected, | |
| 677 | + "style_intent_selected_sku_boost": style_boost, | |
| 565 | 678 | } |
| 566 | 679 | if rerank_debug_rows is not None and idx < len(rerank_debug_rows): |
| 567 | 680 | row["rerank_input"] = rerank_debug_rows[idx] |
| 568 | 681 | debug_rows.append(row) |
| 569 | 682 | |
| 570 | - es_hits.sort(key=lambda h: h.get("_fine_score", 0.0), reverse=True) | |
| 683 | + es_hits.sort(key=lambda h: h.get("_fine_fused_score", h.get("_fine_score", 0.0)), reverse=True) | |
| 571 | 684 | return scores, meta, debug_rows | ... | ... |
search/searcher.py
| ... | ... | @@ -720,6 +720,8 @@ class Searcher: |
| 720 | 720 | rerank_doc_template=fine_doc_template, |
| 721 | 721 | top_n=fine_output_window, |
| 722 | 722 | debug=debug, |
| 723 | + fusion=rc.fusion, | |
| 724 | + style_intent_selected_sku_boost=self.config.query_config.style_intent_selected_sku_boost, | |
| 723 | 725 | service_profile=fine_cfg.service_profile, |
| 724 | 726 | ) |
| 725 | 727 | if fine_scores is not None: |
| ... | ... | @@ -745,6 +747,7 @@ class Searcher: |
| 745 | 747 | "docs_out": len(hits), |
| 746 | 748 | "top_n": fine_output_window, |
| 747 | 749 | "meta": fine_meta, |
| 750 | + "fusion": asdict(rc.fusion), | |
| 748 | 751 | } |
| 749 | 752 | context.store_intermediate_result("fine_rank_scores", fine_debug_rows) |
| 750 | 753 | context.logger.info( |
| ... | ... | @@ -781,7 +784,6 @@ class Searcher: |
| 781 | 784 | top_n=(from_ + size), |
| 782 | 785 | debug=debug, |
| 783 | 786 | fusion=rc.fusion, |
| 784 | - fine_scores=fine_scores[:len(final_input)] if fine_scores is not None else None, | |
| 785 | 787 | service_profile=rc.service_profile, |
| 786 | 788 | style_intent_selected_sku_boost=self.config.query_config.style_intent_selected_sku_boost, |
| 787 | 789 | ) |
| ... | ... | @@ -1026,18 +1028,14 @@ class Searcher: |
| 1026 | 1028 | # If rerank debug info exists, add the doc-level fusion score info |
| 1027 | 1029 | if rerank_debug: |
| 1028 | 1030 | debug_entry["doc_id"] = rerank_debug.get("doc_id") |
| 1029 | - # Keep fields consistent with rerank_client so the frontend can use them directly | |
| 1031 | + debug_entry["score"] = rerank_debug.get("score") | |
| 1030 | 1032 | debug_entry["rerank_score"] = rerank_debug.get("rerank_score") |
| 1031 | 1033 | debug_entry["fine_score"] = rerank_debug.get("fine_score") |
| 1032 | 1034 | debug_entry["text_score"] = rerank_debug.get("text_score") |
| 1033 | - debug_entry["text_source_score"] = rerank_debug.get("text_source_score") | |
| 1034 | - debug_entry["text_translation_score"] = rerank_debug.get("text_translation_score") | |
| 1035 | - debug_entry["text_weighted_source_score"] = rerank_debug.get("text_weighted_source_score") | |
| 1036 | - debug_entry["text_weighted_translation_score"] = rerank_debug.get("text_weighted_translation_score") | |
| 1037 | - debug_entry["text_primary_score"] = rerank_debug.get("text_primary_score") | |
| 1038 | - debug_entry["text_support_score"] = rerank_debug.get("text_support_score") | |
| 1039 | - debug_entry["text_score_fallback_to_es"] = rerank_debug.get("text_score_fallback_to_es") | |
| 1040 | 1035 | debug_entry["knn_score"] = rerank_debug.get("knn_score") |
| 1036 | + debug_entry["fusion_inputs"] = rerank_debug.get("fusion_inputs") | |
| 1037 | + debug_entry["fusion_factors"] = rerank_debug.get("fusion_factors") | |
| 1038 | + debug_entry["fusion_summary"] = rerank_debug.get("fusion_summary") | |
| 1041 | 1039 | debug_entry["rerank_factor"] = rerank_debug.get("rerank_factor") |
| 1042 | 1040 | debug_entry["fine_factor"] = rerank_debug.get("fine_factor") |
| 1043 | 1041 | debug_entry["text_factor"] = rerank_debug.get("text_factor") |
| ... | ... | @@ -1047,7 +1045,13 @@ class Searcher: |
| 1047 | 1045 | debug_entry["matched_queries"] = rerank_debug.get("matched_queries") |
| 1048 | 1046 | elif fine_debug: |
| 1049 | 1047 | debug_entry["doc_id"] = fine_debug.get("doc_id") |
| 1048 | + debug_entry["score"] = fine_debug.get("score") | |
| 1050 | 1049 | debug_entry["fine_score"] = fine_debug.get("fine_score") |
| 1050 | + debug_entry["text_score"] = fine_debug.get("text_score") | |
| 1051 | + debug_entry["knn_score"] = fine_debug.get("knn_score") | |
| 1052 | + debug_entry["fusion_inputs"] = fine_debug.get("fusion_inputs") | |
| 1053 | + debug_entry["fusion_factors"] = fine_debug.get("fusion_factors") | |
| 1054 | + debug_entry["fusion_summary"] = fine_debug.get("fusion_summary") | |
| 1051 | 1055 | debug_entry["rerank_input"] = fine_debug.get("rerank_input") |
| 1052 | 1056 | |
| 1053 | 1057 | initial_rank = initial_ranks_by_doc.get(str(doc_id)) if doc_id is not None else None |
| ... | ... | @@ -1081,17 +1085,32 @@ class Searcher: |
| 1081 | 1085 | "fine_rank": { |
| 1082 | 1086 | "rank": fine_rank, |
| 1083 | 1087 | "rank_change": _rank_change(coarse_rank, fine_rank), |
| 1084 | - "score": fine_debug.get("fine_score") if fine_debug else hit.get("_fine_score"), | |
| 1088 | + "score": ( | |
| 1089 | + fine_debug.get("score") | |
| 1090 | + if fine_debug and fine_debug.get("score") is not None | |
| 1091 | + else hit.get("_fine_fused_score", hit.get("_fine_score")) | |
| 1092 | + ), | |
| 1093 | + "fine_score": fine_debug.get("fine_score") if fine_debug else hit.get("_fine_score"), | |
| 1094 | + "text_score": fine_debug.get("text_score") if fine_debug else hit.get("_text_score"), | |
| 1095 | + "knn_score": fine_debug.get("knn_score") if fine_debug else hit.get("_knn_score"), | |
| 1096 | + "fusion_summary": fine_debug.get("fusion_summary") if fine_debug else None, | |
| 1097 | + "fusion_inputs": fine_debug.get("fusion_inputs") if fine_debug else None, | |
| 1098 | + "fusion_factors": fine_debug.get("fusion_factors") if fine_debug else None, | |
| 1085 | 1099 | "rerank_input": fine_debug.get("rerank_input") if fine_debug else None, |
| 1100 | + "signals": fine_debug, | |
| 1086 | 1101 | }, |
| 1087 | 1102 | "rerank": { |
| 1088 | 1103 | "rank": rerank_rank, |
| 1089 | 1104 | "rank_change": _rank_change(fine_rank, rerank_rank), |
| 1105 | + "score": rerank_debug.get("score") if rerank_debug else hit.get("_fused_score"), | |
| 1090 | 1106 | "rerank_score": rerank_debug.get("rerank_score") if rerank_debug else hit.get("_rerank_score"), |
| 1091 | 1107 | "fine_score": rerank_debug.get("fine_score") if rerank_debug else hit.get("_fine_score"), |
| 1092 | 1108 | "fused_score": rerank_debug.get("fused_score") if rerank_debug else hit.get("_fused_score"), |
| 1093 | 1109 | "text_score": rerank_debug.get("text_score") if rerank_debug else hit.get("_text_score"), |
| 1094 | 1110 | "knn_score": rerank_debug.get("knn_score") if rerank_debug else hit.get("_knn_score"), |
| 1111 | + "fusion_summary": rerank_debug.get("fusion_summary") if rerank_debug else None, | |
| 1112 | + "fusion_inputs": rerank_debug.get("fusion_inputs") if rerank_debug else None, | |
| 1113 | + "fusion_factors": rerank_debug.get("fusion_factors") if rerank_debug else None, | |
| 1095 | 1114 | "rerank_factor": rerank_debug.get("rerank_factor") if rerank_debug else None, |
| 1096 | 1115 | "fine_factor": rerank_debug.get("fine_factor") if rerank_debug else None, |
| 1097 | 1116 | "text_factor": rerank_debug.get("text_factor") if rerank_debug else None, | ... | ... |
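
The `fine_rank` `score` entry in the hunk above prefers the fused stage score recorded in the debug row and falls back to hit-level fields only when that value is missing. Below is a minimal standalone sketch of that lookup order. `resolve_fine_stage_score` is a hypothetical helper name that does not appear in the commit, and the dict shapes are assumed from the diff.

```python
from typing import Any, Dict, Optional


def resolve_fine_stage_score(
    fine_debug: Optional[Dict[str, Any]], hit: Dict[str, Any]
) -> Optional[float]:
    """Hypothetical helper mirroring the lookup order shown in the diff."""
    # 1. Prefer the fused stage score recorded in the debug row.
    #    The explicit `is not None` check keeps a legitimate 0.0 score.
    if fine_debug and fine_debug.get("score") is not None:
        return fine_debug["score"]
    # 2. Otherwise use the fused score attached to the hit,
    #    and only then the raw fine score.
    return hit.get("_fine_fused_score", hit.get("_fine_score"))
```

Note that the check is `is not None` rather than a truthiness test: a plain `fine_debug.get("score") or ...` would silently replace a fused score of `0.0` with the fallback.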
tests/test_rerank_client.py
| 1 | 1 | from math import isclose |
| 2 | 2 | |
| 3 | 3 | from config.schema import RerankFusionConfig |
| 4 | -from search.rerank_client import fuse_scores_and_resort | |
| 4 | +from search.rerank_client import fuse_scores_and_resort, run_lightweight_rerank | |
| 5 | 5 | |
| 6 | 6 | |
| 7 | 7 | def test_fuse_scores_and_resort_aggregates_text_components_and_keeps_rerank_primary(): |
| ... | ... | @@ -204,3 +204,57 @@ def test_fuse_scores_and_resort_applies_knn_dismax_weights_and_tie_breaker(): |
| 204 | 204 | assert isclose(debug[0]["weighted_image_knn_score"], 0.5, rel_tol=1e-9) |
| 205 | 205 | assert isclose(debug[0]["knn_primary_score"], 0.8, rel_tol=1e-9) |
| 206 | 206 | assert isclose(debug[0]["knn_support_score"], 0.5, rel_tol=1e-9) |
| 207 | + | |
| 208 | + | |
| 209 | +def test_run_lightweight_rerank_sorts_by_fused_stage_score(monkeypatch): | |
| 210 | + hits = [ | |
| 211 | + { | |
| 212 | + "_id": "fine-raw-better", | |
| 213 | + "_score": 1.0, | |
| 214 | + "_source": {"title": {"en": "Alpha"}}, | |
| 215 | + "matched_queries": {"base_query": 0.5, "knn_query": 0.0}, | |
| 216 | + }, | |
| 217 | + { | |
| 218 | + "_id": "fusion-better", | |
| 219 | + "_score": 1.0, | |
| 220 | + "_source": {"title": {"en": "Beta"}}, | |
| 221 | + "matched_queries": {"base_query": 40.0, "knn_query": 0.0}, | |
| 222 | + }, | |
| 223 | + ] | |
| 224 | + | |
| 225 | + monkeypatch.setattr( | |
| 226 | + "search.rerank_client.call_rerank_service", | |
| 227 | + lambda *args, **kwargs: ([0.9, 0.8], {"model": "fine-bge"}), | |
| 228 | + ) | |
| 229 | + | |
| 230 | + scores, meta, debug_rows = run_lightweight_rerank( | |
| 231 | + query="toy", | |
| 232 | + es_hits=hits, | |
| 233 | + language="en", | |
| 234 | + debug=True, | |
| 235 | + ) | |
| 236 | + | |
| 237 | + assert scores == [0.9, 0.8] | |
| 238 | + assert meta == {"model": "fine-bge"} | |
| 239 | + assert [hit["_id"] for hit in hits] == ["fusion-better", "fine-raw-better"] | |
| 240 | + assert hits[0]["_fine_fused_score"] > hits[1]["_fine_fused_score"] | |
| 241 | + assert debug_rows[0]["fusion_summary"] | |
| 242 | + assert "fine_score=" in debug_rows[0]["fusion_summary"] | |
| 243 | + assert "text_score=" in debug_rows[0]["fusion_summary"] | |
| 244 | + | |
| 245 | + | |
| 246 | +def test_fuse_scores_and_resort_uses_hit_level_fine_score_when_not_passed_separately(): | |
| 247 | + hits = [ | |
| 248 | + { | |
| 249 | + "_id": "with-fine", | |
| 250 | + "_score": 1.0, | |
| 251 | + "_fine_score": 0.7, | |
| 252 | + "matched_queries": {"base_query": 2.0, "knn_query": 0.5}, | |
| 253 | + } | |
| 254 | + ] | |
| 255 | + | |
| 256 | + debug = fuse_scores_and_resort(hits, [0.8], debug=True) | |
| 257 | + | |
| 258 | + assert isclose(debug[0]["fine_factor"], (0.7 + 0.00001), rel_tol=1e-9) | |
| 259 | + assert debug[0]["fusion_inputs"]["fine_score"] == 0.7 | |
| 260 | + assert "fine_score=" in debug[0]["fusion_summary"] | ... | ... |
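
The last test asserts that `fine_factor` equals `0.7 + 0.00001` when the fine score is read from the hit itself. That is consistent with a factor of the form `(score + bias) ** exponent`, though the actual formula lives in `rerank_client.py` and is not shown in this diff. The sketch below only illustrates why keeping the score on the hit avoids the out-of-sync risk of a parallel array. The function names, the exponent of `1.0`, and the treatment of a missing score as `0.0` are all assumptions.

```python
from typing import Any, Dict, List

# Assumed values: the bias matches the 0.00001 seen in the test above;
# the exponent is a guess chosen so that the factor equals score + bias.
FINE_BIAS = 1e-5
FINE_EXPONENT = 1.0


def fine_factor_from_hit(hit: Dict[str, Any]) -> float:
    """Hypothetical: derive the fine factor from the score stored on the hit."""
    fine_score = float(hit.get("_fine_score") or 0.0)  # missing score -> 0.0 (assumption)
    return (fine_score + FINE_BIAS) ** FINE_EXPONENT


def sort_hits_by_fine_factor(hits: List[Dict[str, Any]]) -> None:
    """Sort in place. Each score travels with its hit, so a re-sort
    cannot desynchronize scores from documents."""
    hits.sort(key=fine_factor_from_hit, reverse=True)
```

With a separate `fine_scores` list, any re-sort of `hits` would also require re-ordering that list by the same permutation. Reading `_fine_score` from each hit removes that bookkeeping.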