# Reranker 模块

**请求示例**见 `docs/QUICKSTART.md` §3.5。扩展规范见 `docs/DEVELOPER_GUIDE.md` §7。部署与调优实战见 `reranker/DEPLOYMENT_AND_TUNING.md`。`ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF` 的专项接入与调优结论见 `reranker/GGUF_0_6B_INSTALL_AND_TUNING.md`。

---

Reranker 服务提供统一的 `/rerank` API,支持可插拔后端(BGE、Qwen3-vLLM、Qwen3-Transformers、Qwen3-GGUF、DashScope 云重排)。调用方通过 HTTP 访问,不关心具体后端。

**特性**

- 多后端:`qwen3_vllm`、`qwen3_vllm_score`(同模型,vLLM `LLM.score()` + 独立 `.venv-reranker-score`)、`qwen3_transformers`、`qwen3_transformers_packed`(共享前缀 + packed attention mask)、`qwen3_gguf`(Qwen3-Reranker-4B GGUF + llama.cpp)、`qwen3_gguf_06b`(Qwen3-Reranker-0.6B Q8_0 GGUF + llama.cpp)、`bge`(兼容保留)
- 云后端:`dashscope_rerank`(调用 DashScope `/compatible-api/v1/reranks`,支持按地域切换 endpoint)
- 统一配置:`config/config.yaml` → `services.rerank.backend` / `services.rerank.backends.<name>`
- 文档去重、分数与输入顺序一致、FP16/GPU 支持(视后端)

## 目录与入口

- `reranker/server.py`:FastAPI 服务,启动时按配置加载一个后端
- `reranker/backends/`:后端实现与工厂
  - `backends/__init__.py`:`get_rerank_backend(name, config)`
  - `backends/bge.py`:BGE 后端
  - `backends/qwen3_vllm.py`:Qwen3-Reranker-0.6B + vLLM(generate + logprobs)
  - `backends/qwen3_vllm_score.py`:同上模型 + vLLM `LLM.score()`(`requirements_reranker_qwen3_vllm_score.txt` / `.venv-reranker-score`)
  - `backends/qwen3_transformers.py`:Qwen3-Reranker-0.6B 纯 Transformers 后端(官方 Usage 方式)
  - `backends/qwen3_transformers_packed.py`:Qwen3-Reranker-0.6B + Transformers packed 推理(共享 query prefix,适合 `1 query + 400 docs`)
  - `backends/qwen3_gguf.py`:Qwen3-Reranker GGUF + llama.cpp 后端(支持 `qwen3_gguf` / `qwen3_gguf_06b`)
  - `backends/dashscope_rerank.py`:DashScope 云重排后端(HTTP 调用)
- `reranker/bge_reranker.py`:BGE 核心推理(被 bge 后端封装)
- `reranker/config.py`:服务端口、MAX_DOCS、NORMALIZE 等(后端参数在 config.yaml)

## 依赖

- 通用:`torch`、`transformers`、`fastapi`、`uvicorn`(隔离环境见 `requirements_reranker_service.txt`;全量 ML 环境另见 `requirements_ml.txt`)
- **Qwen3-vLLM 后端**:`vllm>=0.8.5`、`transformers>=4.51.0`(`qwen3_vllm` → `.venv-reranker`)
- **Qwen3-vLLM-score 后端**:固定 `vllm==0.18.0`(`qwen3_vllm_score` → `.venv-reranker-score`,见 `requirements_reranker_qwen3_vllm_score.txt`)
- **Qwen3-Transformers 后端**:`transformers>=4.51.0`、`torch`(无需 vLLM,适合 CPU 或小显存)
- **Qwen3-Transformers-Packed 后端**:复用 Transformers 依赖(`qwen3_transformers_packed` → `.venv-reranker-transformers-packed`)
- **Qwen3-GGUF 后端**:`llama-cpp-python>=0.3.16`

现在按 backend 使用独立 venv:

- `qwen3_vllm` -> `.venv-reranker`
- `qwen3_vllm_score` -> `.venv-reranker-score`
- `qwen3_gguf` -> `.venv-reranker-gguf`
- `qwen3_gguf_06b` -> `.venv-reranker-gguf-06b`
- `qwen3_transformers` -> `.venv-reranker-transformers`
- `qwen3_transformers_packed` -> `.venv-reranker-transformers-packed`
- `bge` -> `.venv-reranker-bge`
- `dashscope_rerank` -> `.venv-reranker-dashscope`

```bash
./scripts/setup_reranker_venv.sh qwen3_gguf_06b
```

CUDA 构建建议:

```bash
PATH=/usr/local/cuda/bin:$PATH \
CUDACXX=/usr/local/cuda/bin/nvcc \
CMAKE_ARGS="-DGGML_CUDA=on" \
FORCE_CMAKE=1 \
./.venv-reranker-gguf/bin/pip install --no-cache-dir --force-reinstall --no-build-isolation llama-cpp-python==0.3.18
```

## 配置

- **后端选择**:`config/config.yaml` 中 `services.rerank.backend`(`qwen3_vllm` | `qwen3_vllm_score` | `qwen3_transformers` | `qwen3_transformers_packed` | `qwen3_gguf` | `qwen3_gguf_06b` | `bge` | `dashscope_rerank`),或环境变量 `RERANK_BACKEND`。
- **后端参数**:`services.rerank.backends.bge` / `services.rerank.backends.qwen3_vllm` 等,例如:

```yaml
services:
  rerank:
    backend: "qwen3_gguf"  # 或 qwen3_vllm / bge
    backends:
      bge:
        model_name: "BAAI/bge-reranker-v2-m3"
        device: null
        use_fp16: true
        batch_size: 64
        max_length: 512
        cache_dir: "./model_cache"
        enable_warmup: true
      qwen3_vllm:
        model_name: "Qwen/Qwen3-Reranker-0.6B"
        max_model_len: 256
        infer_batch_size: 64
        sort_by_doc_length: true
        enable_prefix_caching: true
        enforce_eager: false
        instruction: "Given a shopping query, rank product titles by relevance"
      qwen3_transformers:
        model_name: "Qwen/Qwen3-Reranker-0.6B"
        instruction: "Given a shopping query, rank product titles by relevance"
        max_length: 8192
        batch_size: 64
        use_fp16: true
        tensor_parallel_size: 1
        gpu_memory_utilization: 0.8
      qwen3_transformers_packed:
        model_name: "Qwen/Qwen3-Reranker-0.6B"
        instruction: "Rank products by query with category & style match prioritized"
        max_model_len: 4096
        max_doc_len: 160
        max_docs_per_pack: 0
        use_fp16: true
        sort_by_doc_length: true
        attn_implementation: "eager"
      qwen3_gguf:
        repo_id: "DevQuasar/Qwen.Qwen3-Reranker-4B-GGUF"
        filename: "*Q8_0.gguf"
        local_dir: "./models/reranker/qwen3-reranker-4b-gguf"
        cache_dir: "./model_cache"
        instruction: "Rank products by query with category & style match prioritized"
        n_ctx: 384
        n_batch: 384
        n_ubatch: 128
        n_gpu_layers: 24
        flash_attn: true
        offload_kqv: true
        infer_batch_size: 8
        sort_by_doc_length: true
        length_sort_mode: "char"
      qwen3_gguf_06b:
        repo_id: "ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF"
        filename: "qwen3-reranker-0.6b-q8_0.gguf"
        local_dir: "./models/reranker/qwen3-reranker-0.6b-q8_0-gguf"
        cache_dir: "./model_cache"
        instruction: "Rank products by query with category & style match prioritized"
        n_ctx: 256
        n_batch: 256
        n_ubatch: 256
        n_gpu_layers: 999
        infer_batch_size: 32
        sort_by_doc_length: true
        length_sort_mode: "char"
        reuse_query_state: false
      dashscope_rerank:
        model_name: "qwen3-rerank"
        endpoint: "https://dashscope.aliyuncs.com/compatible-api/v1/reranks"
        api_key_env: "RERANK_DASHSCOPE_API_KEY_CN"
        timeout_sec: 15.0
        top_n_cap: 0
        batchsize: 64  # 0 关闭;>0 并发小包调度(top_n/top_n_cap 仍生效,分包后全局截断)
        instruct: "Given a shopping query, rank product titles by relevance"
        max_retries: 2
        retry_backoff_sec: 0.2
```

DashScope endpoint 地域示例:

- 中国:`https://dashscope.aliyuncs.com/compatible-api/v1/reranks`
- 新加坡:`https://dashscope-intl.aliyuncs.com/compatible-api/v1/reranks`
- 美国:`https://dashscope-us.aliyuncs.com/compatible-api/v1/reranks`

DashScope 认证:

- `api_key_env` 必填,表示该后端读取哪个环境变量作为 API Key
- 推荐按地域分别注入:
  - `RERANK_DASHSCOPE_API_KEY_CN=...`
  - `RERANK_DASHSCOPE_API_KEY_US=...`
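上述"配置文件 + `RERANK_BACKEND` 环境变量覆盖"的后端选择逻辑,可以用下面的极简示意说明。`resolve_backend` 是为演示虚构的函数名,实际实现以 `reranker/server.py` 与 `backends/__init__.py` 中的 `get_rerank_backend` 为准:

```python
import os

def resolve_backend(config: dict):
    """示意:RERANK_BACKEND 环境变量优先于 services.rerank.backend,
    后端参数取自 services.rerank.backends.<name>(缺省为空 dict)。"""
    rerank = config.get("services", {}).get("rerank", {})
    name = os.environ.get("RERANK_BACKEND") or rerank.get("backend", "bge")
    params = rerank.get("backends", {}).get(name, {})
    return name, params

cfg = {"services": {"rerank": {
    "backend": "qwen3_gguf_06b",
    "backends": {"qwen3_gguf_06b": {"n_ctx": 256, "n_gpu_layers": 999}},
}}}
name, params = resolve_backend(cfg)  # 未设置 RERANK_BACKEND 时为 qwen3_gguf_06b 及其参数
```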
服务端口、请求限制等仍在 `reranker/config.py`(或环境变量 `RERANKER_PORT`、`RERANKER_HOST`)。

## 运行

```bash
./scripts/start_reranker.sh
```

该脚本会按当前 `services.rerank.backend` 自动选择对应的独立 venv;首次请先执行 `./scripts/setup_reranker_venv.sh <backend>`。

## 性能压测(1000 docs)

```bash
./scripts/benchmark_reranker_1000docs.sh
```

输出目录:`perf_reports//reranker_1000docs/`。

## API

### Health

```
GET /health
```

Response 含 `backend`(当前后端名)、`model`、`model_loaded`、`status`。

### Rerank

```
POST /rerank
Content-Type: application/json

{
  "query": "wireless mouse",
  "docs": ["logitech mx master", "usb cable", "wireless mouse bluetooth"],
  "top_n": 10
}
```

`top_n` 为可选字段:

- 对本地后端(`qwen3_vllm` / `qwen3_transformers` / `qwen3_transformers_packed` / `qwen3_gguf` / `qwen3_gguf_06b` / `bge`)通常会忽略,仍返回全量分数。
- 对 `dashscope_rerank` 可用于控制云端返回的候选量,建议设置为 `page+size`(例如分页 `from=20,size=10` 时传 `30`)。

Response:

```
{
  "scores": [0.93, 0.02, 0.88],
  "meta": {
    "input_docs": 3,
    "usable_docs": 3,
    "unique_docs": 3,
    "dedup_ratio": 0.0,
    "elapsed_ms": 12.4,
    "model": "BAAI/bge-reranker-v2-m3",
    "device": "cuda",
    "fp16": true,
    "batch_size": 64,
    "max_length": 512,
    "normalize": true,
    "service_elapsed_ms": 13.1
  }
}
```

## Logging

The service uses standard Python logging.
For structured logs and full output, run uvicorn with:

```bash
uvicorn reranker.server:app --host 0.0.0.0 --port 6007 --log-level info
```

## Notes

- 无请求级缓存;输入按字符串去重后推理,再按原始顺序回填分数。
- 空或 null 的 doc 跳过并计为 0。
- **Qwen3-vLLM 分批策略**:`docs` 请求体可为 1000+,服务端会按 `infer_batch_size` 拆分;当 `sort_by_doc_length=true` 时,会先按文档长度排序后分批,减少 padding 开销,最终再按输入顺序回填分数。
- 运行时可用环境变量临时覆盖批量参数:`RERANK_VLLM_INFER_BATCH_SIZE`、`RERANK_VLLM_SORT_BY_DOC_LENGTH`。
- **Qwen3-vLLM**:参考 [Qwen3-Reranker-0.6B](https://huggingface.co/Qwen/Qwen3-Reranker-0.6B),需 GPU 与较多显存;与 BGE 相比适合长文本、高吞吐场景(vLLM 前缀缓存)。
- **Qwen3-Transformers**:官方 Transformers Usage 方式,无需 vLLM;适合 CPU 或小显存。默认 `attn_implementation: "sdpa"`;若已安装 `flash_attn` 可设 `flash_attention_2`(未安装时服务会自动回退到 sdpa)。
- **Qwen3-Transformers-Packed**:仍使用 Hugging Face Transformers 与 PyTorch CUDA 内核,只定制 packed 输入、`position_ids` 和 4D `attention_mask`。它更适合在线检索里的"一个 query 对几百个短 doc"场景;默认 `attn_implementation: "eager"` 以保证自定义 mask 兼容性,若你的 `torch/transformers` 版本已验证支持,可再压测 `"sdpa"`。
- **Qwen3-GGUF**:参考 [DevQuasar/Qwen.Qwen3-Reranker-4B-GGUF](https://huggingface.co/DevQuasar/Qwen.Qwen3-Reranker-4B-GGUF)。单卡 T4 且仅剩约 `4.8~6GB` 显存时,推荐 `Q8_0 + n_ctx=384 + n_gpu_layers=24 + flash_attn=true + offload_kqv=true` 起步;若启动 OOM,优先把 `n_gpu_layers` 下调到 `20`,再把 `n_ctx` 下调到 `320`。`infer_batch_size` 在 GGUF 后端是服务侧 work chunk,大多不如 `n_gpu_layers` / `n_ctx` 关键。
- **Qwen3-GGUF-0.6B**:参考 [ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF](https://huggingface.co/ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF)。它的优点是权重小、显存占用低,单进程实测约 `0.9~1.1 GiB`;但在当前 llama.cpp 串行打分接法下,`1 query + 400 titles` 的实测延迟仍约 `265s`。因此它更适合低显存功能后备,不适合作为在线低延迟主 reranker。
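Notes 里"按字符串去重 + 原始顺序回填 + 空 doc 计 0"的行为,可以用下面的示意代码说明(假设性实现,仅还原上述语义,非服务源码;`score_fn` 代表一次真实的模型打分调用):

```python
def dedup_and_backfill(docs, score_fn):
    # 空 / None 的 doc 跳过并计为 0;其余按字符串去重后只推理一次,
    # 再按原始输入顺序回填分数(对应上文 Notes 的描述)。
    unique, index = [], {}
    for d in docs:
        if d and d not in index:
            index[d] = len(unique)
            unique.append(d)
    scores = score_fn(unique)  # 对去重后的 docs 做一次打分
    return [scores[index[d]] if d else 0.0 for d in docs]

# 用字符串长度当假分数演示:重复的 "wireless mouse" 只打分一次,
# 但两个位置都拿到同一分数,空串位置计 0。
demo = dedup_and_backfill(
    ["wireless mouse", "", "usb cable", "wireless mouse"],
    lambda ds: [float(len(d)) for d in ds],
)
# demo == [14.0, 0.0, 9.0, 14.0]
```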
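综合上面的 API 说明,一个最小客户端示意如下(仅用标准库;`RERANK_URL` 中的端口沿用上文 uvicorn 示例,按实际部署调整;`build_request` / `top_indices` 为演示用的假想辅助函数):

```python
import json
from urllib import request

RERANK_URL = "http://127.0.0.1:6007/rerank"  # 端口对应上文 uvicorn 示例

def build_request(query, docs, top_n=None):
    """构造 POST /rerank 的请求体。"""
    body = {"query": query, "docs": docs}
    if top_n is not None:
        body["top_n"] = top_n  # 本地后端通常忽略;dashscope_rerank 用于控制云端返回量
    return json.dumps(body).encode("utf-8")

def top_indices(resp, k):
    """scores 与输入 docs 顺序一致,这里取分数最高的 k 个 doc 下标。"""
    scores = resp["scores"]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

# 实际发送(需服务已启动):
# req = request.Request(RERANK_URL, data=build_request("wireless mouse", [...]),
#                       headers={"Content-Type": "application/json"})
# resp = json.loads(request.urlopen(req).read())

top = top_indices({"scores": [0.93, 0.02, 0.88]}, 2)  # 套用文档中的示例响应
# top == [0, 2]
```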