# Reranker Module
Request examples: docs/QUICKSTART.md §3.5. Extension spec: docs/DEVELOPER_GUIDE.md §7. Deployment and tuning walkthrough: reranker/DEPLOYMENT_AND_TUNING.md. Dedicated setup and tuning notes for ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF: reranker/GGUF_0_6B_INSTALL_AND_TUNING.md.
The reranker service exposes a unified `/rerank` API with pluggable backends (BGE, Qwen3-vLLM, Qwen3-Transformers, Qwen3-GGUF, DashScope cloud rerank). Callers access it over HTTP and never need to know which backend is active.
## Features
- Multiple local backends: `qwen3_vllm`, `qwen3_vllm_score` (same model via vLLM `LLM.score()`, separate `.venv-reranker-score`), `qwen3_transformers`, `qwen3_transformers_packed` (shared prefix + packed attention mask), `qwen3_gguf` (Qwen3-Reranker-4B GGUF + llama.cpp), `qwen3_gguf_06b` (Qwen3-Reranker-0.6B Q8_0 GGUF + llama.cpp), `bge` (kept for compatibility)
- Cloud backend: `dashscope_rerank` (calls DashScope `/compatible-api/v1/reranks`; endpoint switchable by region)
- Unified configuration: `config/config.yaml` → `services.rerank.backend` / `services.rerank.backends.<name>`
- Document deduplication, scores returned in input order, FP16/GPU support (backend-dependent)
## Layout and Entry Points
- `reranker/server.py`: FastAPI service; loads one backend at startup according to the config
- `reranker/backends/`: backend implementations and factory
  - `backends/__init__.py`: `get_rerank_backend(name, config)`
  - `backends/bge.py`: BGE backend
  - `backends/qwen3_vllm.py`: Qwen3-Reranker-0.6B + vLLM (generate + logprobs)
  - `backends/qwen3_vllm_score.py`: same model + vLLM `LLM.score()` (`requirements_reranker_qwen3_vllm_score.txt` / `.venv-reranker-score`)
  - `backends/qwen3_transformers.py`: Qwen3-Reranker-0.6B pure Transformers backend (official Usage style)
  - `backends/qwen3_transformers_packed.py`: Qwen3-Reranker-0.6B + Transformers packed inference (shared query prefix; suited to 1 query + 400 docs)
  - `backends/qwen3_gguf.py`: Qwen3-Reranker GGUF + llama.cpp backend (serves both `qwen3_gguf` and `qwen3_gguf_06b`)
  - `backends/dashscope_rerank.py`: DashScope cloud rerank backend (HTTP)
- `reranker/bge_reranker.py`: BGE core inference (wrapped by the `bge` backend)
- `reranker/config.py`: service port, `MAX_DOCS`, `NORMALIZE`, etc. (backend parameters live in `config.yaml`)
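The factory in `backends/__init__.py` presumably dispatches on the backend name. A minimal registry sketch of that pattern follows; the `DummyBackend` class and the registry layout are illustrative assumptions, not the actual implementation:

```python
# Sketch of the factory pattern behind get_rerank_backend(name, config).
# DummyBackend and the registry contents are hypothetical stand-ins.
from typing import Any, Callable, Dict


class DummyBackend:
    """Stand-in for a concrete backend such as backends/bge.py."""

    def __init__(self, config: Dict[str, Any]):
        self.config = config

    def rerank(self, query: str, docs: list) -> list:
        # A real backend would run model inference here.
        return [0.0] * len(docs)


_REGISTRY: Dict[str, Callable[[Dict[str, Any]], Any]] = {
    "bge": DummyBackend,
    # one entry per backends/*.py in the real factory
}


def get_rerank_backend(name: str, config: Dict[str, Any]):
    """Instantiate the backend registered under `name`."""
    try:
        return _REGISTRY[name](config)
    except KeyError:
        raise ValueError(f"unknown rerank backend: {name}") from None
```

The server only ever holds the returned object, which is what lets backends swap without touching callers.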
## Dependencies
- Common: `torch`, `transformers`, `fastapi`, `uvicorn` (isolated env: `requirements_reranker_service.txt`; the full ML env is in `requirements_ml.txt`)
- Qwen3-vLLM backend: `vllm>=0.8.5`, `transformers>=4.51.0` (`qwen3_vllm` → `.venv-reranker`)
- Qwen3-vLLM-score backend: pinned `vllm==0.18.0` (`qwen3_vllm_score` → `.venv-reranker-score`; see `requirements_reranker_qwen3_vllm_score.txt`)
- Qwen3-Transformers backend: `transformers>=4.51.0`, `torch` (no vLLM; suitable for CPU or small GPUs)
- Qwen3-Transformers-Packed backend: reuses the Transformers dependencies (`qwen3_transformers_packed` → `.venv-reranker-transformers-packed`)
- Qwen3-GGUF backend: `llama-cpp-python>=0.3.16`

Each backend now uses its own venv:

- `qwen3_vllm` → `.venv-reranker`
- `qwen3_vllm_score` → `.venv-reranker-score`
- `qwen3_gguf` → `.venv-reranker-gguf`
- `qwen3_gguf_06b` → `.venv-reranker-gguf-06b`
- `qwen3_transformers` → `.venv-reranker-transformers`
- `qwen3_transformers_packed` → `.venv-reranker-transformers-packed`
- `bge` → `.venv-reranker-bge`
- `dashscope_rerank` → `.venv-reranker-dashscope`

Set up a venv with:

```bash
./scripts/setup_reranker_venv.sh qwen3_gguf_06b
```

Recommended CUDA build:

```bash
PATH=/usr/local/cuda/bin:$PATH \
CUDACXX=/usr/local/cuda/bin/nvcc \
CMAKE_ARGS="-DGGML_CUDA=on" \
FORCE_CMAKE=1 \
./.venv-reranker-gguf/bin/pip install --no-cache-dir --force-reinstall --no-build-isolation llama-cpp-python==0.3.18
```
## Configuration
- Backend selection: `services.rerank.backend` in `config/config.yaml` (`qwen3_vllm` | `qwen3_vllm_score` | `qwen3_transformers` | `qwen3_transformers_packed` | `qwen3_gguf` | `qwen3_gguf_06b` | `bge` | `dashscope_rerank`), or the `RERANK_BACKEND` environment variable.
- Backend parameters: `services.rerank.backends.bge`, `services.rerank.backends.qwen3_vllm`, etc. For example:
```yaml
services:
  rerank:
    backend: "qwen3_gguf"  # or qwen3_vllm / bge
    backends:
      bge:
        model_name: "BAAI/bge-reranker-v2-m3"
        device: null
        use_fp16: true
        batch_size: 64
        max_length: 512
        cache_dir: "./model_cache"
        enable_warmup: true
      qwen3_vllm:
        model_name: "Qwen/Qwen3-Reranker-0.6B"
        max_model_len: 256
        infer_batch_size: 64
        sort_by_doc_length: true
        enable_prefix_caching: true
        enforce_eager: false
        tensor_parallel_size: 1
        gpu_memory_utilization: 0.8
        instruction: "Given a shopping query, rank product titles by relevance"
      qwen3_transformers:
        model_name: "Qwen/Qwen3-Reranker-0.6B"
        instruction: "Given a shopping query, rank product titles by relevance"
        max_length: 8192
        batch_size: 64
        use_fp16: true
      qwen3_transformers_packed:
        model_name: "Qwen/Qwen3-Reranker-0.6B"
        instruction: "Rank products by query with category & style match prioritized"
        max_model_len: 4096
        max_doc_len: 160
        max_docs_per_pack: 0
        use_fp16: true
        sort_by_doc_length: true
        attn_implementation: "eager"
      qwen3_gguf:
        repo_id: "DevQuasar/Qwen.Qwen3-Reranker-4B-GGUF"
        filename: "*Q8_0.gguf"
        local_dir: "./models/reranker/qwen3-reranker-4b-gguf"
        cache_dir: "./model_cache"
        instruction: "Rank products by query with category & style match prioritized"
        n_ctx: 384
        n_batch: 384
        n_ubatch: 128
        n_gpu_layers: 24
        flash_attn: true
        offload_kqv: true
        infer_batch_size: 8
        sort_by_doc_length: true
        length_sort_mode: "char"
      qwen3_gguf_06b:
        repo_id: "ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF"
        filename: "qwen3-reranker-0.6b-q8_0.gguf"
        local_dir: "./models/reranker/qwen3-reranker-0.6b-q8_0-gguf"
        cache_dir: "./model_cache"
        instruction: "Rank products by query with category & style match prioritized"
        n_ctx: 256
        n_batch: 256
        n_ubatch: 256
        n_gpu_layers: 999
        infer_batch_size: 32
        sort_by_doc_length: true
        length_sort_mode: "char"
        reuse_query_state: false
      dashscope_rerank:
        model_name: "qwen3-rerank"
        endpoint: "https://dashscope.aliyuncs.com/compatible-api/v1/reranks"
        api_key_env: "RERANK_DASHSCOPE_API_KEY_CN"
        timeout_sec: 15.0
        top_n_cap: 0
        batchsize: 64  # 0 disables; >0 dispatches concurrent small batches (top_n/top_n_cap still apply, truncated globally after splitting)
        instruct: "Given a shopping query, rank product titles by relevance"
        max_retries: 2
        retry_backoff_sec: 0.2
```
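Backend resolution follows the precedence stated above: `RERANK_BACKEND` in the environment overrides `services.rerank.backend` from the YAML. A sketch of that lookup (the `resolve_backend` helper is hypothetical; only the config paths come from this README):

```python
import os
from typing import Any, Dict, Tuple


def resolve_backend(cfg: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Pick the active backend name and its parameter dict.

    Sketch only: RERANK_BACKEND overrides services.rerank.backend;
    parameters come from services.rerank.backends.<name> (empty dict
    if that backend has no entry).
    """
    rerank_cfg = cfg["services"]["rerank"]
    name = os.environ.get("RERANK_BACKEND") or rerank_cfg["backend"]
    return name, rerank_cfg.get("backends", {}).get(name, {})
```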
DashScope endpoint examples by region:

- China: `https://dashscope.aliyuncs.com/compatible-api/v1/reranks`
- Singapore: `https://dashscope-intl.aliyuncs.com/compatible-api/v1/reranks`
- US: `https://dashscope-us.aliyuncs.com/compatible-api/v1/reranks`
DashScope authentication:

- `api_key_env` is required; it names the environment variable this backend reads its API key from.
- Inject keys per region, e.g. `RERANK_DASHSCOPE_API_KEY_CN=...`, `RERANK_DASHSCOPE_API_KEY_US=...`.
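The `max_retries` / `retry_backoff_sec` settings above suggest a retry loop with growing backoff around each HTTP call. A generic sketch under that assumption (the helper name, the doubling schedule, and retrying on any exception are all illustrative; the real backend may be more selective about which errors it retries):

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(fn: Callable[[], T], max_retries: int = 2,
                    retry_backoff_sec: float = 0.2) -> T:
    """Call fn, retrying up to max_retries extra times on failure.

    Sleeps retry_backoff_sec * 2**attempt between attempts, matching the
    config defaults above (2 retries, 0.2 s base backoff). Sketch only.
    """
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            if attempt >= max_retries:
                raise
            time.sleep(retry_backoff_sec * (2 ** attempt))
            attempt += 1
```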
Service port, request limits, etc. still live in `reranker/config.py` (or the `RERANKER_PORT` / `RERANKER_HOST` environment variables).
## Running

```bash
./scripts/start_reranker.sh
```

The script picks the matching venv for the current `services.rerank.backend` automatically; run `./scripts/setup_reranker_venv.sh <backend>` once before first use.
## Benchmark (1000 docs)

```bash
./scripts/benchmark_reranker_1000docs.sh
```

Output directory: `perf_reports/<date>/reranker_1000docs/`.
## API

### Health

```
GET /health
```

The response includes `backend` (the active backend name), `model`, `model_loaded`, and `status`.
### Rerank

```
POST /rerank
Content-Type: application/json
```

```json
{
  "query": "wireless mouse",
  "docs": ["logitech mx master", "usb cable", "wireless mouse bluetooth"],
  "top_n": 10
}
```
`top_n` is optional:

- Local backends (`qwen3_vllm` / `qwen3_transformers` / `qwen3_transformers_packed` / `qwen3_gguf` / `qwen3_gguf_06b` / `bge`) usually ignore it and still return scores for all docs.
- For `dashscope_rerank` it limits how many candidates the cloud API returns; set it to `from + size` of your pagination (e.g. pass 30 for `from=20, size=10`).
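The `from + size` rule above, with the configured `top_n_cap` still bounding the result, can be made explicit in a tiny helper (the function name is hypothetical; the arithmetic is from the text and config above):

```python
def dashscope_top_n(page_from: int, page_size: int, top_n_cap: int = 0) -> int:
    """top_n to request for a paginated query: offset + page size.

    A positive top_n_cap (per the config above) still bounds the value;
    0 means no cap. Sketch of the rule, not the backend's actual code.
    """
    top_n = page_from + page_size
    if top_n_cap > 0:
        top_n = min(top_n, top_n_cap)
    return top_n
```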
Response:

```json
{
  "scores": [0.93, 0.02, 0.88],
  "meta": {
    "input_docs": 3,
    "usable_docs": 3,
    "unique_docs": 3,
    "dedup_ratio": 0.0,
    "elapsed_ms": 12.4,
    "model": "BAAI/bge-reranker-v2-m3",
    "device": "cuda",
    "fp16": true,
    "batch_size": 64,
    "max_length": 512,
    "normalize": true,
    "service_elapsed_ms": 13.1
  }
}
```
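A minimal stdlib client for this endpoint might look like the following sketch; the payload fields and the `scores` response key follow the examples above, while the port and helper names are illustrative:

```python
import json
import urllib.request


def build_rerank_request(query, docs, top_n=None):
    """Assemble the /rerank JSON body shown in the example above."""
    payload = {"query": query, "docs": docs}
    if top_n is not None:
        payload["top_n"] = top_n
    return payload


def rerank(query, docs, top_n=None, url="http://localhost:6007/rerank"):
    """POST to /rerank and return the scores list (port is illustrative)."""
    body = json.dumps(build_rerank_request(query, docs, top_n)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["scores"]
```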
## Logging

The service uses standard Python logging. For structured logs and full output, run uvicorn with:

```bash
uvicorn reranker.server:app --host 0.0.0.0 --port 6007 --log-level info
```
Notes
- 无请求级缓存;输入按字符串去重后推理,再按原始顺序回填分数。
- 空或 null 的 doc 跳过并计为 0。
- Qwen3-vLLM 分批策略:
docs请求体可为 1000+,服务端会按infer_batch_size拆分;当sort_by_doc_length=true时,会先按文档长度排序后分批,减少 padding 开销,最终再按输入顺序回填分数。 - 运行时可用环境变量临时覆盖批量参数:
RERANK_VLLM_INFER_BATCH_SIZE、RERANK_VLLM_SORT_BY_DOC_LENGTH。 - Qwen3-vLLM:参考 Qwen3-Reranker-0.6B,需 GPU 与较多显存;与 BGE 相比适合长文本、高吞吐场景(vLLM 前缀缓存)。
- Qwen3-Transformers:官方 Transformers Usage 方式,无需 vLLM;适合 CPU 或小显存。默认
attn_implementation: "sdpa";若已安装flash_attn可设flash_attention_2(未安装时服务会自动回退到 sdpa)。 - Qwen3-Transformers-Packed:仍使用 Hugging Face Transformers 与 PyTorch CUDA 内核,只定制 packed 输入、
position_ids和 4Dattention_mask。它更适合在线检索里的“一个 query 对几百个短 doc”场景;默认attn_implementation: "eager"以保证自定义 mask 兼容性,若你的torch/transformers版本已验证支持,可再压测"sdpa"。 - Qwen3-GGUF:参考 DevQuasar/Qwen.Qwen3-Reranker-4B-GGUF。单卡 T4 且仅剩约
4.8~6GB显存时,推荐Q8_0 + n_ctx=384 + n_gpu_layers=24 + flash_attn=true + offload_kqv=true起步;若启动 OOM,优先把n_gpu_layers下调到20,再把n_ctx下调到320。infer_batch_size在 GGUF 后端是服务侧 work chunk,大多不如n_gpu_layers/n_ctx关键。 - Qwen3-GGUF-0.6B:参考 ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF。它的优点是权重小、显存占用低,单进程实测约
0.9~1.1 GiB;但在当前 llama.cpp 串行打分接法下,1 query + 400 titles的实测延迟仍约265s。因此它更适合低显存功能后备,不适合作为在线低延迟主 reranker。
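The dedup, length-sorted batching, and order-preserving backfill described above can be sketched backend-agnostically; `score_docs` and `score_fn` are hypothetical names standing in for the service logic and the model call:

```python
def score_docs(query, docs, score_fn, infer_batch_size=64,
               sort_by_doc_length=True):
    """Dedup docs, batch (length-sorted) for scoring, backfill in input order.

    Sketch of the behavior in the notes above: empty/None docs are skipped
    and scored 0.0; unique docs are optionally sorted by length before being
    split into infer_batch_size chunks (less padding per batch); the returned
    list follows the original input order. score_fn(query, batch) stands in
    for the model and returns one score per doc in the batch.
    """
    usable = [d for d in docs if d]                 # skip empty/None docs
    unique = list(dict.fromkeys(usable))            # dedup, keep first-seen order
    order = sorted(unique, key=len) if sort_by_doc_length else unique
    scores = {}
    for i in range(0, len(order), infer_batch_size):
        batch = order[i:i + infer_batch_size]
        for doc, s in zip(batch, score_fn(query, batch)):
            scores[doc] = s
    return [scores[d] if d else 0.0 for d in docs]  # restore input order
```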