# Reranker Module

Request examples: see `docs/QUICKSTART.md` §3.5. Extension spec: see `docs/DEVELOPER_GUIDE.md` §7. Deployment and tuning practice: see `reranker/DEPLOYMENT_AND_TUNING.md`. Dedicated integration and tuning notes for `ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF`: see `reranker/GGUF_0_6B_INSTALL_AND_TUNING.md`.

---

The reranker service exposes a unified `/rerank` API with pluggable backends (BGE, Qwen3-vLLM, Qwen3-Transformers, Qwen3-GGUF, DashScope cloud rerank). Callers talk to it over HTTP and are agnostic to the concrete backend.

**Features**
- Multiple backends: `qwen3_vllm`, `qwen3_vllm_score` (same model, vLLM `LLM.score()` + separate `.venv-reranker-score`), `qwen3_transformers`, `qwen3_transformers_packed` (shared prefix + packed attention mask), `qwen3_gguf` (Qwen3-Reranker-4B GGUF + llama.cpp), `qwen3_gguf_06b` (Qwen3-Reranker-0.6B Q8_0 GGUF + llama.cpp), `bge` (kept for compatibility)
- Cloud backend: `dashscope_rerank` (calls DashScope `/compatible-api/v1/reranks`, with per-region endpoint switching)
- Unified configuration: `config/config.yaml` → `services.rerank.backend` / `services.rerank.backends.<name>`
- Document deduplication, scores aligned with input order, FP16/GPU support (backend-dependent)

## Layout and entry points
- `reranker/server.py`: FastAPI service; loads one backend at startup according to config
- `reranker/backends/`: backend implementations and factory
  - `backends/__init__.py`: `get_rerank_backend(name, config)`
  - `backends/bge.py`: BGE backend
  - `backends/qwen3_vllm.py`: Qwen3-Reranker-0.6B + vLLM (generate + logprobs)
  - `backends/qwen3_vllm_score.py`: same model + vLLM `LLM.score()` (`requirements_reranker_qwen3_vllm_score.txt` / `.venv-reranker-score`)
  - `backends/qwen3_transformers.py`: Qwen3-Reranker-0.6B pure-Transformers backend (official Usage recipe)
  - `backends/qwen3_transformers_packed.py`: Qwen3-Reranker-0.6B + Transformers packed inference (shared query prefix; suited to `1 query + 400 docs`)
  - `backends/qwen3_gguf.py`: Qwen3-Reranker GGUF + llama.cpp backend (serves both `qwen3_gguf` and `qwen3_gguf_06b`)
  - `backends/dashscope_rerank.py`: DashScope cloud rerank backend (HTTP calls)
- `reranker/bge_reranker.py`: BGE core inference (wrapped by the bge backend)
- `reranker/config.py`: service port, MAX_DOCS, NORMALIZE, etc. (backend parameters live in config.yaml)
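A minimal sketch of the factory dispatch behind `get_rerank_backend(name, config)` (only the function name is from this README; the registry layout and the stub backend class are illustrative, and the real `backends/__init__.py` may differ):

```python
# Illustrative name -> backend-class registry; `_StubBackend` stands in
# for the real backend implementations.
class _StubBackend:
    def __init__(self, config):
        self.config = config

_REGISTRY = {name: _StubBackend for name in (
    "bge", "qwen3_vllm", "qwen3_vllm_score", "qwen3_transformers",
    "qwen3_transformers_packed", "qwen3_gguf", "qwen3_gguf_06b",
    "dashscope_rerank",
)}

def get_rerank_backend(name, config):
    # Resolve the configured backend name to an instance; server.py
    # would call this once at startup.
    if name not in _REGISTRY:
        raise ValueError(f"unknown rerank backend: {name}")
    return _REGISTRY[name](config)

backend = get_rerank_backend("bge", {"model_name": "BAAI/bge-reranker-v2-m3"})
print(backend.config["model_name"])  # -> BAAI/bge-reranker-v2-m3
```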
  
## Dependencies
- Common: `torch`, `transformers`, `fastapi`, `uvicorn` (isolated env: see `requirements_reranker_service.txt`; full ML env: see `requirements_ml.txt`)
- **Qwen3-vLLM backend**: `vllm>=0.8.5`, `transformers>=4.51.0` (`qwen3_vllm` → `.venv-reranker`)
- **Qwen3-vLLM-score backend**: pinned `vllm==0.18.0` (`qwen3_vllm_score` → `.venv-reranker-score`; see `requirements_reranker_qwen3_vllm_score.txt`)
- **Qwen3-Transformers backend**: `transformers>=4.51.0`, `torch` (no vLLM needed; suited to CPU or small-VRAM GPUs)
- **Qwen3-Transformers-Packed backend**: reuses the Transformers dependencies (`qwen3_transformers_packed` → `.venv-reranker-transformers-packed`)
- **Qwen3-GGUF backend**: `llama-cpp-python>=0.3.16`
- Each backend now uses its own venv:
  - `qwen3_vllm` -> `.venv-reranker`
  - `qwen3_vllm_score` -> `.venv-reranker-score`
  - `qwen3_gguf` -> `.venv-reranker-gguf`
  - `qwen3_gguf_06b` -> `.venv-reranker-gguf-06b`
  - `qwen3_transformers` -> `.venv-reranker-transformers`
  - `qwen3_transformers_packed` -> `.venv-reranker-transformers-packed`
  - `bge` -> `.venv-reranker-bge`
  - `dashscope_rerank` -> `.venv-reranker-dashscope`
  ```bash
  ./scripts/setup_reranker_venv.sh qwen3_gguf_06b
  ```
  Suggested CUDA build:
  ```bash
  PATH=/usr/local/cuda/bin:$PATH \
  CUDACXX=/usr/local/cuda/bin/nvcc \
  CMAKE_ARGS="-DGGML_CUDA=on" \
  FORCE_CMAKE=1 \
  ./.venv-reranker-gguf/bin/pip install --no-cache-dir --force-reinstall --no-build-isolation llama-cpp-python==0.3.18
  ```
  
## Configuration
- **Backend selection**: `services.rerank.backend` in `config/config.yaml` (`qwen3_vllm` | `qwen3_vllm_score` | `qwen3_transformers` | `qwen3_transformers_packed` | `qwen3_gguf` | `qwen3_gguf_06b` | `bge` | `dashscope_rerank`), or the `RERANK_BACKEND` environment variable
- **Backend parameters**: `services.rerank.backends.bge` / `services.rerank.backends.qwen3_vllm`, for example:
  
  ```yaml
  services:
    rerank:
    backend: "qwen3_gguf"   # or qwen3_vllm / bge
      backends:
        bge:
          model_name: "BAAI/bge-reranker-v2-m3"
          device: null
          use_fp16: true
          batch_size: 64
          max_length: 512
          cache_dir: "./model_cache"
          enable_warmup: true
        qwen3_vllm:
          model_name: "Qwen/Qwen3-Reranker-0.6B"
          max_model_len: 256
          infer_batch_size: 64
          sort_by_doc_length: true
          enable_prefix_caching: true
          enforce_eager: false
          instruction: "Given a shopping query, rank product titles by relevance"
        qwen3_transformers:
          model_name: "Qwen/Qwen3-Reranker-0.6B"
          instruction: "Given a shopping query, rank product titles by relevance"
          max_length: 8192
          batch_size: 64
          use_fp16: true
        tensor_parallel_size: 1
        gpu_memory_utilization: 0.8
        qwen3_transformers_packed:
          model_name: "Qwen/Qwen3-Reranker-0.6B"
          instruction: "Rank products by query with category & style match prioritized"
          max_model_len: 4096
          max_doc_len: 160
          max_docs_per_pack: 0
          use_fp16: true
          sort_by_doc_length: true
          attn_implementation: "eager"
        qwen3_gguf:
          repo_id: "DevQuasar/Qwen.Qwen3-Reranker-4B-GGUF"
          filename: "*Q8_0.gguf"
          local_dir: "./models/reranker/qwen3-reranker-4b-gguf"
          cache_dir: "./model_cache"
          instruction: "Rank products by query with category & style match prioritized"
          n_ctx: 384
          n_batch: 384
          n_ubatch: 128
          n_gpu_layers: 24
          flash_attn: true
          offload_kqv: true
          infer_batch_size: 8
          sort_by_doc_length: true
          length_sort_mode: "char"
        qwen3_gguf_06b:
          repo_id: "ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF"
          filename: "qwen3-reranker-0.6b-q8_0.gguf"
          local_dir: "./models/reranker/qwen3-reranker-0.6b-q8_0-gguf"
          cache_dir: "./model_cache"
          instruction: "Rank products by query with category & style match prioritized"
          n_ctx: 256
          n_batch: 256
          n_ubatch: 256
          n_gpu_layers: 999
          infer_batch_size: 32
          sort_by_doc_length: true
          length_sort_mode: "char"
          reuse_query_state: false
        dashscope_rerank:
          model_name: "qwen3-rerank"
          endpoint: "https://dashscope.aliyuncs.com/compatible-api/v1/reranks"
          api_key_env: "RERANK_DASHSCOPE_API_KEY_CN"
          timeout_sec: 15.0
          top_n_cap: 0
        batchsize: 64  # 0 disables; >0 dispatches concurrent mini-batches (top_n/top_n_cap still apply; global truncation after splitting)
          instruct: "Given a shopping query, rank product titles by relevance"
          max_retries: 2
          retry_backoff_sec: 0.2
  ```
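The backend-selection rule (environment variable `RERANK_BACKEND` overrides `services.rerank.backend` in config) can be sketched as follows; the helper name `active_backend` is illustrative:

```python
import os

def active_backend(config: dict) -> str:
    # RERANK_BACKEND, if set, overrides config.yaml's
    # services.rerank.backend (as stated in the Configuration section).
    return os.environ.get("RERANK_BACKEND") or config["services"]["rerank"]["backend"]

cfg = {"services": {"rerank": {"backend": "qwen3_gguf"}}}
os.environ.pop("RERANK_BACKEND", None)
print(active_backend(cfg))  # -> qwen3_gguf
os.environ["RERANK_BACKEND"] = "bge"
print(active_backend(cfg))  # -> bge
```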
  
DashScope endpoint examples by region:
- China: `https://dashscope.aliyuncs.com/compatible-api/v1/reranks`
- Singapore: `https://dashscope-intl.aliyuncs.com/compatible-api/v1/reranks`
- US: `https://dashscope-us.aliyuncs.com/compatible-api/v1/reranks`

DashScope authentication:
- `api_key_env` is required; it names the environment variable this backend reads the API key from
- Recommended: inject keys per region:
  - `RERANK_DASHSCOPE_API_KEY_CN=...`
  - `RERANK_DASHSCOPE_API_KEY_US=...`
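A minimal sketch of the indirection described above, where the backend config names the env var rather than storing the key; the helper name `resolve_api_key` is illustrative:

```python
import os

def resolve_api_key(backend_cfg: dict) -> str:
    # api_key_env names the env var holding the key; the key itself is
    # never stored in config.yaml.
    env_name = backend_cfg["api_key_env"]
    key = os.environ.get(env_name)
    if not key:
        raise RuntimeError(f"missing API key: set {env_name}")
    return key

os.environ["RERANK_DASHSCOPE_API_KEY_CN"] = "sk-demo"  # demo value only
cfg = {"api_key_env": "RERANK_DASHSCOPE_API_KEY_CN"}
print(resolve_api_key(cfg))  # -> sk-demo
```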

- Service port and request limits remain in `reranker/config.py` (or environment variables `RERANKER_PORT`, `RERANKER_HOST`).
  
## Running
```bash
./scripts/start_reranker.sh
```
The script picks the matching per-backend venv based on the current `services.rerank.backend`; on first use, run `./scripts/setup_reranker_venv.sh <backend>` beforehand.
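The backend-to-venv mapping listed under Dependencies amounts to a lookup table; an illustrative sketch (the real logic lives in shell inside `scripts/start_reranker.sh`):

```python
# Backend -> venv mapping as listed in this README.
VENVS = {
    "qwen3_vllm": ".venv-reranker",
    "qwen3_vllm_score": ".venv-reranker-score",
    "qwen3_gguf": ".venv-reranker-gguf",
    "qwen3_gguf_06b": ".venv-reranker-gguf-06b",
    "qwen3_transformers": ".venv-reranker-transformers",
    "qwen3_transformers_packed": ".venv-reranker-transformers-packed",
    "bge": ".venv-reranker-bge",
    "dashscope_rerank": ".venv-reranker-dashscope",
}

def venv_for(backend: str) -> str:
    # Fail loudly on typos rather than starting in the wrong env.
    if backend not in VENVS:
        raise ValueError(f"unknown backend: {backend}")
    return VENVS[backend]

print(venv_for("qwen3_gguf_06b"))  # -> .venv-reranker-gguf-06b
```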

## Performance benchmark (1000 docs)
```bash
./scripts/benchmark_reranker_1000docs.sh
```
Output directory: `perf_reports/<date>/reranker_1000docs/`
  
## API
### Health
```
GET /health
```
The response includes `backend` (current backend name), `model`, `model_loaded`, and `status`.
  
  ### Rerank
  ```
  POST /rerank
  Content-Type: application/json
  
  {
    "query": "wireless mouse",
    "docs": ["logitech mx master", "usb cable", "wireless mouse bluetooth"],
    "top_n": 10
  }
  ```
  
`top_n` is optional:
- Local backends (`qwen3_vllm` / `qwen3_transformers` / `qwen3_transformers_packed` / `qwen3_gguf` / `qwen3_gguf_06b` / `bge`) generally ignore it and still return scores for all docs.
- For `dashscope_rerank` it caps how many candidates the cloud API returns; set it to `page + size` (e.g. pass `30` for pagination `from=20, size=10`).

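The `page + size` rule can be written as a tiny helper (the name `top_n_for_page` is illustrative, not from the codebase):

```python
def top_n_for_page(from_: int, size: int, top_n_cap: int = 0) -> int:
    # top_n must cover every candidate up to the end of the requested
    # page; top_n_cap > 0 applies a global ceiling (0 means uncapped),
    # mirroring the top_n_cap option in the dashscope_rerank config.
    n = from_ + size
    if top_n_cap > 0:
        n = min(n, top_n_cap)
    return n

print(top_n_for_page(20, 10))  # -> 30
```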
  Response:
  ```
  {
    "scores": [0.93, 0.02, 0.88],
    "meta": {
      "input_docs": 3,
      "usable_docs": 3,
      "unique_docs": 3,
      "dedup_ratio": 0.0,
      "elapsed_ms": 12.4,
      "model": "BAAI/bge-reranker-v2-m3",
      "device": "cuda",
      "fp16": true,
      "batch_size": 64,
      "max_length": 512,
      "normalize": true,
      "service_elapsed_ms": 13.1
    }
  }
  ```
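Because `scores` is aligned with the input `docs` order (see Notes), a caller gets the ranking by zipping the two lists and sorting; a minimal client-side sketch:

```python
docs = ["logitech mx master", "usb cable", "wireless mouse bluetooth"]
scores = [0.93, 0.02, 0.88]  # response "scores", aligned with input order

# Pair each doc with its score, then sort descending for the ranking.
ranked = sorted(zip(docs, scores), key=lambda p: p[1], reverse=True)
top_1 = ranked[0][0]
print(top_1)  # -> logitech mx master
```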
  
  ## Logging
  The service uses standard Python logging. For structured logs and full output,
  run uvicorn with:
  ```bash
  uvicorn reranker.server:app --host 0.0.0.0 --port 6007 --log-level info
  ```
  
  ## Notes
- No request-level caching; inputs are deduplicated by exact string before inference, then scores are backfilled in the original input order.
- Empty or null docs are skipped and scored 0.
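A minimal sketch of that dedup-and-backfill contract (`score_fn` stands in for the real backend; the helper name is illustrative):

```python
def rerank_with_dedup(docs, score_fn):
    # Score each unique non-empty doc once, then backfill per input
    # slot; empty/None docs are skipped and scored 0 (as noted above).
    unique, seen = [], {}
    for d in docs:
        if d and d not in seen:
            seen[d] = len(unique)
            unique.append(d)
    unique_scores = score_fn(unique)
    return [unique_scores[seen[d]] if d and d in seen else 0.0 for d in docs]

fake = lambda ds: [float(len(d)) for d in ds]  # stand-in scorer
print(rerank_with_dedup(["a", "", "bb", "a", None], fake))
# -> [1.0, 0.0, 2.0, 1.0, 0.0]
```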
- **Qwen3-vLLM batching**: the request `docs` array can hold 1000+ entries; the server splits it by `infer_batch_size`. With `sort_by_doc_length=true`, docs are sorted by length before batching to reduce padding overhead, and scores are backfilled in input order at the end.
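The sort-by-length, batch, then restore-order pattern can be sketched as (names illustrative; the real backend does this around vLLM calls):

```python
def batched_scores(docs, score_batch, batch_size=64, sort_by_len=True):
    # Sort indices by doc length so each batch has similar lengths
    # (less padding), score in chunks, then restore input order.
    order = (sorted(range(len(docs)), key=lambda i: len(docs[i]))
             if sort_by_len else list(range(len(docs))))
    scores = [0.0] * len(docs)
    for start in range(0, len(order), batch_size):
        chunk = order[start:start + batch_size]
        for i, s in zip(chunk, score_batch([docs[i] for i in chunk])):
            scores[i] = s
    return scores

fake = lambda batch: [float(len(d)) for d in batch]  # stand-in scorer
print(batched_scores(["ccc", "a", "bb"], fake, batch_size=2))
# -> [3.0, 1.0, 2.0]
```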
- Batch parameters can be overridden temporarily at runtime via environment variables: `RERANK_VLLM_INFER_BATCH_SIZE`, `RERANK_VLLM_SORT_BY_DOC_LENGTH`.
- **Qwen3-vLLM**: see [Qwen3-Reranker-0.6B](https://huggingface.co/Qwen/Qwen3-Reranker-0.6B); needs a GPU and substantial VRAM. Compared with BGE, it suits long-text, high-throughput scenarios (vLLM prefix caching).
- **Qwen3-Transformers**: the official Transformers Usage recipe, no vLLM needed; suits CPU or small-VRAM GPUs. Default `attn_implementation: "sdpa"`; if `flash_attn` is installed you can set `flash_attention_2` (the service automatically falls back to sdpa when it is not installed).
- **Qwen3-Transformers-Packed**: still uses Hugging Face Transformers with PyTorch CUDA kernels, customizing only the packed inputs, `position_ids`, and 4D `attention_mask`. Best suited to the online-retrieval pattern of one query against a few hundred short docs. Defaults to `attn_implementation: "eager"` for custom-mask compatibility; if your `torch`/`transformers` versions are verified to support it, benchmark `"sdpa"` as well.
- **Qwen3-GGUF**: see [DevQuasar/Qwen.Qwen3-Reranker-4B-GGUF](https://huggingface.co/DevQuasar/Qwen.Qwen3-Reranker-4B-GGUF). On a single T4 with only about `4.8~6GB` of free VRAM, start from `Q8_0 + n_ctx=384 + n_gpu_layers=24 + flash_attn=true + offload_kqv=true`; if startup OOMs, first lower `n_gpu_layers` to `20`, then lower `n_ctx` to `320`. In the GGUF backend, `infer_batch_size` is a server-side work chunk and usually matters less than `n_gpu_layers` / `n_ctx`.
- **Qwen3-GGUF-0.6B**: see [ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF](https://huggingface.co/ggml-org/Qwen3-Reranker-0.6B-Q8_0-GGUF). Its upside is small weights and low VRAM use, measured at roughly `0.9~1.1 GiB` per process; but under the current serial llama.cpp scoring approach, `1 query + 400 titles` still measures around `265s` of latency. It therefore serves as a low-VRAM functional fallback, not as an online low-latency primary reranker.