# Reranker benchmark: qwen3_vllm vs qwen3_vllm_score × instruction_format

Date: 2026-03-25
Host: single GPU (Tesla T4, ~16 GiB), CUDA 12.8 (see `nvidia-smi` during the run).
## Configuration (from `config/config.yaml`)

Shared across both backends for this run:

| Key | Value |
|---|---|
| `model_name` | `Qwen/Qwen3-Reranker-0.6B` |
| `max_model_len` | 160 |
| `infer_batch_size` | 100 |
| `sort_by_doc_length` | true |
| `enable_prefix_caching` | true |
| `enforce_eager` | false |
| `dtype` | float16 |
| `tensor_parallel_size` | 1 |
| `gpu_memory_utilization` | 0.20 |
| `instruction` | Rank products by query with category & style match prioritized |
- `qwen3_vllm` uses vLLM generate + logprobs (`.venv-reranker`).
- `qwen3_vllm_score` uses vLLM `LLM.score()` (`.venv-reranker-score`, pinned vLLM stack per `reranker/README.md`).
## Methodology

- Script: `python scripts/benchmark_reranker_random_titles.py 100,200,400,600,800,1000 --repeat 5` with `--seed 99` (see note below), `--quiet-runs`, `--timeout 360`.
- Titles: default file `/home/ubuntu/rerank_test/titles.1.8w` (one title per line).
- Query: default `健身女生T恤短袖`.
- Each scenario: 3 warm-up requests at n=400 (not timed), then 5 timed runs per n.
- Metric: client wall time for `POST /rerank` (localhost), in milliseconds.
- After each `services.rerank.backend` / `instruction_format` change: `./restart.sh reranker`, then `GET /health` until `backend` and `instruction_format` matched the intended scenario (`reranker/server.py` was extended to expose `instruction_format` when the backend defines `_instruction_format`).
Note on RNG seed: with `--seed 42`, some scenarios occasionally lost one sample at n=600 (non-200 response or transport error). All figures below use `--seed 99`, so every cell has 5/5 successful runs and comparable sampled titles.
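The timed-run loop above can be sketched as follows. This is a minimal sketch, not the actual `benchmark_reranker_random_titles.py` implementation; the `/rerank` payload shape (`query` + `documents`) and the helper names are assumptions.

```python
import json
import statistics
import time
import urllib.request


def time_rerank(url: str, query: str, titles: list[str], timeout: float = 360.0) -> float:
    """POST one /rerank request and return client wall time in milliseconds.

    NOTE: the request body shape is an assumption, not the service's documented schema.
    """
    body = json.dumps({"query": query, "documents": titles}).encode("utf-8")
    req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        resp.read()  # drain the body so the full response is included in the timing
    return (time.perf_counter() - start) * 1000.0


def summarize(values_ms: list[float]) -> tuple[float, float]:
    """Mean and sample stdev over the timed runs, matching the JSON aggregates' fields."""
    return statistics.mean(values_ms), statistics.stdev(values_ms)


# Usage against a running service (port is an assumption):
#   runs = [time_rerank("http://localhost:8000/rerank", "健身女生T恤短袖", titles[:400])
#           for _ in range(5)]
#   mean_ms, stdev_ms = summarize(runs)
```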
## Raw artifacts

JSON aggregates (means, stdev, raw `values_ms`): same directory, `qwen3_vllm_{compact,standard}.json`, `qwen3_vllm_score_{compact,standard}.json`.
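A small helper for turning those aggregates back into table rows. The key layout (per-n objects with a `mean_ms` field) is an assumption about the JSON structure, not its documented schema.

```python
import json


def load_aggregate(path: str) -> dict:
    """Load one backend/format aggregate file (e.g. qwen3_vllm_compact.json)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def row_cells(agg: dict) -> list[str]:
    """Render the per-n means as one-decimal markdown table cells.

    Assumes the aggregate is keyed by n (as a string) with a `mean_ms` field.
    """
    return [f"{agg[str(n)]['mean_ms']:.1f}" for n in (100, 200, 400, 600, 800, 1000)]
```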
## Results — mean latency (ms)

| backend | instruction_format | n=100 | n=200 | n=400 | n=600 | n=800 | n=1000 |
|---|---|---|---|---|---|---|---|
| qwen3_vllm | compact | 213.5 | 418.0 | 861.4 | 1263.4 | 1744.3 | 2162.2 |
| qwen3_vllm | standard | 254.9 | 475.4 | 909.7 | 1353.2 | 1912.5 | 2406.7 |
| qwen3_vllm_score | compact | 239.2 | 480.2 | 966.2 | 1433.5 | 1937.2 | 2428.4 |
| qwen3_vllm_score | standard | 299.6 | 591.8 | 1178.9 | 1773.7 | 2341.6 | 2931.7 |
## Short interpretation

- compact vs standard: for both backends, `compact` is faster on this setup (shorter chat template versus the fixed yes/no system prompt plus user block; see `reranker/backends/qwen3_vllm.py` / `qwen3_vllm_score.py`).
- qwen3_vllm vs qwen3_vllm_score: at n=1000, `qwen3_vllm` + `compact` is the fastest row (~2162 ms mean); `qwen3_vllm_score` + `standard` is the slowest (~2932 ms). The ordering can change on other GPUs, vLLM versions, or batching settings.
- Repo / ops defaults (current): `services.rerank.backend` is usually `qwen3_vllm_score`; the score block recommends `instruction_format: compact` (matching the backend code default). The `qwen3_vllm` block's `instruction_format` can be configured independently for the generate backend and need not match the score backend.
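The cross-row gap quoted above can be reproduced directly from the results table (inputs are the published means; the helper name is ours):

```python
def pct_slower(slow_ms: float, fast_ms: float) -> float:
    """How much slower `slow_ms` is than `fast_ms`, in percent."""
    return (slow_ms - fast_ms) / fast_ms * 100.0


# n=1000: qwen3_vllm_score + standard (2931.7 ms) vs qwen3_vllm + compact (2162.2 ms)
gap = pct_slower(2931.7, 2162.2)  # ≈ 35.6% slower
```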
## Tooling added / changed

- `reranker/server.py`: `/health` includes `instruction_format` when the active backend sets `_instruction_format`.
- `scripts/benchmark_reranker_random_titles.py`: `--tag`, `--json-summary-out`, `--quiet-runs`.
- `scripts/patch_rerank_vllm_benchmark_config.py`: surgical YAML patch (preserves newlines).
- `scripts/run_reranker_vllm_instruction_benchmark.sh`: full matrix driver (continues if a benchmark exits non-zero; uses `--timeout 360`).
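The post-restart wait from §Methodology can be sketched like this. The `/health` field names (`backend`, `instruction_format`) follow the description above; the polling loop itself and its function names are ours.

```python
import json
import time
import urllib.request


def health_matches(payload: dict, backend: str, fmt: str) -> bool:
    """True when a /health payload reports the intended benchmark scenario."""
    return payload.get("backend") == backend and payload.get("instruction_format") == fmt


def wait_for_scenario(url: str, backend: str, fmt: str, timeout_s: float = 120.0) -> None:
    """Poll GET /health until it reports the intended backend + instruction_format."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if health_matches(json.load(resp), backend, fmt):
                    return
        except OSError:
            pass  # service still restarting; keep polling
        time.sleep(1.0)
    raise TimeoutError(f"{url} never reported backend={backend} format={fmt}")
```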
## Addendum: qwen3_vllm_score after attention auto-select (FLASHINFER on T4)

Do not replace the table above — it records the older `qwen3_vllm_score` behaviour (roughly: on sm<8 the backend injected `attention_config` / TRITON_ATTN into vLLM, and the code default for `instruction_format` used to be `standard`).
### What changed in code / ops

| Area | Before (baseline table) | After (this addendum) |
|---|---|---|
| Attention | Backend forced / steered attention on T4 (e.g. TRITON_ATTN path) | No `attention_config` in `LLM(...)`; vLLM auto-selects (on this T4 run, logs show FLASHINFER) |
| Config surface | `vllm_attention_backend` / `RERANK_VLLM_ATTENTION_BACKEND`, etc. | Removed (fewer YAML / env-var branches; logic consolidated) |
| Code default `instruction_format` | `qwen3_vllm_score` defaulted to `standard` | Aligned with `qwen3_vllm` as `compact` (YAML can still set `standard`) |
| Smoke / startup | — | `scripts/smoke_qwen3_vllm_score_backend.py`; `scripts/start_reranker.sh` puts the venv bin on PATH (the FLASHINFER JIT depends on `ninja` inside the venv) |
Micro-benchmark (same machine, isolated): ~927.5 ms → ~673.1 ms at n=400 docs on `LLM.score()` steady state (a ~28% reduction), after removing the forced attention path and letting vLLM pick FLASHINFER.
## Re-benchmark (HTTP POST /rerank, same methodology as §Methodology)

- Purpose: same comparison axis as the main table (`qwen3_vllm_score` only), after the FLASHINFER-friendly backend change.
- Controlled for `max_model_len`: `services.rerank.backends.qwen3_vllm_score.max_model_len` set to 160 for this run so numbers are comparable to the baseline rows (also 160). Production `config.yaml` may use a different value (e.g. 196); adjust the YAML before repeating the benchmark if you need prod-shaped latency.
- Seed / repeats: `--seed 99`, `--repeat 5`, same script and title file as §Methodology.
- Artifacts: `qwen3_vllm_score_compact_post_flashinfer_opt.json`, `qwen3_vllm_score_standard_post_flashinfer_opt.json`.
### qwen3_vllm_score — mean latency (ms), post optimization

| instruction_format | n=100 | n=200 | n=400 | n=600 | n=800 | n=1000 | vs baseline same row (approx.) |
|---|---|---|---|---|---|---|---|
| compact | 178.5 | 351.7 | 688.2 | 1024.0 | 1375.8 | 1752.4 | e.g. n=400 −28.8%, n=1000 −27.8% vs 966.2 / 2428.4 |
| standard | 198.4 | 386.4 | 762.8 | 1174.6 | 1540.1 | 1970.1 | e.g. n=400 −33.9%, n=1000 −33.3% vs 1178.9 / 2931.7 |
What this version changes for `instruction_format: standard`: it now shares the same vLLM attention auto-selection as `compact`, and TRITON_ATTN is no longer pinned on T4. Its prompt is still longer than `compact`'s (fixed yes/no system prompt plus the official prefix template), so absolute latency remains higher than `compact`, but the reduction relative to the old `standard` row is of the same magnitude as `compact`'s (table above).
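The "vs baseline" column can be recomputed from the two tables (inputs are the published means; the helper name is ours):

```python
def reduction_pct(baseline_ms: float, new_ms: float) -> float:
    """Latency reduction vs baseline, in percent (positive = faster)."""
    return (baseline_ms - new_ms) / baseline_ms * 100.0


# compact row: baseline 966.2 / 2428.4 ms -> post-optimization 688.2 / 1752.4 ms
n400_delta = reduction_pct(966.2, 688.2)     # ≈ 28.8%
n1000_delta = reduction_pct(2428.4, 1752.4)  # ≈ 27.8%
```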
Takeaway: under the T4 + vLLM 0.18 score path, auto attention selection (FLASHINFER) plus the `compact` default brings `qwen3_vllm_score` much closer to the `qwen3_vllm` timings in the baseline matrix; re-run the full 4-way matrix if you need refreshed `qwen3_vllm` rows on the same commit.