
# Reranker benchmark: qwen3_vllm vs qwen3_vllm_score × instruction_format

Date: 2026-03-25
Host: single GPU (Tesla T4, ~16 GiB), CUDA 12.8 (per nvidia-smi output during the run).

## Configuration (from config/config.yaml)

Shared across both backends for this run:

| Key | Value |
|---|---|
| model_name | Qwen/Qwen3-Reranker-0.6B |
| max_model_len | 160 |
| infer_batch_size | 100 |
| sort_by_doc_length | true |
| enable_prefix_caching | true |
| enforce_eager | false |
| dtype | float16 |
| tensor_parallel_size | 1 |
| gpu_memory_utilization | 0.20 |
| instruction | Rank products by query with category & style match prioritized |

qwen3_vllm uses vLLM generate + logprobs (.venv-reranker).
qwen3_vllm_score uses vLLM LLM.score() (.venv-reranker-score, pinned vLLM stack per reranker/README.md).

## Methodology

  • Script: python scripts/benchmark_reranker_random_titles.py 100,200,400,600,800,1000 --repeat 5 --seed 99 (see the RNG note below), with --quiet-runs and --timeout 360.
  • Titles: default file /home/ubuntu/rerank_test/titles.1.8w (one title per line).
  • Query: the default, 健身女生T恤短袖 ("women's fitness short-sleeve T-shirt").
  • Each scenario: 3 warm-up requests at n=400 (not timed), then 5 timed runs per n.
  • Metric: client-side wall time for POST /rerank (localhost), in milliseconds.
  • After each change to services.rerank.backend or instruction_format: ./restart.sh reranker, then poll GET /health until backend and instruction_format match the intended scenario (reranker/server.py was extended to expose instruction_format whenever the backend defines _instruction_format).
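The restart-then-poll step can be sketched as follows. This is a minimal sketch: the /health JSON field names (`backend`, `instruction_format`) are assumed from the description above, and the fetcher is injectable so the loop can be exercised without a live service.

```python
import json
import time
import urllib.request


def wait_for_scenario(base_url, backend, instruction_format,
                      timeout_s=120.0, fetch=None):
    """Poll GET /health until it reports the intended backend and
    instruction_format, or raise TimeoutError.

    `fetch` is injectable for testing; by default it does a real HTTP GET.
    """
    if fetch is None:
        def fetch():
            with urllib.request.urlopen(base_url + "/health", timeout=5) as r:
                return json.load(r)

    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            health = fetch()
        except OSError:
            health = {}  # service still restarting; treat as "not ready"
        if (health.get("backend") == backend
                and health.get("instruction_format") == instruction_format):
            return health
        time.sleep(1.0)
    raise TimeoutError(f"/health never reported {backend}/{instruction_format}")
```

Injecting a fake fetcher also makes the scenario switch scriptable from the benchmark driver without shelling out to curl.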

Note on the RNG seed: with --seed 42, some runs occasionally lost one sample at n=600 (a non-200 response or transport error). All figures below use --seed 99, so every cell has 5/5 successful runs and comparable sampled titles.

## Raw artifacts

JSON aggregates (means, stdev, raw values_ms): in the same directory — qwen3_vllm_{compact,standard}.json and qwen3_vllm_score_{compact,standard}.json.
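The per-cell aggregates can be reproduced from the raw values_ms arrays. A minimal sketch (the exact JSON field names in the artifacts are as listed above; the sample values here are illustrative):

```python
import statistics


def summarize(values_ms):
    """Aggregate one benchmark cell: raw per-run wall times in ms."""
    return {
        "runs": len(values_ms),
        "mean_ms": statistics.fmean(values_ms),
        "stdev_ms": statistics.stdev(values_ms) if len(values_ms) > 1 else 0.0,
    }


# Example: five timed runs for one (backend, instruction_format, n) cell.
cell = summarize([860.1, 859.7, 862.9, 861.5, 862.8])
print(round(cell["mean_ms"], 1))  # prints 861.4
```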

## Results — mean latency (ms)

| backend | instruction_format | n=100 | n=200 | n=400 | n=600 | n=800 | n=1000 |
|---|---|---|---|---|---|---|---|
| qwen3_vllm | compact | 213.5 | 418.0 | 861.4 | 1263.4 | 1744.3 | 2162.2 |
| qwen3_vllm | standard | 254.9 | 475.4 | 909.7 | 1353.2 | 1912.5 | 2406.7 |
| qwen3_vllm_score | compact | 239.2 | 480.2 | 966.2 | 1433.5 | 1937.2 | 2428.4 |
| qwen3_vllm_score | standard | 299.6 | 591.8 | 1178.9 | 1773.7 | 2341.6 | 2931.7 |
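From the table, the standard-vs-compact overhead per backend can be computed directly (a quick arithmetic check; values copied from the n=1000 column above):

```python
# Mean latency at n=1000 from the results table above (ms).
rows = {
    ("qwen3_vllm", "compact"): 2162.2,
    ("qwen3_vllm", "standard"): 2406.7,
    ("qwen3_vllm_score", "compact"): 2428.4,
    ("qwen3_vllm_score", "standard"): 2931.7,
}

for backend in ("qwen3_vllm", "qwen3_vllm_score"):
    compact = rows[(backend, "compact")]
    standard = rows[(backend, "standard")]
    overhead_pct = (standard - compact) / compact * 100
    print(f"{backend}: standard is {overhead_pct:.1f}% slower at n=1000")
# qwen3_vllm: +11.3%; qwen3_vllm_score: +20.7%
```

The overhead is roughly twice as large on the score backend, consistent with interpretation point 1 below.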

## Short interpretation

  1. compact vs standard: for both backends, compact is faster on this setup (a shorter chat template versus the fixed yes/no system prompt + user block — see reranker/backends/qwen3_vllm.py and qwen3_vllm_score.py).
  2. qwen3_vllm vs qwen3_vllm_score: at n=1000, qwen3_vllm + compact is the fastest row (~2162 ms mean) and qwen3_vllm_score + standard the slowest (~2932 ms). The ordering can change on other GPUs, vLLM versions, or batching settings.
  3. Repo / ops default (current): services.rerank.backend is usually qwen3_vllm_score, and the score block recommends instruction_format: compact (matching the backend code default). The qwen3_vllm block's instruction_format can be configured independently for the generate backend and need not match the score block.
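A config fragment matching that recommendation might look like this. This is a sketch only — the exact key layout under services.rerank (the backends sub-map in particular) is assumed from the names used in this document:

```yaml
services:
  rerank:
    backend: qwen3_vllm_score          # ops default (current)
    backends:
      qwen3_vllm_score:
        instruction_format: compact    # matches the backend code default
        max_model_len: 160
      qwen3_vllm:
        instruction_format: compact    # independent; may differ from score
```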

## Tooling added / changed

  • reranker/server.py: /health includes instruction_format when the active backend sets _instruction_format.
  • scripts/benchmark_reranker_random_titles.py: --tag, --json-summary-out, --quiet-runs.
  • scripts/patch_rerank_vllm_benchmark_config.py: surgical YAML patch (preserves newlines).
  • scripts/run_reranker_vllm_instruction_benchmark.sh: full matrix driver (continues if a benchmark exits non-zero; uses --timeout 360).

## Addendum: qwen3_vllm_score after attention auto-select (FLASHINFER on T4)

Do not replace the table above — it records the older qwen3_vllm_score behaviour (roughly: on sm<8 GPUs an attention_config / TRITON_ATTN was injected into vLLM, and the code default for instruction_format was still standard).

### What changed in code / ops

| Area | Before (baseline table) | After (this addendum) |
|---|---|---|
| Attention | Backend forced / steered attention on T4 (e.g. the TRITON_ATTN path) | No attention_config in LLM(...); vLLM auto-selects — on this T4 run, logs show FLASHINFER |
| Config surface | vllm_attention_backend / RERANK_VLLM_ATTENTION_BACKEND | Removed (fewer YAML / env-var branches; logic consolidated) |
| Code default instruction_format | qwen3_vllm_score defaulted to standard | Aligned with qwen3_vllm as compact (standard can still be set in YAML) |
| Smoke / startup | scripts/smoke_qwen3_vllm_score_backend.py | scripts/start_reranker.sh puts the venv bin on PATH (FLASHINFER JIT depends on the venv's ninja) |

Micro-benchmark (same machine, isolated): ~927.5 ms → ~673.1 ms at n=400 docs in LLM.score() steady state (~28% faster), after removing the forced attention path and letting vLLM pick FLASHINFER.
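Arithmetic check on the micro-benchmark figure (values from the sentence above):

```python
before_ms, after_ms = 927.5, 673.1
reduction_pct = (before_ms - after_ms) / before_ms * 100
# ~27.4% reduction, i.e. the ~28% figure above once rounded.
print(f"{reduction_pct:.1f}% faster")
```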

### Re-benchmark (HTTP POST /rerank, same methodology as §Methodology)

  • Purpose: Same comparison axis as the main table (qwen3_vllm_score only), after the FLASHINFER-friendly backend.
  • Controlled for max_model_len: services.rerank.backends.qwen3_vllm_score.max_model_len set to 160 for this run so numbers are comparable to the baseline rows (also 160). Production config.yaml may use a different value (e.g. 196); adjust YAML before repeating the benchmark if you need prod-shaped latency.
  • Seed / repeats: --seed 99, --repeat 5, same script and title file as §Methodology.
  • Artifacts: qwen3_vllm_score_compact_post_flashinfer_opt.json, qwen3_vllm_score_standard_post_flashinfer_opt.json.

### qwen3_vllm_score — mean latency (ms), post optimization

| instruction_format | n=100 | n=200 | n=400 | n=600 | n=800 | n=1000 | vs baseline, same row (approx.) |
|---|---|---|---|---|---|---|---|
| compact | 178.5 | 351.7 | 688.2 | 1024.0 | 1375.8 | 1752.4 | e.g. n=400 −28.8%, n=1000 −27.8% (vs 966.2 / 2428.4) |
| standard | 198.4 | 386.4 | 762.8 | 1174.6 | 1540.1 | 1970.1 | e.g. n=400 −35.3%, n=1000 −32.8% (vs 1178.9 / 2931.7) |
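The "vs baseline" deltas can be re-derived from the two tables (baseline values from the main results table, post-optimization values from the row above):

```python
# (baseline_ms, post_optimization_ms) pairs taken from the tables above.
cells = {
    ("compact", 400): (966.2, 688.2),
    ("compact", 1000): (2428.4, 1752.4),
    ("standard", 400): (1178.9, 762.8),
    ("standard", 1000): (2931.7, 1970.1),
}

for (fmt, n), (baseline, post) in sorted(cells.items()):
    delta_pct = (post - baseline) / baseline * 100
    print(f"{fmt} n={n}: {delta_pct:+.1f}%")
# compact: -28.8% / -27.8%; standard: -35.3% / -32.8%
```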

What changed for instruction_format: standard in this version: it shares the same vLLM attention auto-selection as compact, and TRITON_ATTN is no longer pinned on T4. Its prompt is still longer than compact's (fixed yes/no system prompt + the official prefix template), so absolute latency remains higher than compact, but the reduction relative to the old standard row is of the same magnitude as compact's (table above).

Takeaway: on T4 with the vLLM 0.18 score path, auto attention selection (FLASHINFER) plus the compact default brings qwen3_vllm_score much closer to the baseline qwen3_vllm timings; re-run the full 4-way matrix if you need refreshed qwen3_vllm rows on the same commit.