
# Reranker benchmark: qwen3_vllm vs qwen3_vllm_score × instruction_format

Date: 2026-03-25
Host: single GPU (Tesla T4, ~16 GiB), CUDA 12.8 (per nvidia-smi output during the run).

## Configuration (from config/config.yaml)

Shared across both backends for this run:

| Key | Value |
|---|---|
| model_name | Qwen/Qwen3-Reranker-0.6B |
| max_model_len | 160 |
| infer_batch_size | 100 |
| sort_by_doc_length | true |
| enable_prefix_caching | true |
| enforce_eager | false |
| dtype | float16 |
| tensor_parallel_size | 1 |
| gpu_memory_utilization | 0.20 |
| instruction | Rank products by query with category & style match prioritized |

qwen3_vllm uses vLLM generate + logprobs (.venv-reranker).
qwen3_vllm_score uses vLLM LLM.score() (.venv-reranker-score, pinned vLLM stack per reranker/README.md).

## Methodology

  • Script: python scripts/benchmark_reranker_random_titles.py 100,200,400,600,800,1000 --repeat 5 --seed 99 (see the RNG note below), with --quiet-runs and --timeout 360.
  • Titles: default file /home/ubuntu/rerank_test/titles.1.8w (one title per line).
  • Query: the default, 健身女生T恤短袖 ("women's fitness short-sleeve T-shirt").
  • Each scenario: 3 warm-up requests at n=400 (not timed), then 5 timed runs per n.
  • Metric: client-side wall time for POST /rerank (localhost), in milliseconds.
  • After each change to services.rerank.backend or instruction_format: ./restart.sh reranker, then poll GET /health until backend and instruction_format match the intended scenario (reranker/server.py was extended to expose instruction_format whenever the backend defines _instruction_format).
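The restart-then-poll step can be sketched as follows. This is a minimal sketch: the /health JSON field names (`backend`, `instruction_format`) are assumed from the description above, and the fetcher is injectable so the loop can be exercised without a live service.

```python
import json
import time
import urllib.request


def wait_for_scenario(base_url, backend, instruction_format,
                      timeout_s=120.0, fetch=None):
    """Poll GET /health until it reports the intended backend and
    instruction_format, or raise TimeoutError.

    `fetch` is injectable for testing; by default it does a real HTTP GET.
    """
    if fetch is None:
        def fetch():
            with urllib.request.urlopen(base_url + "/health", timeout=5) as r:
                return json.load(r)

    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            health = fetch()
        except OSError:
            health = {}  # service still restarting; treat as "not ready"
        if (health.get("backend") == backend
                and health.get("instruction_format") == instruction_format):
            return health
        time.sleep(1.0)
    raise TimeoutError(f"/health never reported {backend}/{instruction_format}")
```

Injecting a fake fetcher also makes the scenario switch scriptable from the benchmark driver without shelling out to curl.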

Note on the RNG seed: with --seed 42, some runs occasionally lost one sample at n=600 (a non-200 response or transport error). All figures below use --seed 99, so every cell has 5/5 successful runs and comparable sampled titles.

## Raw artifacts

JSON aggregates (means, stdev, raw values_ms): in the same directory — qwen3_vllm_{compact,standard}.json and qwen3_vllm_score_{compact,standard}.json.
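The per-cell aggregates can be reproduced from the raw values_ms arrays. A minimal sketch (the exact JSON field names in the artifacts are as listed above; the sample values here are illustrative):

```python
import statistics


def summarize(values_ms):
    """Aggregate one benchmark cell: raw per-run wall times in ms."""
    return {
        "runs": len(values_ms),
        "mean_ms": statistics.fmean(values_ms),
        "stdev_ms": statistics.stdev(values_ms) if len(values_ms) > 1 else 0.0,
    }


# Example: five timed runs for one (backend, instruction_format, n) cell.
cell = summarize([860.1, 859.7, 862.9, 861.5, 862.8])
print(round(cell["mean_ms"], 1))  # prints 861.4
```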

## Results — mean latency (ms)

| backend | instruction_format | n=100 | n=200 | n=400 | n=600 | n=800 | n=1000 |
|---|---|---|---|---|---|---|---|
| qwen3_vllm | compact | 213.5 | 418.0 | 861.4 | 1263.4 | 1744.3 | 2162.2 |
| qwen3_vllm | standard | 254.9 | 475.4 | 909.7 | 1353.2 | 1912.5 | 2406.7 |
| qwen3_vllm_score | compact | 239.2 | 480.2 | 966.2 | 1433.5 | 1937.2 | 2428.4 |
| qwen3_vllm_score | standard | 299.6 | 591.8 | 1178.9 | 1773.7 | 2341.6 | 2931.7 |
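From the table, the standard-vs-compact overhead per backend can be computed directly (a quick arithmetic check; values copied from the n=1000 column above):

```python
# Mean latency at n=1000 from the results table above (ms).
rows = {
    ("qwen3_vllm", "compact"): 2162.2,
    ("qwen3_vllm", "standard"): 2406.7,
    ("qwen3_vllm_score", "compact"): 2428.4,
    ("qwen3_vllm_score", "standard"): 2931.7,
}

for backend in ("qwen3_vllm", "qwen3_vllm_score"):
    compact = rows[(backend, "compact")]
    standard = rows[(backend, "standard")]
    overhead_pct = (standard - compact) / compact * 100
    print(f"{backend}: standard is {overhead_pct:.1f}% slower at n=1000")
# qwen3_vllm: +11.3%; qwen3_vllm_score: +20.7%
```

The overhead is roughly twice as large on the score backend, consistent with interpretation point 1 below.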

## Short interpretation

  1. compact vs standard: for both backends, compact is faster on this setup (a shorter chat template versus the fixed yes/no system prompt + user block — see reranker/backends/qwen3_vllm.py and qwen3_vllm_score.py).
  2. qwen3_vllm vs qwen3_vllm_score: at n=1000, qwen3_vllm + compact is the fastest row (~2162 ms mean) and qwen3_vllm_score + standard the slowest (~2932 ms). The ordering can change on other GPUs, vLLM versions, or batching settings.
  3. Repo / ops default (current): services.rerank.backend is usually qwen3_vllm_score, and the score block recommends instruction_format: compact (matching the backend code default). The qwen3_vllm block's instruction_format can be configured independently for the generate backend and need not match the score block.
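A config fragment matching that recommendation might look like this. This is a sketch only — the exact key layout under services.rerank (the backends sub-map in particular) is assumed from the names used in this document:

```yaml
services:
  rerank:
    backend: qwen3_vllm_score          # ops default (current)
    backends:
      qwen3_vllm_score:
        instruction_format: compact    # matches the backend code default
        max_model_len: 160
      qwen3_vllm:
        instruction_format: compact    # independent; may differ from score
```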

## Tooling added / changed

  • reranker/server.py: /health includes instruction_format when the active backend sets _instruction_format.
  • scripts/benchmark_reranker_random_titles.py: --tag, --json-summary-out, --quiet-runs.
  • scripts/patch_rerank_vllm_benchmark_config.py: surgical YAML patch (preserves newlines).
  • scripts/run_reranker_vllm_instruction_benchmark.sh: full matrix driver (continues if a benchmark exits non-zero; uses --timeout 360).

## Addendum: qwen3_vllm_score after attention auto-select (FLASHINFER on T4)

Do not replace the table above — it records the older qwen3_vllm_score behaviour (roughly: on sm<8 GPUs an attention_config / TRITON_ATTN was injected into vLLM, and the code default for instruction_format was still standard).

### What changed in code / ops

| Area | Before (baseline table) | After (this addendum) |
|---|---|---|
| Attention | Backend forced / steered attention on T4 (e.g. the TRITON_ATTN path) | No attention_config in LLM(...); vLLM auto-selects — on this T4 run, logs show FLASHINFER |
| Config surface | vllm_attention_backend / RERANK_VLLM_ATTENTION_BACKEND | Removed (fewer YAML / env-var branches; logic consolidated) |
| Code default instruction_format | qwen3_vllm_score defaulted to standard | Aligned with qwen3_vllm as compact (standard can still be set in YAML) |
| Smoke / startup | scripts/smoke_qwen3_vllm_score_backend.py | scripts/start_reranker.sh puts the venv bin on PATH (FLASHINFER JIT depends on the venv's ninja) |

Micro-benchmark (same machine, isolated): ~927.5 ms → ~673.1 ms at n=400 docs in LLM.score() steady state (~28% faster), after removing the forced attention path and letting vLLM pick FLASHINFER.
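Arithmetic check on the micro-benchmark figure (values from the sentence above):

```python
before_ms, after_ms = 927.5, 673.1
reduction_pct = (before_ms - after_ms) / before_ms * 100
# ~27.4% reduction, i.e. the ~28% figure above once rounded.
print(f"{reduction_pct:.1f}% faster")
```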

### Re-benchmark (HTTP POST /rerank, same methodology as §Methodology)

  • Purpose: Same comparison axis as the main table (qwen3_vllm_score only), after the FLASHINFER-friendly backend.
  • Controlled for max_model_len: services.rerank.backends.qwen3_vllm_score.max_model_len set to 160 for this run so numbers are comparable to the baseline rows (also 160). Production config.yaml may use a different value (e.g. 196); adjust YAML before repeating the benchmark if you need prod-shaped latency.
  • Seed / repeats: --seed 99, --repeat 5, same script and title file as §Methodology.
  • Artifacts: qwen3_vllm_score_compact_post_flashinfer_opt.json, qwen3_vllm_score_standard_post_flashinfer_opt.json.

### qwen3_vllm_score — mean latency (ms), post optimization

| instruction_format | n=100 | n=200 | n=400 | n=600 | n=800 | n=1000 | vs baseline, same row (approx.) |
|---|---|---|---|---|---|---|---|
| compact | 178.5 | 351.7 | 688.2 | 1024.0 | 1375.8 | 1752.4 | e.g. n=400 −28.8%, n=1000 −27.8% (vs 966.2 / 2428.4) |
| standard | 198.4 | 386.4 | 762.8 | 1174.6 | 1540.1 | 1970.1 | e.g. n=400 −35.3%, n=1000 −32.8% (vs 1178.9 / 2931.7) |
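The "vs baseline" deltas can be re-derived from the two tables (baseline values from the main results table, post-optimization values from the row above):

```python
# (baseline_ms, post_optimization_ms) pairs taken from the tables above.
cells = {
    ("compact", 400): (966.2, 688.2),
    ("compact", 1000): (2428.4, 1752.4),
    ("standard", 400): (1178.9, 762.8),
    ("standard", 1000): (2931.7, 1970.1),
}

for (fmt, n), (baseline, post) in sorted(cells.items()):
    delta_pct = (post - baseline) / baseline * 100
    print(f"{fmt} n={n}: {delta_pct:+.1f}%")
# compact: -28.8% / -27.8%; standard: -35.3% / -32.8%
```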

What changed for instruction_format: standard in this version: it shares the same vLLM attention auto-selection as compact, and TRITON_ATTN is no longer pinned on T4. Its prompt is still longer than compact's (fixed yes/no system prompt + the official prefix template), so absolute latency remains higher than compact, but the reduction relative to the old standard row is of the same magnitude as compact's (table above).

Takeaway: on T4 with the vLLM 0.18 score path, auto attention selection (FLASHINFER) plus the compact default brings qwen3_vllm_score much closer to the baseline qwen3_vllm timings; re-run the full 4-way matrix if you need refreshed qwen3_vllm rows on the same commit.