RESULTS.md

Reranker benchmark: qwen3_vllm vs qwen3_vllm_score × instruction_format

Date: 2026-03-25
Host: single GPU (Tesla T4, ~16 GiB), CUDA 12.8 (see nvidia-smi during run).

Configuration (from config/config.yaml)

Shared across both backends for this run:

| Key | Value |
| --- | --- |
| model_name | Qwen/Qwen3-Reranker-0.6B |
| max_model_len | 160 |
| infer_batch_size | 100 |
| sort_by_doc_length | true |
| enable_prefix_caching | true |
| enforce_eager | false |
| dtype | float16 |
| tensor_parallel_size | 1 |
| gpu_memory_utilization | 0.20 |
| instruction | Rank products by query with category & style match prioritized |
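For reference, the shared settings above map onto a config fragment along these lines (a sketch only; the exact nesting under a services.rerank block in config/config.yaml is an assumption):

```yaml
# Sketch of the shared reranker settings; key names are from the table
# above, the surrounding structure is assumed.
services:
  rerank:
    model_name: Qwen/Qwen3-Reranker-0.6B
    max_model_len: 160
    infer_batch_size: 100
    sort_by_doc_length: true
    enable_prefix_caching: true
    enforce_eager: false
    dtype: float16
    tensor_parallel_size: 1
    gpu_memory_utilization: 0.20
    instruction: Rank products by query with category & style match prioritized
```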

qwen3_vllm uses vLLM generate + logprobs (.venv-reranker).
qwen3_vllm_score uses vLLM LLM.score() (.venv-reranker-score, pinned vLLM stack per reranker/README.md).

Methodology

  • Script: python scripts/benchmark_reranker_random_titles.py 100,200,400,600,800,1000 --repeat 5 --seed 99 --quiet-runs --timeout 360 (see the RNG seed note below).
  • Titles: default file /home/ubuntu/rerank_test/titles.1.8w (one title per line).
  • Query: default 健身女生T恤短袖 (roughly: "women's fitness short-sleeve T-shirt").
  • Each scenario: 3 warm-up requests at n=400 (not timed), then 5 timed runs per n.
  • Metric: client wall time for POST /rerank (localhost), milliseconds.
  • After each change to services.rerank.backend or instruction_format: run ./restart.sh reranker, then poll GET /health until the reported backend and instruction_format match the intended scenario (reranker/server.py was extended to expose instruction_format when the backend defines _instruction_format).

Note on RNG seed: With --seed 42, some runs occasionally lost one sample at n=600 (non-200 or transport error). All figures below use --seed 99 so every cell has 5/5 successful runs and comparable sampled titles.
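The timed part of the methodology can be sketched as a small helper (hypothetical names; the real script also does its warm-ups once at n=400 rather than per scenario, and posts to /rerank over HTTP):

```python
import statistics
import time

def time_scenario(send_rerank, n, warmups=3, repeats=5):
    """Time repeated rerank requests the way the benchmark does.

    send_rerank(n) is a caller-supplied function standing in for the
    POST /rerank call with n sampled titles; it returns once the
    response arrives. This is a sketch, not the script's actual code.
    """
    for _ in range(warmups):
        send_rerank(n)  # warm-up requests, not timed
    values_ms = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        send_rerank(n)
        # Metric: client wall time in milliseconds
        values_ms.append((time.perf_counter() - t0) * 1000.0)
    return {
        "n": n,
        "mean_ms": statistics.mean(values_ms),
        "stdev_ms": statistics.stdev(values_ms),
        "values_ms": values_ms,
    }
```

The returned dict mirrors the fields reported in the JSON aggregates (means, stdev, raw values_ms).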

Raw artifacts

JSON aggregates (means, stdev, raw values_ms): same directory, qwen3_vllm_{compact,standard}.json, qwen3_vllm_score_{compact,standard}.json.

Results — mean latency (ms)

| backend | instruction_format | n=100 | n=200 | n=400 | n=600 | n=800 | n=1000 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3_vllm | compact | 213.5 | 418.0 | 861.4 | 1263.4 | 1744.3 | 2162.2 |
| qwen3_vllm | standard | 254.9 | 475.4 | 909.7 | 1353.2 | 1912.5 | 2406.7 |
| qwen3_vllm_score | compact | 239.2 | 480.2 | 966.2 | 1433.5 | 1937.2 | 2428.4 |
| qwen3_vllm_score | standard | 299.6 | 591.8 | 1178.9 | 1773.7 | 2341.6 | 2931.7 |
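The table is easier to compare as relative slowdowns; a quick check at n=1000 (means copied from the table above):

```python
# Mean latency at n=1000 (ms), copied from the results table.
mean_ms_n1000 = {
    ("qwen3_vllm", "compact"): 2162.2,
    ("qwen3_vllm", "standard"): 2406.7,
    ("qwen3_vllm_score", "compact"): 2428.4,
    ("qwen3_vllm_score", "standard"): 2931.7,
}

def overhead(slower, faster):
    """Relative slowdown of one scenario vs another at n=1000."""
    return mean_ms_n1000[slower] / mean_ms_n1000[faster]

# standard vs compact within each backend
vllm_std_vs_compact = overhead(("qwen3_vllm", "standard"),
                               ("qwen3_vllm", "compact"))
score_std_vs_compact = overhead(("qwen3_vllm_score", "standard"),
                                ("qwen3_vllm_score", "compact"))
# slowest row vs fastest row
worst_vs_best = overhead(("qwen3_vllm_score", "standard"),
                         ("qwen3_vllm", "compact"))
```

On these numbers, standard costs roughly 11% extra for qwen3_vllm and about 21% extra for qwen3_vllm_score at n=1000, and the slowest row is about 1.36x the fastest.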

Short interpretation

  1. compact vs standard: compact is faster for both backends on this setup; its chat template is shorter than the fixed yes/no system prompt plus user block that standard uses (see reranker/backends/qwen3_vllm.py and qwen3_vllm_score.py).
  2. qwen3_vllm vs qwen3_vllm_score: At n=1000, qwen3_vllm + compact is the fastest row (~2162 ms mean) and qwen3_vllm_score + standard the slowest (~2932 ms). The ordering can change on other GPUs, vLLM versions, or batching settings.
  3. Repo default after these tests: services.rerank.backend: qwen3_vllm_score, with instruction_format: compact set on both the qwen3_vllm and qwen3_vllm_score config blocks (the patch script keeps them aligned).

Tooling added / changed

  • reranker/server.py: /health includes instruction_format when the active backend sets _instruction_format.
  • scripts/benchmark_reranker_random_titles.py: --tag, --json-summary-out, --quiet-runs.
  • scripts/patch_rerank_vllm_benchmark_config.py: surgical YAML patch (preserves newlines).
  • scripts/run_reranker_vllm_instruction_benchmark.sh: full matrix driver (continues if a benchmark exits non-zero; uses --timeout 360).
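The /health change in reranker/server.py can be pictured as follows (a minimal sketch with hypothetical names; only the instruction_format field and the _instruction_format attribute come from the description above, the rest of the payload shape is assumed):

```python
def health_payload(backend) -> dict:
    """Build a /health response body for the active reranker backend.

    Sketch only: mirrors the described behavior where the field is
    included only when the backend defines _instruction_format.
    """
    payload = {"status": "ok", "backend": type(backend).__name__}
    fmt = getattr(backend, "_instruction_format", None)
    if fmt is not None:
        # Backends without _instruction_format simply omit the field.
        payload["instruction_format"] = fmt
    return payload
```

The benchmark driver polls this endpoint after each restart to confirm the intended backend/instruction_format pair is actually live before timing anything.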