# Reranker benchmark: qwen3_vllm vs qwen3_vllm_score × instruction_format

- Date: 2026-03-25
- Host: single GPU (Tesla T4, ~16 GiB), CUDA 12.8 (per `nvidia-smi` during the run).
## Configuration (from `config/config.yaml`)

Shared across both backends for this run:
| Key | Value |
|---|---|
| `model_name` | Qwen/Qwen3-Reranker-0.6B |
| `max_model_len` | 160 |
| `infer_batch_size` | 100 |
| `sort_by_doc_length` | true |
| `enable_prefix_caching` | true |
| `enforce_eager` | false |
| `dtype` | float16 |
| `tensor_parallel_size` | 1 |
| `gpu_memory_utilization` | 0.20 |
| `instruction` | Rank products by query with category & style match prioritized |
- `qwen3_vllm` uses vLLM generate + logprobs (`.venv-reranker`).
- `qwen3_vllm_score` uses vLLM `LLM.score()` (`.venv-reranker-score`, pinned vLLM stack per `reranker/README.md`).
## Methodology
- Script: `python scripts/benchmark_reranker_random_titles.py 100,200,400,600,800,1000 --repeat 5` with `--seed 99` (see note below), `--quiet-runs`, `--timeout 360`.
- Titles: default file `/home/ubuntu/rerank_test/titles.1.8w` (one title per line).
- Query: default `健身女生T恤短袖` ("women's short-sleeve fitness T-shirt").
- Each scenario: 3 warm-up requests at `n=400` (not timed), then 5 timed runs per `n`.
- Metric: client wall time for `POST /rerank` (localhost), in milliseconds.
- After each `services.rerank.backend` / `instruction_format` change: `./restart.sh reranker`, then `GET /health` until `backend` and `instruction_format` matched the intended scenario (`reranker/server.py` was extended to expose `instruction_format` when the backend defines `_instruction_format`).
Note on RNG seed: with `--seed 42`, some runs occasionally lost one sample at `n=600` (non-200 response or transport error). All figures below therefore use `--seed 99`, so every cell has 5/5 successful runs over comparable sampled titles.
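The warm-up/timed-run loop above can be sketched as follows. This is illustrative only: `timed_runs` and the stubbed `send_request` are hypothetical names, not the benchmark script's actual API.

```python
import statistics
import time

def timed_runs(send_request, n_titles, warmup=3, repeat=5):
    """Client-side wall-time measurement as described in the methodology:
    warm-up requests are issued but not timed, then `repeat` timed runs.
    Returns per-run latencies in milliseconds plus summary stats."""
    for _ in range(warmup):
        send_request(n_titles)          # not timed
    values_ms = []
    for _ in range(repeat):
        t0 = time.perf_counter()
        send_request(n_titles)          # in the real script: POST /rerank
        values_ms.append((time.perf_counter() - t0) * 1000.0)
    return {
        "n": n_titles,
        "mean_ms": statistics.mean(values_ms),
        "stdev_ms": statistics.stdev(values_ms),
        "values_ms": values_ms,
    }

# Stand-in for the real HTTP call against localhost (stubbed here so the
# sketch runs without a server).
summary = timed_runs(lambda n: time.sleep(0.001), n_titles=400)
print(len(summary["values_ms"]))
```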
## Raw artifacts

JSON aggregates (means, stdev, raw `values_ms`): same directory, `qwen3_vllm_{compact,standard}.json`, `qwen3_vllm_score_{compact,standard}.json`.
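A minimal sketch of consuming those aggregates, assuming each cell carries its raw per-run latencies under a `values_ms` key; the exact JSON schema is an assumption, and the numbers below are illustrative, not real run data.

```python
import json
import statistics

# Hypothetical aggregate shaped like the benchmark's JSON output:
# one entry per n, raw per-run latencies under "values_ms".
sample = json.loads("""
{"backend": "qwen3_vllm", "instruction_format": "compact",
 "runs": {"1000": {"values_ms": [2101.0, 2158.3, 2180.9, 2149.2, 2221.6]}}}
""")

# Recompute mean/stdev from the raw values rather than trusting
# pre-computed fields.
for n, cell in sample["runs"].items():
    vals = cell["values_ms"]
    print(n, round(statistics.mean(vals), 1), round(statistics.stdev(vals), 1))
```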
## Results — mean latency (ms)
| backend | instruction_format | n=100 | n=200 | n=400 | n=600 | n=800 | n=1000 |
|---|---|---|---|---|---|---|---|
| qwen3_vllm | compact | 213.5 | 418.0 | 861.4 | 1263.4 | 1744.3 | 2162.2 |
| qwen3_vllm | standard | 254.9 | 475.4 | 909.7 | 1353.2 | 1912.5 | 2406.7 |
| qwen3_vllm_score | compact | 239.2 | 480.2 | 966.2 | 1433.5 | 1937.2 | 2428.4 |
| qwen3_vllm_score | standard | 299.6 | 591.8 | 1178.9 | 1773.7 | 2341.6 | 2931.7 |
## Short interpretation
- `compact` vs `standard`: for both backends, `compact` is faster on this setup (shorter / different chat template vs the fixed yes/no system prompt + user block — see `reranker/backends/qwen3_vllm.py` / `qwen3_vllm_score.py`).
- `qwen3_vllm` vs `qwen3_vllm_score`: at `n=1000`, `qwen3_vllm` + `compact` is the fastest row (~2162 ms mean); `qwen3_vllm_score` + `standard` is the slowest (~2932 ms). Ordering can change on other GPUs, vLLM versions, or batching settings.
- Repo default after the tests: `services.rerank.backend: qwen3_vllm_score`, with `instruction_format: compact` on both the `qwen3_vllm` and `qwen3_vllm_score` blocks (the patch script keeps them aligned).
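The fastest/slowest claims can be rechecked directly from the `n=1000` column of the results table:

```python
# Mean latencies at n=1000, copied from the results table above.
mean_ms = {
    ("qwen3_vllm", "compact"): 2162.2,
    ("qwen3_vllm", "standard"): 2406.7,
    ("qwen3_vllm_score", "compact"): 2428.4,
    ("qwen3_vllm_score", "standard"): 2931.7,
}
fastest = min(mean_ms, key=mean_ms.get)
slowest = max(mean_ms, key=mean_ms.get)

# Slowdown of each row relative to the fastest one.
for key, v in mean_ms.items():
    print(key, f"{v / mean_ms[fastest]:.2f}x")
# (qwen3_vllm, standard) comes out at 1.11x, (qwen3_vllm_score, compact)
# at 1.12x, and (qwen3_vllm_score, standard) at 1.36x the fastest row.
```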
## Tooling added / changed

- `reranker/server.py`: `/health` includes `instruction_format` when the active backend sets `_instruction_format`.
- `scripts/benchmark_reranker_random_titles.py`: new `--tag`, `--json-summary-out`, `--quiet-runs` options.
- `scripts/patch_rerank_vllm_benchmark_config.py`: surgical YAML patch (preserves newlines).
- `scripts/run_reranker_vllm_instruction_benchmark.sh`: full matrix driver (continues if a benchmark exits non-zero; uses `--timeout 360`).
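The restart-then-poll gate from the methodology (wait until `/health` reports the intended backend and instruction format) can be sketched like this; `wait_for_scenario` and the stubbed `/health` payload are illustrative, not the server's exact response shape.

```python
import json
import time

def wait_for_scenario(fetch_health, backend, instruction_format,
                      timeout_s=60.0, poll_s=0.5):
    """Poll the health endpoint until it reports the intended scenario.
    `fetch_health` returns the raw /health JSON body as a string;
    in practice it would do an HTTP GET against localhost."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        health = json.loads(fetch_health())
        if (health.get("backend") == backend
                and health.get("instruction_format") == instruction_format):
            return health
        time.sleep(poll_s)
    raise TimeoutError("reranker did not reach the intended scenario")

# Stubbed /health response so the sketch runs without a live server.
stub = lambda: '{"backend": "qwen3_vllm_score", "instruction_format": "compact"}'
health = wait_for_scenario(stub, "qwen3_vllm_score", "compact")
print(health["backend"], health["instruction_format"])
```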