Reranker 1000-doc Performance Report (2026-03-11)
Workload profile:
- backend: `qwen3_vllm` (`Qwen/Qwen3-Reranker-0.6B`)
- query: short e-commerce text (
- docs/request: 1000 short titles/title+brief
- options: `sort_by_doc_length=true`
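
As a rough illustration of what the `sort_by_doc_length=true` option could mean in combination with `infer_batch_size`, the sketch below groups documents of similar length into fixed-size batches and then maps scores back to the original order. This report does not show the backend's actual implementation, so the function names (`length_sorted_batches`, `restore_order`) and the use of character length as a stand-in for token length are assumptions made for illustration only.

```python
from typing import List, Sequence


def length_sorted_batches(docs: Sequence[str], batch_size: int) -> List[List[int]]:
    """Return batches of document indices, ordered by ascending length.

    Hypothetical helper: character length is used as a cheap proxy for
    token length; the real backend may sort differently.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    order = sorted(range(len(docs)), key=lambda i: len(docs[i]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def restore_order(
    batches: Sequence[Sequence[int]],
    batch_scores: Sequence[Sequence[float]],
    n_docs: int,
) -> List[float]:
    """Scatter per-batch scores back to the original document positions."""
    scores = [0.0] * n_docs
    for indices, values in zip(batches, batch_scores):
        for idx, value in zip(indices, values):
            scores[idx] = value
    return scores


if __name__ == "__main__":
    docs = ["aaaa", "a", "aaa", "aa"]
    batches = length_sorted_batches(docs, batch_size=2)
    # Stand-in for model inference: score each document by its length.
    fake_scores = [[float(len(docs[i])) for i in batch] for batch in batches]
    print(batches)
    print(restore_order(batches, fake_scores, len(docs)))
```

Under this scheme, a 1000-document request with `infer_batch_size=64` would be split into 16 batches, the last one partially filled.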
Results
| infer_batch_size | concurrency | requests | rps | avg_ms | p95_ms | p99_ms |
|---|---|---|---|---|---|---|
| 24 | 1 | 4 | 0.34 | 2962.42 | 4758.59 | 4969.54 |
| 32 | 1 | 4 | 0.56 | 1756.63 | 2285.96 | 2389.21 |
| 48 | 1 | 4 | 0.63 | 1570.78 | 2111.8 | 2206.45 |
| 64 | 1 | 4 | 0.68 | 1448.87 | 2014.51 | 2122.44 |
| 80 | 1 | 3 | 0.64 | 1546.78 | 2091.47 | 2157.13 |
| 96 | 1 | 3 | 0.64 | 1534.44 | 2202.48 | 2288.99 |
| 96 | 1 | 4 | 0.52 | 1914.41 | 2215.05 | 2216.08 |
| 24 | 4 | 8 | 0.46 | 8614.9 | 9886.68 | 9887.0 |
| 32 | 4 | 8 | 0.62 | 6432.39 | 6594.11 | 6595.41 |
| 48 | 4 | 8 | 0.72 | 5451.5 | 5495.01 | 5500.29 |
| 64 | 4 | 8 | 0.76 | 5217.79 | 5329.15 | 5332.3 |
| 80 | 4 | 6 | 0.49 | 7069.9 | 9198.54 | 9208.77 |
| 96 | 4 | 6 | 0.76 | 4302.86 | 5139.19 | 5149.08 |
| 96 | 4 | 8 | 0.81 | 4852.78 | 5038.37 | 5058.76 |
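
One way to sanity-check the rows above is to compare the reported `rps` with what a closed-loop client would be expected to achieve, roughly `concurrency / avg latency`. The sketch below assumes the benchmark keeps exactly `concurrency` requests in flight at all times, which this report does not state; `expected_rps` is a hypothetical helper, not part of the benchmark script. Runs with few requests will deviate because the tail of the run executes at lower concurrency.

```python
def expected_rps(concurrency: int, avg_ms: float) -> float:
    """Approximate throughput of a closed-loop client (assumption):
    each of `concurrency` workers completes one request per avg_ms."""
    return concurrency / (avg_ms / 1000.0)


if __name__ == "__main__":
    # (infer_batch_size, concurrency, reported rps, avg_ms) copied from the table.
    rows = [
        (64, 1, 0.68, 1448.87),
        (64, 4, 0.76, 5217.79),
        (96, 4, 0.81, 4852.78),
    ]
    for batch, conc, rps, avg_ms in rows:
        est = expected_rps(conc, avg_ms)
        print(f"batch={batch} c={conc} reported={rps:.2f} estimated={est:.2f}")
```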
Decision
- For latency-sensitive online rerank (single request with 1000 docs), `infer_batch_size=64` gives the best c=1 latency in this run set (avg 1448.87 ms, 0.68 rps).
- For higher concurrency (c=4), `infer_batch_size=96` has slightly higher throughput (0.81 rps vs. 0.76 rps over 8 requests), but degrades c=1 latency noticeably (avg 1914.41 ms vs. 1448.87 ms over 4 requests).
- Default kept at `infer_batch_size=64` as a balanced/safer baseline for mixed traffic.
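
The selection above can be reproduced from the table with a few lines of Python. This is a minimal sketch: the `Run` tuple and the helper names are hypothetical, the rows are copied from the Results table, and "best" is taken to mean lowest `avg_ms` for latency and highest `rps` for throughput.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    infer_batch_size: int
    concurrency: int
    requests: int
    rps: float
    avg_ms: float
    p95_ms: float
    p99_ms: float


# Rows copied verbatim from the Results table.
RUNS: List[Run] = [
    Run(24, 1, 4, 0.34, 2962.42, 4758.59, 4969.54),
    Run(32, 1, 4, 0.56, 1756.63, 2285.96, 2389.21),
    Run(48, 1, 4, 0.63, 1570.78, 2111.8, 2206.45),
    Run(64, 1, 4, 0.68, 1448.87, 2014.51, 2122.44),
    Run(80, 1, 3, 0.64, 1546.78, 2091.47, 2157.13),
    Run(96, 1, 3, 0.64, 1534.44, 2202.48, 2288.99),
    Run(96, 1, 4, 0.52, 1914.41, 2215.05, 2216.08),
    Run(24, 4, 8, 0.46, 8614.9, 9886.68, 9887.0),
    Run(32, 4, 8, 0.62, 6432.39, 6594.11, 6595.41),
    Run(48, 4, 8, 0.72, 5451.5, 5495.01, 5500.29),
    Run(64, 4, 8, 0.76, 5217.79, 5329.15, 5332.3),
    Run(80, 4, 6, 0.49, 7069.9, 9198.54, 9208.77),
    Run(96, 4, 6, 0.76, 4302.86, 5139.19, 5149.08),
    Run(96, 4, 8, 0.81, 4852.78, 5038.37, 5058.76),
]


def best_latency(runs: List[Run], concurrency: int) -> Run:
    """Run with the lowest average latency at the given concurrency."""
    return min((r for r in runs if r.concurrency == concurrency),
               key=lambda r: r.avg_ms)


def best_throughput(runs: List[Run], concurrency: int) -> Run:
    """Run with the highest requests-per-second at the given concurrency."""
    return max((r for r in runs if r.concurrency == concurrency),
               key=lambda r: r.rps)


if __name__ == "__main__":
    print("c=1 best latency:", best_latency(RUNS, 1).infer_batch_size)
    print("c=4 best throughput:", best_throughput(RUNS, 4).infer_batch_size)
```

Each configuration was measured over only 3 to 8 requests, so small differences between neighbouring batch sizes may be within run-to-run noise.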
Reproduce
./scripts/benchmark_reranker_1000docs.sh