
Reranker 1000-doc Performance Report (2026-03-11)

Workload profile:

  • backend: qwen3_vllm (Qwen/Qwen3-Reranker-0.6B)
  • query: short e-commerce text
  • docs/request: 1000 short titles/title+brief
  • options: sort_by_doc_length=true
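
The workload above can be sketched as a request payload. This is a minimal illustration only: the field and option names (`documents`, `options`, `infer_batch_size`) are assumptions, not the benchmarked service's actual API schema.

```python
# Hypothetical shape of one benchmarked rerank request.
# Field/option names are illustrative assumptions.
def build_rerank_request(query: str, docs: list, infer_batch_size: int) -> dict:
    """Assemble one 1000-doc rerank request as described in the workload profile."""
    return {
        "model": "Qwen/Qwen3-Reranker-0.6B",
        "query": query,
        "documents": docs,
        "options": {
            "sort_by_doc_length": True,            # as in the workload profile
            "infer_batch_size": infer_batch_size,  # swept 24..96 in the results
        },
    }

payload = build_rerank_request(
    query="wireless earbuds noise cancelling",
    docs=["product title %d" % i for i in range(1000)],
    infer_batch_size=64,
)
```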

Results

| infer_batch_size | concurrency | requests | RPS | avg (ms) | p95 (ms) | p99 (ms) |
|---:|---:|---:|---:|---:|---:|---:|
| 24 | 1 | 4 | 0.34 | 2962.42 | 4758.59 | 4969.54 |
| 32 | 1 | 4 | 0.56 | 1756.63 | 2285.96 | 2389.21 |
| 48 | 1 | 4 | 0.63 | 1570.78 | 2111.8 | 2206.45 |
| 64 | 1 | 4 | 0.68 | 1448.87 | 2014.51 | 2122.44 |
| 80 | 1 | 3 | 0.64 | 1546.78 | 2091.47 | 2157.13 |
| 96 | 1 | 3 | 0.64 | 1534.44 | 2202.48 | 2288.99 |
| 96 | 1 | 4 | 0.52 | 1914.41 | 2215.05 | 2216.08 |
| 24 | 4 | 8 | 0.46 | 8614.9 | 9886.68 | 9887.0 |
| 32 | 4 | 8 | 0.62 | 6432.39 | 6594.11 | 6595.41 |
| 48 | 4 | 8 | 0.72 | 5451.5 | 5495.01 | 5500.29 |
| 64 | 4 | 8 | 0.76 | 5217.79 | 5329.15 | 5332.3 |
| 80 | 4 | 6 | 0.49 | 7069.9 | 9198.54 | 9208.77 |
| 96 | 4 | 6 | 0.76 | 4302.86 | 5139.19 | 5149.08 |
| 96 | 4 | 8 | 0.81 | 4852.78 | 5038.37 | 5058.76 |
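
The table's columns can be derived from raw per-request wall-clock latencies plus total run time. A minimal sketch follows; the benchmark script's exact percentile method is an assumption (nearest-rank is shown here).

```python
import math

def summarize(latencies_ms, wall_seconds):
    """Compute requests, RPS, avg, p95, p99 from per-request latencies (ms)."""
    xs = sorted(latencies_ms)

    def pct(p):
        # Nearest-rank percentile: smallest sample covering p% of the data.
        k = max(1, math.ceil(p / 100 * len(xs)))
        return xs[k - 1]

    return {
        "requests": len(xs),
        "rps": round(len(xs) / wall_seconds, 2),
        "avg_ms": round(sum(xs) / len(xs), 2),
        "p95_ms": pct(95),
        "p99_ms": pct(99),
    }

# Example with synthetic latencies (not taken from the table above):
stats = summarize([1400.0, 1500.0, 1450.0, 1450.0], wall_seconds=5.9)
```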

Decision

  • For latency-sensitive online rerank (a single request carrying 1000 docs), infer_batch_size=64 gives the best c=1 latency in this run set (avg 1448.87 ms).
  • At higher concurrency (c=4), infer_batch_size=96 yields slightly higher throughput (0.81 vs 0.76 RPS over 8 requests), but noticeably degrades c=1 latency (avg 1914.41 vs 1448.87 ms).
  • The default stays at infer_batch_size=64 as the balanced, safer baseline for mixed traffic.
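
The decision above might be captured in a settings fragment like the following. The key names are illustrative assumptions, not the service's actual config schema.

```python
# Hypothetical settings fragment reflecting the decision above.
# Key names are illustrative, not the actual config schema.
RERANKER_SETTINGS = {
    "backend": "qwen3_vllm",
    "model": "Qwen/Qwen3-Reranker-0.6B",
    "infer_batch_size": 64,        # balanced default for mixed c=1 / c=4 traffic
    "sort_by_doc_length": True,
}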

Reproduce

./scripts/benchmark_reranker_1000docs.sh