# Reranker 1000-doc Performance Report (2026-03-11)

Workload profile:

- backend: `qwen3_vllm` (`Qwen/Qwen3-Reranker-0.6B`)
- query: short e-commerce text (<100 tokens)
- docs/request: 1000 short titles or title+brief
- options: `sort_by_doc_length=true`

## Results

| infer_batch_size | concurrency | requests | rps | avg_ms | p95_ms | p99_ms |
|---:|---:|---:|---:|---:|---:|---:|
| 24 | 1 | 4 | 0.34 | 2962.42 | 4758.59 | 4969.54 |
| 32 | 1 | 4 | 0.56 | 1756.63 | 2285.96 | 2389.21 |
| 48 | 1 | 4 | 0.63 | 1570.78 | 2111.8 | 2206.45 |
| 64 | 1 | 4 | 0.68 | 1448.87 | 2014.51 | 2122.44 |
| 80 | 1 | 3 | 0.64 | 1546.78 | 2091.47 | 2157.13 |
| 96 | 1 | 3 | 0.64 | 1534.44 | 2202.48 | 2288.99 |
| 96 | 1 | 4 | 0.52 | 1914.41 | 2215.05 | 2216.08 |
| 24 | 4 | 8 | 0.46 | 8614.9 | 9886.68 | 9887.0 |
| 32 | 4 | 8 | 0.62 | 6432.39 | 6594.11 | 6595.41 |
| 48 | 4 | 8 | 0.72 | 5451.5 | 5495.01 | 5500.29 |
| 64 | 4 | 8 | 0.76 | 5217.79 | 5329.15 | 5332.3 |
| 80 | 4 | 6 | 0.49 | 7069.9 | 9198.54 | 9208.77 |
| 96 | 4 | 6 | 0.76 | 4302.86 | 5139.19 | 5149.08 |
| 96 | 4 | 8 | 0.81 | 4852.78 | 5038.37 | 5058.76 |

## Decision

- For latency-sensitive online rerank (single request with 1000 docs), `infer_batch_size=64` gives the best c=1 latency in this run set (avg 1448.87 ms, p95 2014.51 ms).
- For higher concurrency (c=4), `infer_batch_size=96` has slightly higher throughput (0.76 to 0.81 rps versus 0.76 rps at 64), but its c=1 latency is noticeably worse (avg 1534.44 to 1914.41 ms versus 1448.87 ms at 64).
- Default kept at **`infer_batch_size=64`** as the balanced, safer baseline for mixed traffic.

## Reproduce

```bash
./scripts/benchmark_reranker_1000docs.sh
```
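
The selection described under Decision can be re-derived from the Results table with a short script. The following is a minimal sketch, not part of the benchmark tooling: the `Run` type and the variable names are hypothetical, and the rows are copied by hand from the table above. It picks the lowest average latency at c=1 and the highest throughput at c=4.

```python
from typing import NamedTuple


class Run(NamedTuple):
    """One row of the Results table (hypothetical helper type)."""
    infer_batch_size: int
    concurrency: int
    requests: int
    rps: float
    avg_ms: float
    p95_ms: float
    p99_ms: float


# Rows copied verbatim from the Results table.
RUNS = [
    Run(24, 1, 4, 0.34, 2962.42, 4758.59, 4969.54),
    Run(32, 1, 4, 0.56, 1756.63, 2285.96, 2389.21),
    Run(48, 1, 4, 0.63, 1570.78, 2111.8, 2206.45),
    Run(64, 1, 4, 0.68, 1448.87, 2014.51, 2122.44),
    Run(80, 1, 3, 0.64, 1546.78, 2091.47, 2157.13),
    Run(96, 1, 3, 0.64, 1534.44, 2202.48, 2288.99),
    Run(96, 1, 4, 0.52, 1914.41, 2215.05, 2216.08),
    Run(24, 4, 8, 0.46, 8614.9, 9886.68, 9887.0),
    Run(32, 4, 8, 0.62, 6432.39, 6594.11, 6595.41),
    Run(48, 4, 8, 0.72, 5451.5, 5495.01, 5500.29),
    Run(64, 4, 8, 0.76, 5217.79, 5329.15, 5332.3),
    Run(80, 4, 6, 0.49, 7069.9, 9198.54, 9208.77),
    Run(96, 4, 6, 0.76, 4302.86, 5139.19, 5149.08),
    Run(96, 4, 8, 0.81, 4852.78, 5038.37, 5058.76),
]

# Lowest average latency among single-request (c=1) runs.
best_latency_c1 = min(
    (r for r in RUNS if r.concurrency == 1), key=lambda r: r.avg_ms
)

# Highest throughput among concurrent (c=4) runs.
best_rps_c4 = max(
    (r for r in RUNS if r.concurrency == 4), key=lambda r: r.rps
)

print("best c=1 latency: infer_batch_size =", best_latency_c1.infer_batch_size)
print("best c=4 throughput: infer_batch_size =", best_rps_c4.infer_batch_size)
```

With the rows above, the first selection lands on `infer_batch_size=64` and the second on `infer_batch_size=96`, matching the trade-off stated in the Decision section. Each configuration was measured over only 3 to 8 requests, so the ranking may shift on a rerun.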