perf_reports/20260311/reranker_1000docs/report.md

# Reranker 1000-doc Performance Report (2026-03-11)
Workload profile:
- backend: `qwen3_vllm` (`Qwen/Qwen3-Reranker-0.6B`)
- query: short e-commerce text (<100 tokens)
- docs/request: 1000 short documents (title only, or title + brief)
- options: `sort_by_doc_length=true`, `length_sort_mode=char`
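
The two options above can be illustrated with a minimal sketch. It assumes that `length_sort_mode=char` means documents are ordered by character count (not token count) before being split into chunks of `infer_batch_size`, and that scores are scattered back to the caller's order afterwards. The function names and the ascending sort direction are hypothetical, not taken from the service code.

```python
from typing import List, Sequence


def length_sorted_batches(docs: Sequence[str], infer_batch_size: int) -> List[List[int]]:
    """Return batches of document indices, ordered by character length."""
    if infer_batch_size <= 0:
        raise ValueError("infer_batch_size must be positive")
    # Assumed meaning of length_sort_mode=char: the key is len(str), not a token count.
    order = sorted(range(len(docs)), key=lambda i: len(docs[i]))
    return [order[start:start + infer_batch_size]
            for start in range(0, len(order), infer_batch_size)]


def restore_original_order(batches: List[List[int]],
                           batch_scores: List[List[float]]) -> List[float]:
    """Scatter per-batch scores back to the caller's document order."""
    scores = [0.0] * sum(len(batch) for batch in batches)
    for indices, values in zip(batches, batch_scores):
        for idx, score in zip(indices, values):
            scores[idx] = score
    return scores
```

Under this scheme, a 1000-doc request is split into 42 batches at `infer_batch_size=24`, 16 batches at 64 (15 full plus one of 40), and 11 batches at 96.
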
## Results
| infer_batch_size | concurrency | requests | rps | avg_ms | p95_ms | p99_ms |
|---:|---:|---:|---:|---:|---:|---:|
| 24 | 1 | 4 | 0.34 | 2962.42 | 4758.59 | 4969.54 |
| 32 | 1 | 4 | 0.56 | 1756.63 | 2285.96 | 2389.21 |
| 48 | 1 | 4 | 0.63 | 1570.78 | 2111.80 | 2206.45 |
| 64 | 1 | 4 | 0.68 | 1448.87 | 2014.51 | 2122.44 |
| 80 | 1 | 3 | 0.64 | 1546.78 | 2091.47 | 2157.13 |
| 96 | 1 | 3 | 0.64 | 1534.44 | 2202.48 | 2288.99 |
| 96 | 1 | 4 | 0.52 | 1914.41 | 2215.05 | 2216.08 |
| 24 | 4 | 8 | 0.46 | 8614.90 | 9886.68 | 9887.00 |
| 32 | 4 | 8 | 0.62 | 6432.39 | 6594.11 | 6595.41 |
| 48 | 4 | 8 | 0.72 | 5451.50 | 5495.01 | 5500.29 |
| 64 | 4 | 8 | 0.76 | 5217.79 | 5329.15 | 5332.30 |
| 80 | 4 | 6 | 0.49 | 7069.90 | 9198.54 | 9208.77 |
| 96 | 4 | 6 | 0.76 | 4302.86 | 5139.19 | 5149.08 |
| 96 | 4 | 8 | 0.81 | 4852.78 | 5038.37 | 5058.76 |
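
For readers interpreting the columns, here is a rough sketch of how one row could be derived from raw per-request latencies. It is an assumption, not the benchmark script's actual code: `rps` is taken as completed requests divided by wall-clock time, and percentiles use linear interpolation between sorted samples, which would explain why p95 and p99 differ even on runs with only 3 or 4 requests. The names `percentile` and `summarize` are hypothetical.

```python
from typing import Dict, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Linear-interpolation percentile (assumed method, not confirmed by the report)."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = (pct / 100.0) * (len(ordered) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def summarize(latencies_ms: Sequence[float], wall_time_s: float) -> Dict[str, float]:
    """Build one table row from per-request latencies and total wall-clock time."""
    count = len(latencies_ms)
    return {
        "requests": count,
        "rps": round(count / wall_time_s, 2),
        "avg_ms": round(sum(latencies_ms) / count, 2),
        "p95_ms": round(percentile(latencies_ms, 95), 2),
        "p99_ms": round(percentile(latencies_ms, 99), 2),
    }
```

With so few samples per run, the tail percentiles are dominated by the single slowest request, so small differences in p95 and p99 between rows are best read as indicative only.
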
## Decision
- For latency-sensitive online rerank (single request with 1000 docs), `infer_batch_size=64` gives the best c=1 latency in this run set: avg 1448.87 ms, p95 2014.51 ms, p99 2122.44 ms.
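- At higher concurrency (c=4), `infer_batch_size=96` reaches slightly higher throughput (0.81 rps in the 8-request run vs 0.76 rps for 64), but its c=1 avg latency is worse than with 64 (1534.44 ms and 1914.41 ms across two runs vs 1448.87 ms).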
- Default kept at **`infer_batch_size=64`** as a balanced, safer baseline for mixed traffic.
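
The selection logic behind these bullets can be checked with a small sketch over the Results table. The `Run` type and helper names are hypothetical; the numbers are copied from the table (only the columns needed for the comparison).

```python
from typing import Iterable, List, NamedTuple


class Run(NamedTuple):
    infer_batch_size: int
    concurrency: int
    requests: int
    rps: float
    avg_ms: float


# Subset of columns copied from the Results table.
RUNS: List[Run] = [
    Run(24, 1, 4, 0.34, 2962.42), Run(32, 1, 4, 0.56, 1756.63),
    Run(48, 1, 4, 0.63, 1570.78), Run(64, 1, 4, 0.68, 1448.87),
    Run(80, 1, 3, 0.64, 1546.78), Run(96, 1, 3, 0.64, 1534.44),
    Run(96, 1, 4, 0.52, 1914.41), Run(24, 4, 8, 0.46, 8614.90),
    Run(32, 4, 8, 0.62, 6432.39), Run(48, 4, 8, 0.72, 5451.50),
    Run(64, 4, 8, 0.76, 5217.79), Run(80, 4, 6, 0.49, 7069.90),
    Run(96, 4, 6, 0.76, 4302.86), Run(96, 4, 8, 0.81, 4852.78),
]


def lowest_avg_latency(runs: Iterable[Run], concurrency: int) -> Run:
    """Run with the smallest avg_ms at the given concurrency."""
    return min((r for r in runs if r.concurrency == concurrency),
               key=lambda r: r.avg_ms)


def highest_throughput(runs: Iterable[Run], concurrency: int) -> Run:
    """Run with the largest rps at the given concurrency."""
    return max((r for r in runs if r.concurrency == concurrency),
               key=lambda r: r.rps)


if __name__ == "__main__":
    print(lowest_avg_latency(RUNS, 1).infer_batch_size)   # 64
    print(highest_throughput(RUNS, 4).infer_batch_size)   # 96
```

On this data, 64 wins on c=1 average latency and 96 wins on c=4 throughput, which matches the trade-off described above.
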
## Reproduce
```bash
./scripts/benchmark_reranker_1000docs.sh
```