
Reranker 1000-doc Performance Report (2026-03-11)

Workload profile:

  • backend: qwen3_vllm (Qwen/Qwen3-Reranker-0.6B)
  • query: short e-commerce text
  • docs/request: 1000 short titles/title+brief
  • options: sort_by_doc_length=true
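
The workload above can be sketched as a request payload. This is a minimal illustration only: the field and option names (`documents`, `options`, `infer_batch_size`) are assumptions, not the benchmarked service's actual API schema.

```python
# Hypothetical shape of one benchmarked rerank request.
# Field/option names are illustrative assumptions.
def build_rerank_request(query: str, docs: list, infer_batch_size: int) -> dict:
    """Assemble one 1000-doc rerank request as described in the workload profile."""
    return {
        "model": "Qwen/Qwen3-Reranker-0.6B",
        "query": query,
        "documents": docs,
        "options": {
            "sort_by_doc_length": True,            # as in the workload profile
            "infer_batch_size": infer_batch_size,  # swept 24..96 in the results
        },
    }

payload = build_rerank_request(
    query="wireless earbuds noise cancelling",
    docs=["product title %d" % i for i in range(1000)],
    infer_batch_size=64,
)
```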

Results

| infer_batch_size | concurrency | requests | RPS | avg (ms) | p95 (ms) | p99 (ms) |
|---:|---:|---:|---:|---:|---:|---:|
| 24 | 1 | 4 | 0.34 | 2962.42 | 4758.59 | 4969.54 |
| 32 | 1 | 4 | 0.56 | 1756.63 | 2285.96 | 2389.21 |
| 48 | 1 | 4 | 0.63 | 1570.78 | 2111.8 | 2206.45 |
| 64 | 1 | 4 | 0.68 | 1448.87 | 2014.51 | 2122.44 |
| 80 | 1 | 3 | 0.64 | 1546.78 | 2091.47 | 2157.13 |
| 96 | 1 | 3 | 0.64 | 1534.44 | 2202.48 | 2288.99 |
| 96 | 1 | 4 | 0.52 | 1914.41 | 2215.05 | 2216.08 |
| 24 | 4 | 8 | 0.46 | 8614.9 | 9886.68 | 9887.0 |
| 32 | 4 | 8 | 0.62 | 6432.39 | 6594.11 | 6595.41 |
| 48 | 4 | 8 | 0.72 | 5451.5 | 5495.01 | 5500.29 |
| 64 | 4 | 8 | 0.76 | 5217.79 | 5329.15 | 5332.3 |
| 80 | 4 | 6 | 0.49 | 7069.9 | 9198.54 | 9208.77 |
| 96 | 4 | 6 | 0.76 | 4302.86 | 5139.19 | 5149.08 |
| 96 | 4 | 8 | 0.81 | 4852.78 | 5038.37 | 5058.76 |
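
The table's columns can be derived from raw per-request wall-clock latencies plus total run time. A minimal sketch follows; the benchmark script's exact percentile method is an assumption (nearest-rank is shown here).

```python
import math

def summarize(latencies_ms, wall_seconds):
    """Compute requests, RPS, avg, p95, p99 from per-request latencies (ms)."""
    xs = sorted(latencies_ms)

    def pct(p):
        # Nearest-rank percentile: smallest sample covering p% of the data.
        k = max(1, math.ceil(p / 100 * len(xs)))
        return xs[k - 1]

    return {
        "requests": len(xs),
        "rps": round(len(xs) / wall_seconds, 2),
        "avg_ms": round(sum(xs) / len(xs), 2),
        "p95_ms": pct(95),
        "p99_ms": pct(99),
    }

# Example with synthetic latencies (not taken from the table above):
stats = summarize([1400.0, 1500.0, 1450.0, 1450.0], wall_seconds=5.9)
```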

Decision

  • For latency-sensitive online rerank (a single request carrying 1000 docs), infer_batch_size=64 gives the best c=1 latency in this run set (avg 1448.87 ms).
  • At higher concurrency (c=4), infer_batch_size=96 yields slightly higher throughput (0.81 vs 0.76 RPS over 8 requests), but noticeably degrades c=1 latency (avg 1914.41 vs 1448.87 ms).
  • The default stays at infer_batch_size=64 as the balanced, safer baseline for mixed traffic.
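
The decision above might be captured in a settings fragment like the following. The key names are illustrative assumptions, not the service's actual config schema.

```python
# Hypothetical settings fragment reflecting the decision above.
# Key names are illustrative, not the actual config schema.
RERANKER_SETTINGS = {
    "backend": "qwen3_vllm",
    "model": "Qwen/Qwen3-Reranker-0.6B",
    "infer_batch_size": 64,        # balanced default for mixed c=1 / c=4 traffic
    "sort_by_doc_length": True,
}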

Reproduce

./scripts/benchmark_reranker_1000docs.sh