  # Reranker 1000-doc Performance Report (2026-03-11)
  
  Workload profile:
  - backend: `qwen3_vllm` (`Qwen/Qwen3-Reranker-0.6B`)
  - query: short e-commerce text (<100 tokens)
  - docs per request: 1000 short documents (title only, or title plus brief)
  - options: `sort_by_doc_length=true`, `length_sort_mode=char`
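
The options above suggest that documents are ordered by character length before being split into inference batches, so that each batch holds texts of similar size. The report does not show the reranker's implementation, so the following is only a minimal sketch of that idea under stated assumptions: `rerank_in_length_sorted_batches` and the `score_batch` callback are hypothetical names, not the service's actual API.

```python
from typing import Callable, List, Sequence


def rerank_in_length_sorted_batches(
    docs: Sequence[str],
    score_batch: Callable[[List[str]], List[float]],
    infer_batch_size: int = 64,
) -> List[float]:
    """Score docs in batches of similar character length.

    Hypothetical helper: `score_batch` stands in for the real model call.
    Scores are returned in the original input order.
    """
    if infer_batch_size <= 0:
        raise ValueError("infer_batch_size must be positive")

    # Assumed meaning of sort_by_doc_length=true with length_sort_mode=char:
    # order document indices by character count.
    order = sorted(range(len(docs)), key=lambda i: len(docs[i]))

    scores = [0.0] * len(docs)
    for start in range(0, len(order), infer_batch_size):
        batch_idx = order[start:start + infer_batch_size]
        batch_scores = score_batch([docs[i] for i in batch_idx])
        # Scatter each score back to its document's original position.
        for i, score in zip(batch_idx, batch_scores):
            scores[i] = score
    return scores


if __name__ == "__main__":
    # Fake scorer for illustration only: the score is the text length.
    def fake_scorer(batch: List[str]) -> List[float]:
        return [float(len(text)) for text in batch]

    titles = ["bbb", "a", "cc", "dddd", "eeeee"]
    print(rerank_in_length_sorted_batches(titles, fake_scorer, infer_batch_size=2))
```

If batching is plain chunking as sketched here, 1000 docs at `infer_batch_size=64` would mean 16 model calls per request (15 full batches plus one of 40 docs).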
  
  ## Results
  
  | infer_batch_size | concurrency | requests | rps | avg_ms | p95_ms | p99_ms |
  |---:|---:|---:|---:|---:|---:|---:|
  | 24 | 1 | 4 | 0.34 | 2962.42 | 4758.59 | 4969.54 |
  | 32 | 1 | 4 | 0.56 | 1756.63 | 2285.96 | 2389.21 |
  | 48 | 1 | 4 | 0.63 | 1570.78 | 2111.80 | 2206.45 |
  | 64 | 1 | 4 | 0.68 | 1448.87 | 2014.51 | 2122.44 |
  | 80 | 1 | 3 | 0.64 | 1546.78 | 2091.47 | 2157.13 |
  | 96 | 1 | 3 | 0.64 | 1534.44 | 2202.48 | 2288.99 |
  | 96 | 1 | 4 | 0.52 | 1914.41 | 2215.05 | 2216.08 |
  | 24 | 4 | 8 | 0.46 | 8614.90 | 9886.68 | 9887.00 |
  | 32 | 4 | 8 | 0.62 | 6432.39 | 6594.11 | 6595.41 |
  | 48 | 4 | 8 | 0.72 | 5451.50 | 5495.01 | 5500.29 |
  | 64 | 4 | 8 | 0.76 | 5217.79 | 5329.15 | 5332.30 |
  | 80 | 4 | 6 | 0.49 | 7069.90 | 9198.54 | 9208.77 |
  | 96 | 4 | 6 | 0.76 | 4302.86 | 5139.19 | 5149.08 |
  | 96 | 4 | 8 | 0.81 | 4852.78 | 5038.37 | 5058.76 |
  
  ## Decision
  
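  - For latency-sensitive online reranking (a single in-flight request with 1000 docs, c=1), `infer_batch_size=64` gives the lowest latency in this run set: 1448.87 ms average and 2014.51 ms p95.
  - At higher concurrency (c=4), `infer_batch_size=96` has slightly higher throughput (0.76 to 0.81 rps versus 0.76 rps at 64), but its c=1 average latency is worse (1534.44 to 1914.41 ms versus 1448.87 ms).
  - The default stays at **`infer_batch_size=64`** as a balanced, safer baseline for mixed traffic. Each configuration was measured with only 3 to 8 requests, so the tail percentiles are rough.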
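
The selection above can be rechecked directly from the Results table. The snippet below is a small sketch that copies the table rows verbatim and picks the lowest average latency at c=1 and the highest throughput at c=4. The `Run` tuple and the helper names are illustrative only and are not part of the benchmark script.

```python
from typing import Iterable, NamedTuple


class Run(NamedTuple):
    infer_batch_size: int
    concurrency: int
    requests: int
    rps: float
    avg_ms: float
    p95_ms: float
    p99_ms: float


# Rows copied from the Results table of this report.
RUNS = [
    Run(24, 1, 4, 0.34, 2962.42, 4758.59, 4969.54),
    Run(32, 1, 4, 0.56, 1756.63, 2285.96, 2389.21),
    Run(48, 1, 4, 0.63, 1570.78, 2111.80, 2206.45),
    Run(64, 1, 4, 0.68, 1448.87, 2014.51, 2122.44),
    Run(80, 1, 3, 0.64, 1546.78, 2091.47, 2157.13),
    Run(96, 1, 3, 0.64, 1534.44, 2202.48, 2288.99),
    Run(96, 1, 4, 0.52, 1914.41, 2215.05, 2216.08),
    Run(24, 4, 8, 0.46, 8614.90, 9886.68, 9887.00),
    Run(32, 4, 8, 0.62, 6432.39, 6594.11, 6595.41),
    Run(48, 4, 8, 0.72, 5451.50, 5495.01, 5500.29),
    Run(64, 4, 8, 0.76, 5217.79, 5329.15, 5332.30),
    Run(80, 4, 6, 0.49, 7069.90, 9198.54, 9208.77),
    Run(96, 4, 6, 0.76, 4302.86, 5139.19, 5149.08),
    Run(96, 4, 8, 0.81, 4852.78, 5038.37, 5058.76),
]


def lowest_latency(runs: Iterable[Run], concurrency: int) -> Run:
    """Return the run with the smallest avg_ms at the given concurrency."""
    return min((r for r in runs if r.concurrency == concurrency), key=lambda r: r.avg_ms)


def highest_throughput(runs: Iterable[Run], concurrency: int) -> Run:
    """Return the run with the largest rps at the given concurrency."""
    return max((r for r in runs if r.concurrency == concurrency), key=lambda r: r.rps)


if __name__ == "__main__":
    fastest = lowest_latency(RUNS, concurrency=1)
    busiest = highest_throughput(RUNS, concurrency=4)
    print(f"c=1 lowest avg_ms: batch {fastest.infer_batch_size} ({fastest.avg_ms} ms)")
    print(f"c=4 highest rps:   batch {busiest.infer_batch_size} ({busiest.rps} rps)")
```

With so few requests per configuration, a check like this only confirms what the table says; it does not establish statistical significance.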
  
  ## Reproduce
  
  ```bash
  ./scripts/benchmark_reranker_1000docs.sh
  ```