  # Reranker benchmark: `qwen3_vllm` vs `qwen3_vllm_score` × `instruction_format`
  
  **Date:** 2026-03-25  
  **Host:** single GPU (Tesla T4, ~16 GiB), CUDA 12.8 (see `nvidia-smi` during run).
  
  ## Configuration (from `config/config.yaml`)
  
  Shared across both backends for this run:
  
  | Key | Value |
  |-----|-------|
  | `model_name` | `Qwen/Qwen3-Reranker-0.6B` |
  | `max_model_len` | 160 |
  | `infer_batch_size` | 100 |
  | `sort_by_doc_length` | true |
  | `enable_prefix_caching` | true |
  | `enforce_eager` | false |
  | `dtype` | float16 |
  | `tensor_parallel_size` | 1 |
  | `gpu_memory_utilization` | 0.20 |
  | `instruction` | `Rank products by query with category & style match prioritized` |
  
  `qwen3_vllm` uses vLLM **generate + logprobs** (`.venv-reranker`).  
  `qwen3_vllm_score` uses vLLM **`LLM.score()`** (`.venv-reranker-score`, pinned vLLM stack per `reranker/README.md`).
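The generate path derives a relevance score from the log-probs of the model's final "yes"/"no" judgement token. A minimal sketch of that conversion, assuming the usual Qwen3-Reranker two-way softmax convention (the function name is illustrative, not taken from the repo):

```python
import math

def yes_no_score(lp_yes: float, lp_no: float) -> float:
    """Relevance in (0, 1) from the 'yes' / 'no' token log-probs.

    Numerically stable two-way softmax; equivalent to
    sigmoid(lp_yes - lp_no).
    """
    m = max(lp_yes, lp_no)
    e_yes = math.exp(lp_yes - m)
    e_no = math.exp(lp_no - m)
    return e_yes / (e_yes + e_no)
```

Documents are then ordered by this score, highest first.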
  
  ## Methodology
  
  - Script: `python benchmarks/reranker/benchmark_reranker_random_titles.py 100,200,400,600,800,1000 --repeat 5` with **`--seed 99`** (see note below), **`--quiet-runs`**, **`--timeout 360`**.
  - Titles: default file `/home/ubuntu/rerank_test/titles.1.8w` (one title per line).
  - Query: default `健身女生T恤短袖`.
  - Each scenario: **3 warm-up** requests at `n=400` (not timed), then **5 timed** runs per `n`.
  - Metric: **client wall time** for `POST /rerank` (localhost), milliseconds.
  - After each `services.rerank.backend` / `instruction_format` change: `./restart.sh reranker`, then polled **`GET /health`** until `backend` and `instruction_format` matched the intended scenario (`reranker/server.py` was extended to expose `instruction_format` when the backend defines `_instruction_format`).
  
  **Note on RNG seed:** With `--seed 42`, some runs occasionally lost one sample at `n=600` (non-200 or transport error). All figures below use **`--seed 99`** so every cell has **5/5** successful runs and comparable sampled titles.
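The per-scenario loop above (3 untimed warm-ups, then 5 timed runs per `n`) boils down to a pattern like this, with `request_fn` standing in for the actual `POST /rerank` call (a sketch, not the benchmark script itself):

```python
import time
from typing import Callable, List

def timed_runs(request_fn: Callable[[], None],
               warmups: int = 3, repeats: int = 5) -> List[float]:
    """Run `warmups` untimed requests, then return `repeats` wall times in ms."""
    for _ in range(warmups):
        request_fn()  # warm-up, not recorded
    values_ms: List[float] = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        request_fn()  # e.g. POST /rerank with n sampled titles
        values_ms.append((time.perf_counter() - t0) * 1000.0)
    return values_ms
```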
  
  ## Raw artifacts
  
  JSON aggregates (means, stdev, raw `values_ms`): same directory, `qwen3_vllm_{compact,standard}.json`, `qwen3_vllm_score_{compact,standard}.json`.
  
  ## Results — mean latency (ms)
  
  | backend | instruction_format | n=100 | n=200 | n=400 | n=600 | n=800 | n=1000 |
  |---------|-------------------|------:|------:|------:|------:|------:|-------:|
  | `qwen3_vllm` | `compact` | 213.5 | 418.0 | 861.4 | 1263.4 | 1744.3 | 2162.2 |
  | `qwen3_vllm` | `standard` | 254.9 | 475.4 | 909.7 | 1353.2 | 1912.5 | 2406.7 |
  | `qwen3_vllm_score` | `compact` | 239.2 | 480.2 | 966.2 | 1433.5 | 1937.2 | 2428.4 |
  | `qwen3_vllm_score` | `standard` | 299.6 | 591.8 | 1178.9 | 1773.7 | 2341.6 | 2931.7 |
  
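A quick way to read the table is the incremental cost per document between n=100 and n=1000 (numbers copied from the table; a derived sanity check, not part of the original run):

```python
# Mean latency (ms) at n=100 and n=1000, copied from the results table.
rows = {
    ("qwen3_vllm", "compact"): (213.5, 2162.2),
    ("qwen3_vllm", "standard"): (254.9, 2406.7),
    ("qwen3_vllm_score", "compact"): (239.2, 2428.4),
    ("qwen3_vllm_score", "standard"): (299.6, 2931.7),
}

# Incremental latency per extra document over the 900-doc span.
per_doc_ms = {k: (t1000 - t100) / 900 for k, (t100, t1000) in rows.items()}
```

`qwen3_vllm` + `compact` has the shallowest slope (≈2.17 ms/doc) and `qwen3_vllm_score` + `standard` the steepest (≈2.92 ms/doc), consistent with the ranking at n=1000.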
  ## Short interpretation
  
  1. **`compact` vs `standard`:** For both backends, **`compact` is faster** on this setup; its chat template is shorter than `standard`'s fixed yes/no system prompt + user block (see `reranker/backends/qwen3_vllm.py` / `qwen3_vllm_score.py`).
  2. **`qwen3_vllm` vs `qwen3_vllm_score`:** At **`n=1000`**, **`qwen3_vllm` + `compact`** is the fastest row (~2162 ms mean); **`qwen3_vllm_score` + `standard`** is the slowest (~2932 ms). Ordering can change on other GPUs / vLLM versions / batching.
  3. **Repo / ops default (current):** `services.rerank.backend` is usually `qwen3_vllm_score`; for the **score** block, **`instruction_format: compact`** is recommended (it matches the backend's code default). The `qwen3_vllm` block's `instruction_format` can be configured separately for the generate backend and does not have to match the score block.
  
  ## Tooling added / changed
  
  - `reranker/server.py`: `/health` includes `instruction_format` when the active backend sets `_instruction_format`.
  - `benchmarks/reranker/benchmark_reranker_random_titles.py`: `--tag`, `--json-summary-out`, `--quiet-runs`.
  - `benchmarks/reranker/patch_rerank_vllm_benchmark_config.py`: surgical YAML patch (preserves newlines).
  - `benchmarks/reranker/run_reranker_vllm_instruction_benchmark.sh`: full matrix driver (continues if a benchmark exits non-zero; uses `--timeout 360`).
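The point of a surgical YAML patch is to flip individual keys without reserializing the whole file (which would churn comments, ordering, and newlines). A minimal regex-based sketch of that idea (illustrative only, not the actual script):

```python
import re
from pathlib import Path

def patch_yaml_key(path: str, key: str, value: str) -> None:
    """Replace the first `key: ...` line in place; all other bytes untouched."""
    text = Path(path).read_text()
    pattern = re.compile(rf"^(\s*{re.escape(key)}:\s*).*$", re.MULTILINE)
    new_text, n = pattern.subn(lambda m: m.group(1) + value, text, count=1)
    if n != 1:
        raise KeyError(f"{key!r} not found in {path}")
    Path(path).write_text(new_text)
```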
  
  ---
  
  ## Addendum: `qwen3_vllm_score` after attention auto-select (FLASHINFER on T4)
  
  **Do not replace the table above**: it records the **older** `qwen3_vllm_score` behaviour (roughly: on sm<8 GPUs the backend injected `attention_config` / `TRITON_ATTN` into vLLM, and the code default for `instruction_format` was still `standard`).
  
  ### What changed in code / ops
  
  | Area | Before (baseline table) | After (this addendum) |
  |------|-------------------------|------------------------|
  | Attention | Backend forced / steered attention on T4 (e.g. `TRITON_ATTN` path) | **No** `attention_config` in `LLM(...)`; vLLM **auto** — on this T4 run, logs show **`FLASHINFER`** |
  | Config surface | `vllm_attention_backend` / `RERANK_VLLM_ATTENTION_BACKEND`, etc. | **Removed** (fewer YAML/env-var branches; simpler logic) |
  | Code default `instruction_format` | `qwen3_vllm_score` defaulted to `standard` | Aligned with `qwen3_vllm` as **`compact`** (`standard` can still be set in YAML) |
  | Smoke / startup | — | `benchmarks/reranker/smoke_qwen3_vllm_score_backend.py`; `scripts/start_reranker.sh` puts the **venv `bin` on `PATH`** (the FLASHINFER JIT depends on the venv's `ninja`) |
  
  Micro-benchmark (same machine, isolated): **~927.5 ms → ~673.1 ms** at **n=400** docs on `LLM.score()` steady state (~**28%**), after removing the forced attention path and letting vLLM pick **FLASHINFER**.
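The quoted speedup is easy to re-derive from the two means (a sanity check on the numbers above, nothing more):

```python
before_ms, after_ms = 927.5, 673.1  # LLM.score() steady state at n=400

reduction_pct = (before_ms - after_ms) / before_ms * 100
# ≈ 27.4%, i.e. the "~28%" quoted above
```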
  
  ### Re-benchmark (HTTP `POST /rerank`, same methodology as §Methodology)
  
  - **Purpose:** Same comparison axis as the main table (`qwen3_vllm_score` only), **after** the FLASHINFER-friendly backend.
  - **Controlled for `max_model_len`:** `services.rerank.backends.qwen3_vllm_score.max_model_len` set to **160** for this run so numbers are comparable to the **baseline** rows (also 160). Production `config.yaml` may use a different value (e.g. **196**); adjust YAML before repeating the benchmark if you need prod-shaped latency.
  - **Seed / repeats:** `--seed 99`, `--repeat 5`, same script and title file as §Methodology.
  - **Artifacts:** `qwen3_vllm_score_compact_post_flashinfer_opt.json`, `qwen3_vllm_score_standard_post_flashinfer_opt.json`.
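For reference, the `max_model_len` override described above corresponds to a YAML fragment shaped roughly like this (the nesting is inferred from the dotted key path; surrounding keys in the real `config/config.yaml` may differ):

```yaml
services:
  rerank:
    backends:
      qwen3_vllm_score:
        max_model_len: 160   # benchmark value; production may use e.g. 196
```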
  
  #### `qwen3_vllm_score` — mean latency (ms), post optimization
  
  | instruction_format | n=100 | n=200 | n=400 | n=600 | n=800 | n=1000 | vs baseline same row (approx.) |
  |--------------------|------:|------:|------:|------:|------:|-------:|--------------------------------|
  | `compact` | 178.5 | 351.7 | **688.2** | 1024.0 | 1375.8 | **1752.4** | e.g. n=400 **−28.8%**, n=1000 **−27.8%** vs 966.2 / 2428.4 |
  | `standard` | 198.4 | 386.4 | **762.8** | 1174.6 | 1540.1 | **1970.1** | e.g. n=400 **−35.3%**, n=1000 **−32.8%** vs 1178.9 / 2931.7 |
  
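The "vs baseline" percentages can be recomputed directly from the two tables (baseline §Results rows vs the post-optimization rows above):

```python
# Mean latency (ms): baseline §Results rows vs post-optimization rows.
baseline = {"compact": {400: 966.2, 1000: 2428.4},
            "standard": {400: 1178.9, 1000: 2931.7}}
post_opt = {"compact": {400: 688.2, 1000: 1752.4},
            "standard": {400: 762.8, 1000: 1970.1}}

delta_pct = {fmt: {n: (post_opt[fmt][n] / baseline[fmt][n] - 1.0) * 100
                   for n in (400, 1000)}
             for fmt in baseline}
```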
  **What changed for `instruction_format: standard` (this version):** it now **shares** the same vLLM attention auto-selection as `compact`; `TRITON_ATTN` is no longer pinned separately on T4. Its prompt is still longer than `compact`'s (fixed yes/no system prompt plus the official prefix template), so its **absolute latency remains higher than `compact`**, but the drop versus the old **standard** row is of the same magnitude as `compact`'s (see the table below).
  
  **Takeaway:** Under T4 + vLLM 0.18 score path, **auto attention (FLASHINFER)** plus **`compact` default** brings `qwen3_vllm_score` much closer to `qwen3_vllm` timings from the baseline matrix; re-run the full 4-way matrix if you need refreshed `qwen3_vllm` rows on the same commit.