  # Reranker benchmark: `qwen3_vllm` vs `qwen3_vllm_score` × `instruction_format`
  
  **Date:** 2026-03-25  
  **Host:** single GPU (Tesla T4, ~16 GiB), CUDA 12.8 (as reported by `nvidia-smi` during the run).
  
  ## Configuration (from `config/config.yaml`)
  
  Shared across both backends for this run:
  
  | Key | Value |
  |-----|-------|
  | `model_name` | `Qwen/Qwen3-Reranker-0.6B` |
  | `max_model_len` | 160 |
  | `infer_batch_size` | 100 |
  | `sort_by_doc_length` | true |
  | `enable_prefix_caching` | true |
  | `enforce_eager` | false |
  | `dtype` | float16 |
  | `tensor_parallel_size` | 1 |
  | `gpu_memory_utilization` | 0.20 |
  | `instruction` | `Rank products by query with category & style match prioritized` |
  
  `qwen3_vllm` uses vLLM **generate + logprobs** (`.venv-reranker`).  
  `qwen3_vllm_score` uses vLLM **`LLM.score()`** (`.venv-reranker-score`, pinned vLLM stack per `reranker/README.md`).
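The two paths differ mainly in how a relevance score is read out of the model. For the generate + logprobs path, the usual Qwen3-Reranker recipe is to take the logprobs of the single "yes"/"no" judgement token and softmax them into P(yes). A minimal sketch of that conversion (a pure function; the real prompt/template handling lives in `reranker/backends/qwen3_vllm.py`, and this helper name is illustrative):

```python
import math

def yes_probability(logprob_yes: float, logprob_no: float) -> float:
    """Two-way softmax over the 'yes'/'no' token logprobs.

    Mirrors the generate+logprobs scoring idea: the model emits one
    judgement token, and its log-probabilities for 'yes' and 'no' are
    converted into a relevance score in [0, 1].
    """
    m = max(logprob_yes, logprob_no)  # subtract the max for numerical stability
    e_yes = math.exp(logprob_yes - m)
    e_no = math.exp(logprob_no - m)
    return e_yes / (e_yes + e_no)
```

The score backend needs no such post-processing: it relies on vLLM's `LLM.score()` to return relevance scores directly.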
  
  ## Methodology
  
  - Script: `python scripts/benchmark_reranker_random_titles.py 100,200,400,600,800,1000 --repeat 5` with **`--seed 99`** (see note below), **`--quiet-runs`**, **`--timeout 360`**.
  - Titles: default file `/home/ubuntu/rerank_test/titles.1.8w` (one title per line).
  - Query: default `健身女生T恤短袖`.
  - Each scenario: **3 warm-up** requests at `n=400` (not timed), then **5 timed** runs per `n`.
  - Metric: **client wall time** for `POST /rerank` (localhost), milliseconds.
  - After each `services.rerank.backend` / `instruction_format` change: run `./restart.sh reranker`, then poll **`GET /health`** until both `backend` and `instruction_format` match the intended scenario (`reranker/server.py` was extended to expose `instruction_format` when the backend defines `_instruction_format`).
  
  **Note on RNG seed:** With `--seed 42`, some runs occasionally lost one sample at `n=600` (non-200 or transport error). All figures below use **`--seed 99`** so every cell has **5/5** successful runs and comparable sampled titles.
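The restart-then-poll step above can be condensed into a small readiness check; a sketch assuming the `/health` payload carries top-level `backend` and `instruction_format` fields (the function names are illustrative):

```python
import json
import time
import urllib.request

def health_matches(payload: dict, backend: str, instruction_format: str) -> bool:
    """True once /health reports the intended scenario.

    `instruction_format` may be absent while an old process is still
    serving, so a missing key counts as "not ready yet".
    """
    return (payload.get("backend") == backend
            and payload.get("instruction_format") == instruction_format)

def wait_until_ready(url: str, backend: str, fmt: str, timeout_s: float = 120.0) -> None:
    """Poll GET /health until it matches or the deadline passes (sketch)."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if health_matches(json.load(resp), backend, fmt):
                    return
        except OSError:
            pass  # server still restarting; retry
        time.sleep(1.0)
    raise TimeoutError(f"{url} never reported backend={backend}, instruction_format={fmt}")
```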
  
  ## Raw artifacts
  
  JSON aggregates (means, stdev, raw `values_ms`): same directory, `qwen3_vllm_{compact,standard}.json`, `qwen3_vllm_score_{compact,standard}.json`.
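A sketch of how the per-cell summaries can be recomputed from the raw `values_ms` arrays (the exact key names in the JSON files are an assumption; check the artifacts themselves):

```python
import statistics

def aggregate(values_ms: list[float]) -> dict:
    """Mean/stdev summary in roughly the shape of the benchmark JSON output
    (key names are assumed, not taken from the actual files)."""
    return {
        "mean_ms": statistics.fmean(values_ms),
        "stdev_ms": statistics.stdev(values_ms) if len(values_ms) > 1 else 0.0,
        "values_ms": values_ms,
    }
```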
  
  ## Results — mean latency (ms)
  
  | backend | instruction_format | n=100 | n=200 | n=400 | n=600 | n=800 | n=1000 |
  |---------|-------------------|------:|------:|------:|------:|------:|-------:|
  | `qwen3_vllm` | `compact` | 213.5 | 418.0 | 861.4 | 1263.4 | 1744.3 | 2162.2 |
  | `qwen3_vllm` | `standard` | 254.9 | 475.4 | 909.7 | 1353.2 | 1912.5 | 2406.7 |
  | `qwen3_vllm_score` | `compact` | 239.2 | 480.2 | 966.2 | 1433.5 | 1937.2 | 2428.4 |
  | `qwen3_vllm_score` | `standard` | 299.6 | 591.8 | 1178.9 | 1773.7 | 2341.6 | 2931.7 |
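Each row grows close to linearly in `n`, so a rough marginal per-document cost can be read off the endpoints of the table above:

```python
def marginal_ms_per_doc(lat_lo_ms: float, lat_hi_ms: float,
                        n_lo: int = 100, n_hi: int = 1000) -> float:
    """Slope between two cells of the latency table, i.e. the approximate
    extra wall time each additional document costs."""
    return (lat_hi_ms - lat_lo_ms) / (n_hi - n_lo)

# From the table: fastest row (qwen3_vllm + compact) ≈ 2.17 ms/doc,
# slowest row (qwen3_vllm_score + standard) ≈ 2.92 ms/doc.
```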
  
  ## Short interpretation
  
  1. **`compact` vs `standard`:** For both backends, **`compact` is faster** on this setup (shorter / different chat template vs fixed yes/no system prompt + user block — see `reranker/backends/qwen3_vllm.py` / `qwen3_vllm_score.py`).
  2. **`qwen3_vllm` vs `qwen3_vllm_score`:** At **`n=1000`**, **`qwen3_vllm` + `compact`** is the fastest row (~2162 ms mean); **`qwen3_vllm_score` + `standard`** is the slowest (~2932 ms). Ordering can change on other GPUs / vLLM versions / batching.
  3. **Repo / ops default (current):** `services.rerank.backend` is usually set to `qwen3_vllm_score`; for the **score** block, **`instruction_format: compact`** is recommended (it matches the backend code default). The `instruction_format` of the `qwen3_vllm` block can be configured separately for the generate backend and does not have to match the score setting.
  
  ## Tooling added / changed
  
  - `reranker/server.py`: `/health` includes `instruction_format` when the active backend sets `_instruction_format`.
  - `scripts/benchmark_reranker_random_titles.py`: `--tag`, `--json-summary-out`, `--quiet-runs`.
  - `scripts/patch_rerank_vllm_benchmark_config.py`: surgical YAML patch (preserves newlines).
  - `scripts/run_reranker_vllm_instruction_benchmark.sh`: full matrix driver (continues if a benchmark exits non-zero; uses `--timeout 360`).
  
  ---
  
  ## Addendum: `qwen3_vllm_score` after attention auto-select (FLASHINFER on T4)
  
  **Do not replace the table above**: it records the **older** `qwen3_vllm_score` behaviour (roughly: on sm < 8.0 GPUs the backend injected an `attention_config` / `TRITON_ATTN` override into vLLM, and the code default for `instruction_format` was still `standard`).
  
  ### What changed in code / ops
  
  | Area | Before (baseline table) | After (this addendum) |
  |------|-------------------------|------------------------|
  | Attention | Backend forced / steered attention on T4 (e.g. `TRITON_ATTN` path) | **No** `attention_config` in `LLM(...)`; vLLM **auto** — on this T4 run, logs show **`FLASHINFER`** |
  | Config surface | `vllm_attention_backend` / `RERANK_VLLM_ATTENTION_BACKEND`, etc. | **Removed** (fewer YAML / env-var branches; simpler, converged logic) |
  | Code default `instruction_format` | `qwen3_vllm_score` defaulted to `standard` | Aligned with `qwen3_vllm` as **`compact`** (YAML may still set `standard`) |
  | Smoke / startup | — | `scripts/smoke_qwen3_vllm_score_backend.py`; `scripts/start_reranker.sh` puts the **venv `bin` first on `PATH`** (the FLASHINFER JIT needs `ninja` from the venv) |
  
  Micro-benchmark (same machine, isolated): **~927.5 ms → ~673.1 ms** steady state at **n=400** docs on `LLM.score()` (a ~**27%** reduction), after removing the forced attention path and letting vLLM pick **FLASHINFER**.
  
  ### Re-benchmark (HTTP `POST /rerank`, same methodology as §Methodology)
  
  - **Purpose:** Same comparison axis as the main table (`qwen3_vllm_score` only), **after** the FLASHINFER-friendly backend.
  - **Controlled for `max_model_len`:** `services.rerank.backends.qwen3_vllm_score.max_model_len` set to **160** for this run so numbers are comparable to the **baseline** rows (also 160). Production `config.yaml` may use a different value (e.g. **196**); adjust YAML before repeating the benchmark if you need prod-shaped latency.
  - **Seed / repeats:** `--seed 99`, `--repeat 5`, same script and title file as §Methodology.
  - **Artifacts:** `qwen3_vllm_score_compact_post_flashinfer_opt.json`, `qwen3_vllm_score_standard_post_flashinfer_opt.json`.
  
  #### `qwen3_vllm_score` — mean latency (ms), post optimization
  
  | instruction_format | n=100 | n=200 | n=400 | n=600 | n=800 | n=1000 | vs baseline same row (approx.) |
  |--------------------|------:|------:|------:|------:|------:|-------:|--------------------------------|
  | `compact` | 178.5 | 351.7 | **688.2** | 1024.0 | 1375.8 | **1752.4** | e.g. n=400 **−28.8%**, n=1000 **−27.8%** vs 966.2 / 2428.4 |
  | `standard` | 198.4 | 386.4 | **762.8** | 1174.6 | 1540.1 | **1970.1** | e.g. n=400 **−35.3%**, n=1000 **−32.8%** vs 1178.9 / 2931.7 |
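The "vs baseline" column is plain relative change between the two tables; a sketch of the computation (negative means faster after the change):

```python
def pct_vs_baseline(baseline_ms: float, post_ms: float) -> float:
    """Signed change relative to the baseline table row, in percent.
    Negative values mean the post-FLASHINFER run is faster."""
    return (post_ms - baseline_ms) / baseline_ms * 100.0

# e.g. pct_vs_baseline(966.2, 688.2) ≈ -28.8  (compact row, n=400)
```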
  
  **What changed for `instruction_format: standard` (this version):** it now **shares** the same vLLM attention auto-selection as `compact`; `TRITON_ATTN` is no longer pinned on T4. The prompt is still longer than `compact`'s (fixed yes/no system prompt plus the official prefix template), so **absolute latency stays above `compact`**, but the drop versus the old `standard` row is of the same magnitude as for `compact` (table above).
  
  **Takeaway:** Under T4 + vLLM 0.18 score path, **auto attention (FLASHINFER)** plus **`compact` default** brings `qwen3_vllm_score` much closer to `qwen3_vllm` timings from the baseline matrix; re-run the full 4-way matrix if you need refreshed `qwen3_vllm` rows on the same commit.