Commit 52ea6529b84dee02e7e10c478d0080863a61ab47

Authored by tangwang
1 parent 749d78c8

Performance test:

Two settings, four scenarios:
backend:  qwen3_vllm | qwen3_vllm_score
instruction_format: compact | standard

Ran python scripts/benchmark_reranker_random_titles.py
100,200,400,600,800,1000 --repeat 5
to produce the performance report.
Mean latency (ms, client wall clock for POST /rerank, --seed 99):

backend           instruction_format  n=100  n=200  n=400   n=600   n=800   n=1000
qwen3_vllm        compact             213.5  418.0  861.4   1263.4  1744.3  2162.2
qwen3_vllm        standard            254.9  475.4  909.7   1353.2  1912.5  2406.7
qwen3_vllm_score  compact             239.2  480.2  966.2   1433.5  1937.2  2428.4
qwen3_vllm_score  standard            299.6  591.8  1178.9  1773.7  2341.6  2931.7
Summary: on this machine (Tesla T4), with the current vLLM and the YAML above
(max_model_len=160, infer_batch_size=100, etc.), compact is faster than
standard for both backends; the fastest combination is qwen3_vllm + compact
(n=1000 ≈ 2.16 s) and the slowest is qwen3_vllm_score + standard (≈ 2.93 s).
The ordering may change on other GPUs or vLLM versions.
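As a quick sanity check on the table, the compact-vs-standard gap at n=1000 can be recomputed from the means above (a standalone sketch; the numbers are copied from the table):

```python
# Mean latencies (ms) at n=1000, copied from the benchmark table above.
means_n1000 = {
    ("qwen3_vllm", "compact"): 2162.2,
    ("qwen3_vllm", "standard"): 2406.7,
    ("qwen3_vllm_score", "compact"): 2428.4,
    ("qwen3_vllm_score", "standard"): 2931.7,
}

for backend in ("qwen3_vllm", "qwen3_vllm_score"):
    compact = means_n1000[(backend, "compact")]
    standard = means_n1000[(backend, "standard")]
    # Relative slowdown of the standard instruction format at n=1000.
    overhead_pct = (standard - compact) / compact * 100
    print(f"{backend}: standard is {overhead_pct:.1f}% slower than compact")
# → qwen3_vllm: standard is 11.3% slower than compact
# → qwen3_vllm_score: standard is 20.7% slower than compact
```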
config/config.yaml
... ... @@ -381,7 +381,7 @@ services:
381 381     max_docs: 1000
382 382     normalize: true
383 383     # In-process backend (read by the reranker process at startup)
384     -    backend: "qwen3_vllm_score" # bge | qwen3_vllm | qwen3_vllm_score | qwen3_transformers | qwen3_transformers_packed | qwen3_gguf | qwen3_gguf_06b | dashscope_rerank
    384 +    backend: "qwen3_vllm" # bge | qwen3_vllm | qwen3_vllm_score | qwen3_transformers | qwen3_transformers_packed | qwen3_gguf | qwen3_gguf_06b | dashscope_rerank
385 385     backends:
386 386       bge:
387 387         model_name: "BAAI/bge-reranker-v2-m3"
... ... @@ -403,6 +403,7 @@ services:
403 403         infer_batch_size: 100
404 404         sort_by_doc_length: true
405 405         # Matches reranker/backends/qwen3_vllm.py: standard=_format_instruction__standard (fixed yes/no system prompt); compact=_format_instruction (instruction as system, with Instruct repeated in the user block)
    406 +        # instruction_format: compact
406 407         instruction_format: compact
407 408         # instruction: "Given a query, score the product for relevance"
408 409         # "rank products by given query" works slightly better than "Given a query, score the product for relevance"
... ... @@ -436,8 +437,8 @@ services:
436 437         infer_batch_size: 100
437 438         sort_by_doc_length: true
438 439         # Same semantics as the key of the same name under qwen3_vllm; the default standard matches the official vLLM Qwen3 reranker prefix
439     -        # instruction_format: standard
440     -        instruction_format: compact
    440 +        # instruction_format: compact
    441 +        instruction_format: standard
441 442         instruction: "Rank products by query with category & style match prioritized"
442 443       qwen3_transformers:
443 444         model_name: "Qwen/Qwen3-Reranker-0.6B"
... ...
perf_reports/reranker_vllm_instruction/2026-03-25/RESULTS.md 0 → 100644
... ... @@ -0,0 +1,61 @@
  1 +# Reranker benchmark: `qwen3_vllm` vs `qwen3_vllm_score` × `instruction_format`
  2 +
  3 +**Date:** 2026-03-25
  4 +**Host:** single GPU (Tesla T4, ~16 GiB), CUDA 12.8 (see `nvidia-smi` during run).
  5 +
  6 +## Configuration (from `config/config.yaml`)
  7 +
  8 +Shared across both backends for this run:
  9 +
  10 +| Key | Value |
  11 +|-----|-------|
  12 +| `model_name` | `Qwen/Qwen3-Reranker-0.6B` |
  13 +| `max_model_len` | 160 |
  14 +| `infer_batch_size` | 100 |
  15 +| `sort_by_doc_length` | true |
  16 +| `enable_prefix_caching` | true |
  17 +| `enforce_eager` | false |
  18 +| `dtype` | float16 |
  19 +| `tensor_parallel_size` | 1 |
  20 +| `gpu_memory_utilization` | 0.20 |
  21 +| `instruction` | `Rank products by query with category & style match prioritized` |
  22 +
  23 +`qwen3_vllm` uses vLLM **generate + logprobs** (`.venv-reranker`).
  24 +`qwen3_vllm_score` uses vLLM **`LLM.score()`** (`.venv-reranker-score`, pinned vLLM stack per `reranker/README.md`).
  25 +
  26 +## Methodology
  27 +
  28 +- Script: `python scripts/benchmark_reranker_random_titles.py 100,200,400,600,800,1000 --repeat 5` with **`--seed 99`** (see note below), **`--quiet-runs`**, **`--timeout 360`**.
  29 +- Titles: default file `/home/ubuntu/rerank_test/titles.1.8w` (one title per line).
  30 +- Query: default `健身女生T恤短袖`.
  31 +- Each scenario: **3 warm-up** requests at `n=400` (not timed), then **5 timed** runs per `n`.
  32 +- Metric: **client wall time** for `POST /rerank` (localhost), milliseconds.
  33 +- After each `services.rerank.backend` / `instruction_format` change: `./restart.sh reranker`, then poll **`GET /health`** until `backend` and `instruction_format` match the intended scenario (`reranker/server.py` was extended to expose `instruction_format` when the backend defines `_instruction_format`).
  34 +
  35 +**Note on RNG seed:** With `--seed 42`, some runs occasionally lost one sample at `n=600` (non-200 or transport error). All figures below use **`--seed 99`** so every cell has **5/5** successful runs and comparable sampled titles.
  36 +
  37 +## Raw artifacts
  38 +
  39 +JSON aggregates (means, stdev, raw `values_ms`): same directory, `qwen3_vllm_{compact,standard}.json`, `qwen3_vllm_score_{compact,standard}.json`.
  40 +
  41 +## Results — mean latency (ms)
  42 +
  43 +| backend | instruction_format | n=100 | n=200 | n=400 | n=600 | n=800 | n=1000 |
  44 +|---------|-------------------|------:|------:|------:|------:|------:|-------:|
  45 +| `qwen3_vllm` | `compact` | 213.5 | 418.0 | 861.4 | 1263.4 | 1744.3 | 2162.2 |
  46 +| `qwen3_vllm` | `standard` | 254.9 | 475.4 | 909.7 | 1353.2 | 1912.5 | 2406.7 |
  47 +| `qwen3_vllm_score` | `compact` | 239.2 | 480.2 | 966.2 | 1433.5 | 1937.2 | 2428.4 |
  48 +| `qwen3_vllm_score` | `standard` | 299.6 | 591.8 | 1178.9 | 1773.7 | 2341.6 | 2931.7 |
  49 +
  50 +## Short interpretation
  51 +
  52 +1. **`compact` vs `standard`:** For both backends, **`compact` is faster** on this setup (shorter / different chat template vs fixed yes/no system prompt + user block — see `reranker/backends/qwen3_vllm.py` / `qwen3_vllm_score.py`).
  53 +2. **`qwen3_vllm` vs `qwen3_vllm_score`:** At **`n=1000`**, **`qwen3_vllm` + `compact`** is the fastest row (~2162 ms mean); **`qwen3_vllm_score` + `standard`** is the slowest (~2932 ms). Ordering can change on other GPUs / vLLM versions / batching.
  54 +3. **Repo default** after tests: `services.rerank.backend: qwen3_vllm_score`, `instruction_format: compact` on **both** `qwen3_vllm` and `qwen3_vllm_score` blocks (patch script keeps them aligned).
  55 +
  56 +## Tooling added / changed
  57 +
  58 +- `reranker/server.py`: `/health` includes `instruction_format` when the active backend sets `_instruction_format`.
  59 +- `scripts/benchmark_reranker_random_titles.py`: `--tag`, `--json-summary-out`, `--quiet-runs`.
  60 +- `scripts/patch_rerank_vllm_benchmark_config.py`: surgical YAML patch (preserves newlines).
  61 +- `scripts/run_reranker_vllm_instruction_benchmark.sh`: full matrix driver (continues if a benchmark exits non-zero; uses `--timeout 360`).
... ...
reranker/server.py
... ... @@ -99,12 +99,17 @@ def health() -> Dict[str, Any]:
99 99      model_info = getattr(_reranker, "_model_name", None) or getattr(
100 100         _reranker, "_config", {}
101 101     ).get("model_name", _backend_name)
102     -    return {
    102 +    payload: Dict[str, Any] = {
103 103         "status": "ok" if _reranker is not None else "unavailable",
104 104         "model_loaded": _reranker is not None,
105 105         "model": model_info,
106 106         "backend": _backend_name,
107 107     }
    108 +    if _reranker is not None:
    109 +        _fmt = getattr(_reranker, "_instruction_format", None)
    110 +        if _fmt is not None:
    111 +            payload["instruction_format"] = _fmt
    112 +    return payload
108 113 

109 114 
110 115 @app.post("/rerank", response_model=RerankResponse)
... ...
scripts/benchmark_reranker_random_titles.py
... ... @@ -6,6 +6,7 @@ Randomly samples N titles from a text file (one title per line), POSTs to the
6 6 rerank HTTP API, prints wall-clock latency.
7 7  
8 8 Supports multiple N values (comma-separated) and multiple repeats per N.
  9 +Each invocation runs 3 warmup requests with n=400 first; those are not timed for summaries.
9 10  
10 11 Example:
11 12 source activate.sh
... ... @@ -149,6 +150,23 @@ def main() -> int:
149 150         action="store_true",
150 151         help="Print first ~500 chars of response body on success (last run only).",
151 152     )
    153 +    parser.add_argument(
    154 +        "--tag",
    155 +        type=str,
    156 +        default=os.environ.get("BENCH_TAG", ""),
    157 +        help="Optional label stored in --json-summary-out (default: env BENCH_TAG or empty).",
    158 +    )
    159 +    parser.add_argument(
    160 +        "--json-summary-out",
    161 +        type=Path,
    162 +        default=None,
    163 +        help="Write one JSON object with per-n latencies and aggregates for downstream tables.",
    164 +    )
    165 +    parser.add_argument(
    166 +        "--quiet-runs",
    167 +        action="store_true",
    168 +        help="Suppress per-run lines; still prints warmup lines and text summaries.",
    169 +    )
152 170     args = parser.parse_args()
153 171 
154 172     try:
... ... @@ -167,7 +185,9 @@ def main() -> int:
167 185         return 2
168 186 
169 187     titles = _load_titles(args.titles_file)
170     -    max_n = max(doc_counts)
    188 +    warmup_n = 400
    189 +    warmup_runs = 3
    190 +    max_n = max(max(doc_counts), warmup_n)
171 191     if len(titles) < max_n:
172 192         print(
173 193             f"error: file has only {len(titles)} non-empty lines, need at least {max_n}",
... ... @@ -181,6 +201,33 @@ def main() -> int:
181 201     summary: dict[int, List[float]] = {n: [] for n in doc_counts}
182 202 
183 203     with httpx.Client(timeout=args.timeout) as client:
    204 +        for w in range(warmup_runs):
    205 +            if args.seed is not None:
    206 +                random.seed(args.seed + 8_000_000 + w)
    207 +            docs_w = random.sample(titles, warmup_n)
    208 +            try:
    209 +                ok_w, status_w, _elapsed_w, scores_len_w, _text_w = _do_rerank(
    210 +                    client,
    211 +                    args.url,
    212 +                    args.query,
    213 +                    docs_w,
    214 +                    top_n=top_n,
    215 +                    normalize=normalize,
    216 +                )
    217 +            except httpx.HTTPError as exc:
    218 +                print(
    219 +                    f"warmup n={warmup_n} {w + 1}/{warmup_runs} error: request failed: {exc}",
    220 +                    file=sys.stderr,
    221 +                )
    222 +                any_fail = True
    223 +                continue
    224 +            if not ok_w:
    225 +                any_fail = True
    226 +            print(
    227 +                f"warmup n={warmup_n} {w + 1}/{warmup_runs} status={status_w} "
    228 +                f"scores={scores_len_w if scores_len_w is not None else 'n/a'} (not timed)"
    229 +            )
    230 +
184 231         for n in doc_counts:
185 232             for run_idx in range(repeat):
186 233                 if args.seed is not None:
... ... @@ -208,10 +255,11 @@ def main() -> int:
208 255                 else:
209 256                     any_fail = True
210 257 
211     -                print(
212     -                    f"n={n} run={run_idx + 1}/{repeat} status={status} "
213     -                    f"latency_ms={elapsed_ms:.2f} scores={scores_len if scores_len is not None else 'n/a'}"
214     -                )
    258 +                if not args.quiet_runs:
    259 +                    print(
    260 +                        f"n={n} run={run_idx + 1}/{repeat} status={status} "
    261 +                        f"latency_ms={elapsed_ms:.2f} scores={scores_len if scores_len is not None else 'n/a'}"
    262 +                    )
215 263                 if args.print_body_preview and text and run_idx == repeat - 1 and n == doc_counts[-1]:
216 264                     preview = text[:500] + ("…" if len(text) > 500 else "")
217 265                     print(preview)
... ... @@ -230,6 +278,33 @@ def main() -> int:
230 278             f"summary n={n} runs={len(lat)} min_ms={lo:.2f} max_ms={hi:.2f} avg_ms={avg:.2f}{extra}"
231 279         )
232 280 
    281 +    if args.json_summary_out is not None:
    282 +        per_n: dict = {}
    283 +        for n in doc_counts:
    284 +            lat = summary[n]
    285 +            row: dict = {"values_ms": lat, "runs": len(lat)}
    286 +            if lat:
    287 +                row["mean_ms"] = statistics.mean(lat)
    288 +                row["min_ms"] = min(lat)
    289 +                row["max_ms"] = max(lat)
    290 +                if len(lat) >= 2:
    291 +                    row["stdev_ms"] = statistics.stdev(lat)
    292 +            per_n[str(n)] = row
    293 +        out_obj = {
    294 +            "tag": args.tag or None,
    295 +            "doc_counts": doc_counts,
    296 +            "repeat": repeat,
    297 +            "url": args.url,
    298 +            "per_n": per_n,
    299 +            "failed": bool(any_fail),
    300 +        }
    301 +        args.json_summary_out.parent.mkdir(parents=True, exist_ok=True)
    302 +        args.json_summary_out.write_text(
    303 +            json.dumps(out_obj, ensure_ascii=False, indent=2) + "\n",
    304 +            encoding="utf-8",
    305 +        )
    306 +        print(f"wrote json summary -> {args.json_summary_out}")
    307 +
233 308     return 1 if any_fail else 0
234 309  
235 310  
... ...
scripts/patch_rerank_vllm_benchmark_config.py 0 → 100755
... ... @@ -0,0 +1,100 @@
  1 +#!/usr/bin/env python3
  2 +"""
  3 +Surgically patch config/config.yaml:
  4 +    services.rerank.backend
  5 +    services.rerank.backends.qwen3_vllm.instruction_format
  6 +    services.rerank.backends.qwen3_vllm_score.instruction_format
  7 +
  8 +Preserves comments and unrelated lines. Used for benchmark matrix runs.
  9 +"""
  10 +
  11 +from __future__ import annotations
  12 +
  13 +import argparse
  14 +import re
  15 +import sys
  16 +from pathlib import Path
  17 +
  18 +
  19 +def _with_stripped_body(line: str) -> tuple[str, str]:
  20 +    """Return (body without end newline, newline suffix including '' if none)."""
  21 +    if line.endswith("\r\n"):
  22 +        return line[:-2], "\r\n"
  23 +    if line.endswith("\n"):
  24 +        return line[:-1], "\n"
  25 +    return line, ""
  26 +
  27 +
  28 +def _patch_backend_in_rerank_block(lines: list[str], backend: str) -> None:
  29 +    in_rerank = False
  30 +    for i, line in enumerate(lines):
  31 +        if line.startswith("  rerank:"):
  32 +            in_rerank = True
  33 +            continue
  34 +        if in_rerank:
  35 +            if line.startswith("  ") and not line.startswith("    ") and line.strip():
  36 +                in_rerank = False
  37 +                continue
  38 +            body, nl = _with_stripped_body(line)
  39 +            m = re.match(r'^(\s*backend:\s*")[^"]+(".*)$', body)
  40 +            if m:
  41 +                lines[i] = f'{m.group(1)}{backend}{m.group(2)}{nl}'
  42 +                return
  43 +    raise RuntimeError("services.rerank.backend line not found")
  44 +
  45 +
  46 +def _patch_instruction_format_under_backend(
  47 +    lines: list[str], section: str, fmt: str
  48 +) -> None:
  49 +    """section is 'qwen3_vllm' or 'qwen3_vllm_score' (first line is '      qwen3_vllm:')."""
  50 +    header = f"      {section}:"
  51 +    start = None
  52 +    for i, line in enumerate(lines):
  53 +        if line.rstrip() == header:
  54 +            start = i
  55 +            break
  56 +    if start is None:
  57 +        raise RuntimeError(f"section {section!r} not found")
  58 +
  59 +    for j in range(start + 1, len(lines)):
  60 +        line = lines[j]
  61 +        body, nl = _with_stripped_body(line)
  62 +        if re.match(r"^      [a-zA-Z0-9_]+:\s*$", body):
  63 +            break
  64 +        m = re.match(r"^(\s*instruction_format:\s*)\S+", body)
  65 +        if m:
  66 +            lines[j] = f"{m.group(1)}{fmt}{nl}"
  67 +            return
  68 +    raise RuntimeError(f"instruction_format not found under {section!r}")
  69 +
  70 +
  71 +def main() -> int:
  72 +    p = argparse.ArgumentParser()
  73 +    p.add_argument(
  74 +        "--config",
  75 +        type=Path,
  76 +        default=Path(__file__).resolve().parent.parent / "config" / "config.yaml",
  77 +    )
  78 +    p.add_argument("--backend", choices=("qwen3_vllm", "qwen3_vllm_score"), required=True)
  79 +    p.add_argument(
  80 +        "--instruction-format",
  81 +        dest="instruction_format",
  82 +        choices=("compact", "standard"),
  83 +        required=True,
  84 +    )
  85 +    args = p.parse_args()
  86 +    text = args.config.read_text(encoding="utf-8")
  87 +    lines = text.splitlines(keepends=True)
  88 +    if not lines:
  89 +        print("empty config", file=sys.stderr)
  90 +        return 2
  91 +    _patch_backend_in_rerank_block(lines, args.backend)
  92 +    _patch_instruction_format_under_backend(lines, "qwen3_vllm", args.instruction_format)
  93 +    _patch_instruction_format_under_backend(lines, "qwen3_vllm_score", args.instruction_format)
  94 +    args.config.write_text("".join(lines), encoding="utf-8")
  95 +    print(f"patched {args.config}: backend={args.backend} instruction_format={args.instruction_format} (both vLLM blocks)")
  96 +    return 0
  97 +
  98 +
  99 +if __name__ == "__main__":
 100 +    raise SystemExit(main())
... ...
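The surgical, comment-preserving idea behind the patch script can be illustrated on a toy YAML fragment (a standalone sketch, not the script itself; the indentation widths and the `set_instruction_format` helper are assumptions for illustration):

```python
import re

# Toy fragment mimicking the services.rerank.backends layout (indentation assumed).
yaml_text = """\
services:
  rerank:
    backends:
      qwen3_vllm:
        instruction_format: compact  # keep this comment
"""

def set_instruction_format(text: str, section: str, fmt: str) -> str:
    """Rewrite only the instruction_format value under `section`, keeping comments."""
    lines = text.splitlines(keepends=True)
    in_section = False
    for i, line in enumerate(lines):
        if line.rstrip() == f"      {section}:":
            in_section = True
            continue
        if in_section:
            m = re.match(r"^(\s*instruction_format:\s*)\S+(.*)$", line.rstrip("\n"))
            if m:
                # Replace just the value; trailing comment survives in group(2).
                lines[i] = f"{m.group(1)}{fmt}{m.group(2)}\n"
                break
    return "".join(lines)

patched = set_instruction_format(yaml_text, "qwen3_vllm", "standard")
print(patched)
```

The same regex-over-lines approach is why the real script can round-trip `config/config.yaml` without a YAML library: nothing outside the matched value is ever rewritten.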
scripts/run_reranker_vllm_instruction_benchmark.sh 0 → 100755
... ... @@ -0,0 +1,89 @@
  1 +#!/usr/bin/env bash
  2 +# Patch config, restart reranker, wait for /health, run benchmark_reranker_random_titles.py.
  3 +# Requires: curl and the repo .venv; PyYAML is not needed (the patch script is standalone Python).
  4 +
  5 +set -euo pipefail
  6 +ROOT="$(cd "$(dirname "$0")/.." && pwd)"
  7 +cd "$ROOT"
  8 +
  9 +PYTHON="${ROOT}/.venv/bin/python"
  10 +DAY="$(date +%F)"
  11 +OUT_DIR="${ROOT}/perf_reports/reranker_vllm_instruction/${DAY}"
  12 +mkdir -p "$OUT_DIR"
  13 +
  14 +health_ok() {
  15 +  local want_backend="$1"
  16 +  local want_fmt="$2"
  17 +  local body
  18 +  if ! body="$(curl -sS --connect-timeout 2 --max-time 5 "http://127.0.0.1:6007/health" 2>/dev/null)"; then
  19 +    return 1
  20 +  fi
  21 +  echo "$body" | "$PYTHON" -c "
  22 +import json, sys
  23 +want_b, want_f = sys.argv[1], sys.argv[2]
  24 +d = json.load(sys.stdin)
  25 +if d.get('status') != 'ok' or not d.get('model_loaded'):
  26 +    sys.exit(1)
  27 +if d.get('backend') != want_b:
  28 +    sys.exit(1)
  29 +if d.get('instruction_format') != want_f:
  30 +    sys.exit(1)
  31 +sys.exit(0)
  32 +" "$want_backend" "$want_fmt"
  33 +}
  34 +
  35 +wait_health() {
  36 +  local want_backend="$1"
  37 +  local want_fmt="$2"
  38 +  local i
  39 +  for i in $(seq 1 180); do
  40 +    if health_ok "$want_backend" "$want_fmt"; then
  41 +      curl -sS "http://127.0.0.1:6007/health" | "$PYTHON" -m json.tool
  42 +      return 0
  43 +    fi
  44 +    echo "[wait] ${i}/180 backend=${want_backend} instruction_format=${want_fmt} ..."
  45 +    sleep 3
  46 +  done
  47 +  echo "[error] health did not match in time" >&2
  48 +  return 1
  49 +}
  50 +
  51 +run_one() {
  52 +  local backend="$1"
  53 +  local fmt="$2"
  54 +  local tag="${backend}|${fmt}"
  55 +  local jf="${OUT_DIR}/${backend}_${fmt}.json"
  56 +
  57 +  echo "========== ${tag} =========="
  58 +  "$PYTHON" "${ROOT}/scripts/patch_rerank_vllm_benchmark_config.py" \
  59 +    --backend "$backend" --instruction-format "$fmt"
  60 +
  61 +  "${ROOT}/restart.sh" reranker
  62 +  wait_health "$backend" "$fmt"
  63 +
  64 +  if ! "$PYTHON" "${ROOT}/scripts/benchmark_reranker_random_titles.py" \
  65 +    100,200,400,600,800,1000 \
  66 +    --repeat 5 \
  67 +    --seed 42 \
  68 +    --quiet-runs \
  69 +    --timeout 360 \
  70 +    --tag "$tag" \
  71 +    --json-summary-out "$jf"
  72 +  then
  73 +    echo "[warn] benchmark exited non-zero for ${tag} (see ${jf} failed flag / partial runs)" >&2
  74 +  fi
  75 +
  76 +  echo "artifact: $jf"
  77 +}
  78 +
  79 +run_one qwen3_vllm compact
  80 +run_one qwen3_vllm standard
  81 +run_one qwen3_vllm_score compact
  82 +run_one qwen3_vllm_score standard
  83 +
  84 +# Restore repo-default-style rerank settings (score + compact).
  85 +"$PYTHON" "${ROOT}/scripts/patch_rerank_vllm_benchmark_config.py" \
  86 +  --backend qwen3_vllm_score --instruction-format compact
  87 +"${ROOT}/restart.sh" reranker
  88 +wait_health qwen3_vllm_score compact
  89 +echo "Restored config: qwen3_vllm_score + compact. Done. Artifacts under ${OUT_DIR}"
... ...
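The per-scenario JSON artifacts written by `--json-summary-out` can then be folded into one table. A minimal sketch (the `tag` / `per_n` / `mean_ms` keys follow the `out_obj` structure in `benchmark_reranker_random_titles.py`; `summary_rows` and the example path are illustrative, not part of the repo):

```python
import json
from pathlib import Path

def summary_rows(out_dir: Path) -> list[str]:
    """One line per scenario JSON: tag plus mean latency (ms) per n."""
    rows = []
    for jf in sorted(out_dir.glob("*.json")):
        obj = json.loads(jf.read_text(encoding="utf-8"))
        # Skip n values whose runs all failed (no mean_ms recorded).
        means = {
            n: round(r["mean_ms"], 1)
            for n, r in obj["per_n"].items()
            if "mean_ms" in r
        }
        rows.append(f"{obj.get('tag') or jf.stem}: {means}")
    return rows

# e.g. summary_rows(Path("perf_reports/reranker_vllm_instruction/2026-03-25"))
```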