Commit 52ea6529b84dee02e7e10c478d0080863a61ab47
1 parent: 749d78c8
Performance test:

Two config dimensions, four scenarios:
  backend: qwen3_vllm | qwen3_vllm_score
  instruction_format: compact | standard

Ran `python scripts/benchmark_reranker_random_titles.py 100,200,400,600,800,1000 --repeat 5`
and produced a performance report.

Mean latency (ms, client wall clock for POST /rerank, --seed 99):

backend           instruction_format  n=100  n=200  n=400   n=600   n=800   n=1000
qwen3_vllm        compact             213.5  418.0  861.4   1263.4  1744.3  2162.2
qwen3_vllm        standard            254.9  475.4  909.7   1353.2  1912.5  2406.7
qwen3_vllm_score  compact             239.2  480.2  966.2   1433.5  1937.2  2428.4
qwen3_vllm_score  standard            299.6  591.8  1178.9  1773.7  2341.6  2931.7

Summary: on this machine (T4, current vLLM, and the YAML above with max_model_len=160,
infer_batch_size=100, etc.), compact is faster than standard for both backends. The
fastest overall is qwen3_vllm + compact (n=1000 ≈ 2.16 s); the slowest is
qwen3_vllm_score + standard (≈ 2.93 s). The ordering may change on other GPUs or
vLLM versions.
Showing 6 changed files with 340 additions and 9 deletions
config/config.yaml
@@ -381,7 +381,7 @@ services:
       max_docs: 1000
       normalize: true
       # In-service backend (read when the reranker process starts)
-      backend: "qwen3_vllm_score" # bge | qwen3_vllm | qwen3_vllm_score | qwen3_transformers | qwen3_transformers_packed | qwen3_gguf | qwen3_gguf_06b | dashscope_rerank
+      backend: "qwen3_vllm" # bge | qwen3_vllm | qwen3_vllm_score | qwen3_transformers | qwen3_transformers_packed | qwen3_gguf | qwen3_gguf_06b | dashscope_rerank
       backends:
         bge:
           model_name: "BAAI/bge-reranker-v2-m3"
@@ -403,6 +403,7 @@ services:
           infer_batch_size: 100
           sort_by_doc_length: true
           # Matches reranker/backends/qwen3_vllm.py: standard=_format_instruction__standard (fixed yes/no system); compact=_format_instruction (instruction as system, with Instruct repeated in the user block)
+          # instruction_format: compact
           instruction_format: compact
           # instruction: "Given a query, score the product for relevance"
           # "rank products by given query" works a bit better than "Given a query, score the product for relevance"
@@ -436,8 +437,8 @@ services:
           infer_batch_size: 100
           sort_by_doc_length: true
           # Same semantics as the identically named qwen3_vllm option; the default standard matches vLLM's official Qwen3 reranker prefix
-          # instruction_format: standard
-          instruction_format: compact
+          # instruction_format: compact
+          instruction_format: standard
           instruction: "Rank products by query with category & style match prioritized"
         qwen3_transformers:
           model_name: "Qwen/Qwen3-Reranker-0.6B"
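The two `instruction_format` values toggled above select different prompt builders in `reranker/backends/qwen3_vllm.py`. The repo's exact templates are not shown in this diff; the sketch below only illustrates the difference the YAML comment describes (standard: a fixed yes/no judging system prompt; compact: the instruction itself used as the system prompt and repeated in the user block). Function names, wording, and field markers here are assumptions, not the repo's code.

```python
# Hypothetical sketch of the two prompt layouts. The real builders
# (_format_instruction__standard / _format_instruction) may differ in wording
# and chat markup; only the structural contrast is taken from the YAML comment.

def format_standard(instruction: str, query: str, doc: str) -> tuple[str, str]:
    # standard: fixed yes/no judging system prompt; instruction travels in the user block.
    system = (
        "Judge whether the Document meets the requirements based on the Query "
        'and the Instruct provided. Note that the answer can only be "yes" or "no".'
    )
    user = f"<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}"
    return system, user


def format_compact(instruction: str, query: str, doc: str) -> tuple[str, str]:
    # compact: the (short) instruction becomes the system prompt and is
    # repeated inside the user block.
    system = instruction
    user = f"Instruct: {instruction}\nQuery: {query}\nDoc: {doc}"
    return system, user


s_sys, s_user = format_standard("rank products by given query", "健身女生T恤短袖", "瑜伽短袖T恤 女")
c_sys, c_user = format_compact("rank products by given query", "健身女生T恤短袖", "瑜伽短袖T恤 女")
# The compact system prompt is much shorter, which shrinks the per-pair token
# count — one plausible reason compact benchmarks faster in the tables below.
print(len(s_sys), len(c_sys))
```

With `max_model_len=160`, a shorter fixed prefix also leaves more of the budget for the document text itself.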
perf_reports/reranker_vllm_instruction/2026-03-25/RESULTS.md
0 → 100644
# Reranker benchmark: `qwen3_vllm` vs `qwen3_vllm_score` × `instruction_format`

**Date:** 2026-03-25
**Host:** single GPU (Tesla T4, ~16 GiB), CUDA 12.8 (see `nvidia-smi` during run).

## Configuration (from `config/config.yaml`)

Shared across both backends for this run:

| Key | Value |
|-----|-------|
| `model_name` | `Qwen/Qwen3-Reranker-0.6B` |
| `max_model_len` | 160 |
| `infer_batch_size` | 100 |
| `sort_by_doc_length` | true |
| `enable_prefix_caching` | true |
| `enforce_eager` | false |
| `dtype` | float16 |
| `tensor_parallel_size` | 1 |
| `gpu_memory_utilization` | 0.20 |
| `instruction` | `Rank products by query with category & style match prioritized` |

`qwen3_vllm` uses vLLM **generate + logprobs** (`.venv-reranker`).
`qwen3_vllm_score` uses vLLM **`LLM.score()`** (`.venv-reranker-score`, pinned vLLM stack per `reranker/README.md`).

## Methodology

- Script: `python scripts/benchmark_reranker_random_titles.py 100,200,400,600,800,1000 --repeat 5` with **`--seed 99`** (see note below), **`--quiet-runs`**, and **`--timeout 360`**.
- Titles: default file `/home/ubuntu/rerank_test/titles.1.8w` (one title per line).
- Query: default `健身女生T恤短袖` ("women's fitness short-sleeve T-shirt").
- Each scenario: **3 warm-up** requests at `n=400` (not timed), then **5 timed** runs per `n`.
- Metric: **client wall time** for `POST /rerank` (localhost), in milliseconds.
- After each `services.rerank.backend` / `instruction_format` change: `./restart.sh reranker`, then poll **`GET /health`** until `backend` and `instruction_format` matched the intended scenario (`reranker/server.py` was extended to expose `instruction_format` when the backend defines `_instruction_format`).

**Note on the RNG seed:** With `--seed 42`, some runs occasionally lost one sample at `n=600` (non-200 or transport error). All figures below use **`--seed 99`**, so every cell has **5/5** successful runs and comparable sampled titles.

## Raw artifacts

JSON aggregates (means, stdev, raw `values_ms`) live in the same directory: `qwen3_vllm_{compact,standard}.json`, `qwen3_vllm_score_{compact,standard}.json`.

## Results — mean latency (ms)

| backend | instruction_format | n=100 | n=200 | n=400 | n=600 | n=800 | n=1000 |
|---------|-------------------|------:|------:|------:|------:|------:|-------:|
| `qwen3_vllm` | `compact` | 213.5 | 418.0 | 861.4 | 1263.4 | 1744.3 | 2162.2 |
| `qwen3_vllm` | `standard` | 254.9 | 475.4 | 909.7 | 1353.2 | 1912.5 | 2406.7 |
| `qwen3_vllm_score` | `compact` | 239.2 | 480.2 | 966.2 | 1433.5 | 1937.2 | 2428.4 |
| `qwen3_vllm_score` | `standard` | 299.6 | 591.8 | 1178.9 | 1773.7 | 2341.6 | 2931.7 |

## Short interpretation

1. **`compact` vs `standard`:** For both backends, **`compact` is faster** on this setup (a shorter chat template vs the fixed yes/no system prompt plus user block — see `reranker/backends/qwen3_vllm.py` / `qwen3_vllm_score.py`).
2. **`qwen3_vllm` vs `qwen3_vllm_score`:** At **`n=1000`**, **`qwen3_vllm` + `compact`** is the fastest row (~2162 ms mean); **`qwen3_vllm_score` + `standard`** is the slowest (~2932 ms). The ordering can change on other GPUs, vLLM versions, or batching settings.
3. **Repo default** after the tests: `services.rerank.backend: qwen3_vllm_score`, with `instruction_format: compact` on **both** the `qwen3_vllm` and `qwen3_vllm_score` blocks (the patch script keeps them aligned).

## Tooling added / changed

- `reranker/server.py`: `/health` includes `instruction_format` when the active backend sets `_instruction_format`.
- `scripts/benchmark_reranker_random_titles.py`: new `--tag`, `--json-summary-out`, and `--quiet-runs` options.
- `scripts/patch_rerank_vllm_benchmark_config.py`: surgical YAML patch (preserves newlines and comments).
- `scripts/run_reranker_vllm_instruction_benchmark.sh`: full matrix driver (continues if a benchmark exits non-zero; uses `--timeout 360`).
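The timed unit in the tables above is one `POST /rerank` round trip. The request schema is not spelled out in this report, so the sketch below is a hypothetical reproduction: the JSON field names (`query`, `documents`, `top_n`, `normalize`) are guessed from the benchmark script's `_do_rerank()` arguments, and the transport is stubbed so it runs offline.

```python
# Hypothetical model of one timed benchmark request. Field names are
# assumptions based on _do_rerank()'s parameters, not a confirmed API schema.
import json
import time


def build_payload(query: str, docs: list[str], top_n: int, normalize: bool) -> dict:
    return {"query": query, "documents": docs, "top_n": top_n, "normalize": normalize}


def timed_call(send, payload: dict) -> tuple[float, object]:
    """Client wall time in ms around a single call — the metric used above."""
    t0 = time.perf_counter()
    resp = send(payload)  # real run: httpx.Client.post(".../rerank", json=payload)
    return (time.perf_counter() - t0) * 1000.0, resp


# Stub transport so the sketch is self-contained; scores match document count.
payload = build_payload("健身女生T恤短袖", ["瑜伽短袖", "牛仔裤"], top_n=2, normalize=True)
elapsed_ms, resp = timed_call(lambda p: {"scores": [0.9, 0.1][: len(p["documents"])]}, payload)
print(round(elapsed_ms, 2), json.dumps(resp, ensure_ascii=False))
```

In the real benchmark the `send` closure is an `httpx` POST to the local reranker, and each `n`-row mean aggregates five such wall-clock samples.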
reranker/server.py
@@ -99,12 +99,17 @@ def health() -> Dict[str, Any]:
     model_info = getattr(_reranker, "_model_name", None) or getattr(
         _reranker, "_config", {}
     ).get("model_name", _backend_name)
-    return {
+    payload: Dict[str, Any] = {
         "status": "ok" if _reranker is not None else "unavailable",
         "model_loaded": _reranker is not None,
         "model": model_info,
         "backend": _backend_name,
     }
+    if _reranker is not None:
+        _fmt = getattr(_reranker, "_instruction_format", None)
+        if _fmt is not None:
+            payload["instruction_format"] = _fmt
+    return payload


 @app.post("/rerank", response_model=RerankResponse)
scripts/benchmark_reranker_random_titles.py
@@ -6,6 +6,7 @@ Randomly samples N titles from a text file (one title per line), POSTs to the
 rerank HTTP API, prints wall-clock latency.

 Supports multiple N values (comma-separated) and multiple repeats per N.
+Each invocation runs 3 warm-up requests with n=400 first; those are not timed for summaries.

 Example:
     source activate.sh
@@ -149,6 +150,23 @@ def main() -> int:
         action="store_true",
         help="Print first ~500 chars of response body on success (last run only).",
     )
+    parser.add_argument(
+        "--tag",
+        type=str,
+        default=os.environ.get("BENCH_TAG", ""),
+        help="Optional label stored in --json-summary-out (default: env BENCH_TAG or empty).",
+    )
+    parser.add_argument(
+        "--json-summary-out",
+        type=Path,
+        default=None,
+        help="Write one JSON object with per-n latencies and aggregates for downstream tables.",
+    )
+    parser.add_argument(
+        "--quiet-runs",
+        action="store_true",
+        help="Suppress per-run lines; still prints warmup lines and text summaries.",
+    )
     args = parser.parse_args()

     try:
@@ -167,7 +185,9 @@ def main() -> int:
         return 2

     titles = _load_titles(args.titles_file)
-    max_n = max(doc_counts)
+    warmup_n = 400
+    warmup_runs = 3
+    max_n = max(max(doc_counts), warmup_n)
     if len(titles) < max_n:
         print(
             f"error: file has only {len(titles)} non-empty lines, need at least {max_n}",
@@ -181,6 +201,33 @@ def main() -> int:
     summary: dict[int, List[float]] = {n: [] for n in doc_counts}

     with httpx.Client(timeout=args.timeout) as client:
+        for w in range(warmup_runs):
+            if args.seed is not None:
+                random.seed(args.seed + 8_000_000 + w)
+            docs_w = random.sample(titles, warmup_n)
+            try:
+                ok_w, status_w, _elapsed_w, scores_len_w, _text_w = _do_rerank(
+                    client,
+                    args.url,
+                    args.query,
+                    docs_w,
+                    top_n=top_n,
+                    normalize=normalize,
+                )
+            except httpx.HTTPError as exc:
+                print(
+                    f"warmup n={warmup_n} {w + 1}/{warmup_runs} error: request failed: {exc}",
+                    file=sys.stderr,
+                )
+                any_fail = True
+                continue
+            if not ok_w:
+                any_fail = True
+            print(
+                f"warmup n={warmup_n} {w + 1}/{warmup_runs} status={status_w} "
+                f"scores={scores_len_w if scores_len_w is not None else 'n/a'} (not timed)"
+            )
+
         for n in doc_counts:
             for run_idx in range(repeat):
                 if args.seed is not None:
@@ -208,10 +255,11 @@ def main() -> int:
                 else:
                     any_fail = True

-                print(
-                    f"n={n} run={run_idx + 1}/{repeat} status={status} "
-                    f"latency_ms={elapsed_ms:.2f} scores={scores_len if scores_len is not None else 'n/a'}"
-                )
+                if not args.quiet_runs:
+                    print(
+                        f"n={n} run={run_idx + 1}/{repeat} status={status} "
+                        f"latency_ms={elapsed_ms:.2f} scores={scores_len if scores_len is not None else 'n/a'}"
+                    )
                 if args.print_body_preview and text and run_idx == repeat - 1 and n == doc_counts[-1]:
                     preview = text[:500] + ("…" if len(text) > 500 else "")
                     print(preview)
@@ -230,6 +278,33 @@ def main() -> int:
             f"summary n={n} runs={len(lat)} min_ms={lo:.2f} max_ms={hi:.2f} avg_ms={avg:.2f}{extra}"
         )

+    if args.json_summary_out is not None:
+        per_n: dict = {}
+        for n in doc_counts:
+            lat = summary[n]
+            row: dict = {"values_ms": lat, "runs": len(lat)}
+            if lat:
+                row["mean_ms"] = statistics.mean(lat)
+                row["min_ms"] = min(lat)
+                row["max_ms"] = max(lat)
+                if len(lat) >= 2:
+                    row["stdev_ms"] = statistics.stdev(lat)
+            per_n[str(n)] = row
+        out_obj = {
+            "tag": args.tag or None,
+            "doc_counts": doc_counts,
+            "repeat": repeat,
+            "url": args.url,
+            "per_n": per_n,
+            "failed": bool(any_fail),
+        }
+        args.json_summary_out.parent.mkdir(parents=True, exist_ok=True)
+        args.json_summary_out.write_text(
+            json.dumps(out_obj, ensure_ascii=False, indent=2) + "\n",
+            encoding="utf-8",
+        )
+        print(f"wrote json summary -> {args.json_summary_out}")
+
     return 1 if any_fail else 0
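Downstream, the four `--json-summary-out` artifacts become the markdown table in RESULTS.md. A sketch of that aggregation step, keyed to the `out_obj` shape written in the diff above (the demo data and the `|`-to-`/` tag escaping are illustrative choices, not repo code):

```python
# Turn benchmark JSON summaries into a RESULTS.md-style markdown table.
# Input dicts follow out_obj from the diff: tag, doc_counts, per_n[str(n)].mean_ms.


def results_table(summaries: list[dict]) -> str:
    ns = summaries[0]["doc_counts"]
    header = "| tag | " + " | ".join(f"n={n}" for n in ns) + " |"
    sep = "|-----|" + "|".join("----:" for _ in ns) + "|"
    rows = []
    for s in summaries:
        # Tags like "qwen3_vllm|compact" contain "|", which would break a
        # markdown table cell, so escape it for display.
        tag = (s.get("tag") or "?").replace("|", " / ")
        cells = [f'{s["per_n"][str(n)]["mean_ms"]:.1f}' for n in ns]
        rows.append("| " + tag + " | " + " | ".join(cells) + " |")
    return "\n".join([header, sep] + rows)


# Synthetic example matching the JSON shape the benchmark script writes.
demo = [{
    "tag": "qwen3_vllm|compact",
    "doc_counts": [100, 200],
    "per_n": {"100": {"mean_ms": 213.5}, "200": {"mean_ms": 418.0}},
}]
print(results_table(demo))
```

Loading the real files is then just `json.loads(Path(p).read_text())` over the four artifacts in the report directory.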
scripts/patch_rerank_vllm_benchmark_config.py

#!/usr/bin/env python3
"""
Surgically patch config/config.yaml:
    services.rerank.backend
    services.rerank.backends.qwen3_vllm.instruction_format
    services.rerank.backends.qwen3_vllm_score.instruction_format

Preserves comments and unrelated lines. Used for benchmark matrix runs.
"""

from __future__ import annotations

import argparse
import re
import sys
from pathlib import Path


def _with_stripped_body(line: str) -> tuple[str, str]:
    """Return (body without trailing newline, newline suffix; '' if none)."""
    if line.endswith("\r\n"):
        return line[:-2], "\r\n"
    if line.endswith("\n"):
        return line[:-1], "\n"
    return line, ""


def _patch_backend_in_rerank_block(lines: list[str], backend: str) -> None:
    in_rerank = False
    for i, line in enumerate(lines):
        if line.startswith("  rerank:"):
            in_rerank = True
            continue
        if in_rerank:
            if line.startswith("  ") and not line.startswith("    ") and line.strip():
                in_rerank = False
                continue
            body, nl = _with_stripped_body(line)
            m = re.match(r'^(\s*backend:\s*")[^"]+(".*)$', body)
            if m:
                lines[i] = f"{m.group(1)}{backend}{m.group(2)}{nl}"
                return
    raise RuntimeError("services.rerank.backend line not found")


def _patch_instruction_format_under_backend(
    lines: list[str], section: str, fmt: str
) -> None:
    """section is 'qwen3_vllm' or 'qwen3_vllm_score' (header line is '      qwen3_vllm:')."""
    header = f"      {section}:"
    start = None
    for i, line in enumerate(lines):
        if line.rstrip() == header:
            start = i
            break
    if start is None:
        raise RuntimeError(f"section {section!r} not found")

    for j in range(start + 1, len(lines)):
        line = lines[j]
        body, nl = _with_stripped_body(line)
        if re.match(r"^      [a-zA-Z0-9_]+:\s*$", body):
            break
        m = re.match(r"^(\s*instruction_format:\s*)\S+", body)
        if m:
            lines[j] = f"{m.group(1)}{fmt}{nl}"
            return
    raise RuntimeError(f"instruction_format not found under {section!r}")


def main() -> int:
    p = argparse.ArgumentParser()
    p.add_argument(
        "--config",
        type=Path,
        default=Path(__file__).resolve().parent.parent / "config" / "config.yaml",
    )
    p.add_argument("--backend", choices=("qwen3_vllm", "qwen3_vllm_score"), required=True)
    p.add_argument(
        "--instruction-format",
        dest="instruction_format",
        choices=("compact", "standard"),
        required=True,
    )
    args = p.parse_args()
    text = args.config.read_text(encoding="utf-8")
    lines = text.splitlines(keepends=True)
    if not lines:
        print("empty config", file=sys.stderr)
        return 2
    _patch_backend_in_rerank_block(lines, args.backend)
    _patch_instruction_format_under_backend(lines, "qwen3_vllm", args.instruction_format)
    _patch_instruction_format_under_backend(lines, "qwen3_vllm_score", args.instruction_format)
    args.config.write_text("".join(lines), encoding="utf-8")
    print(
        f"patched {args.config}: backend={args.backend} "
        f"instruction_format={args.instruction_format} (both vLLM blocks)"
    )
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
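The reason this patcher works on raw lines with `splitlines(keepends=True)` plus a newline-stripping helper, rather than a YAML round-trip, is byte fidelity: comments, CRLF endings, and a missing final newline must all survive untouched. A standalone re-derivation of that round-trip guarantee (the sample text here is synthetic):

```python
# Demonstrates the newline-preserving edit pattern used by the patch script:
# splitlines(keepends=True), strip/reattach the newline, join with "".
import re


def with_stripped_body(line: str) -> tuple[str, str]:
    if line.endswith("\r\n"):
        return line[:-2], "\r\n"
    if line.endswith("\n"):
        return line[:-1], "\n"
    return line, ""


text = 'a: 1\r\nbackend: "old"\nlast: no-newline'
lines = text.splitlines(keepends=True)
for i, line in enumerate(lines):
    body, nl = with_stripped_body(line)
    m = re.match(r'^(backend:\s*")[^"]+(".*)$', body)
    if m:
        lines[i] = f"{m.group(1)}new{m.group(2)}{nl}"
patched = "".join(lines)
print(patched)
# Only the targeted value changed: the CRLF on line 1, the LF on line 2, and
# the absent trailing newline on the last line all survive.
```

A plain `yaml.safe_load` / `yaml.dump` cycle would reformat the file and drop every comment, which is exactly what "surgical" patching avoids.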
scripts/run_reranker_vllm_instruction_benchmark.sh

#!/usr/bin/env bash
# Patch config, restart reranker, wait for /health, run benchmark_reranker_random_titles.py.
# Requires: curl. PyYAML is not needed — the config patch is a standalone Python script.

set -euo pipefail
ROOT="$(cd "$(dirname "$0")/.." && pwd)"
cd "$ROOT"

PYTHON="${ROOT}/.venv/bin/python"
DAY="$(date +%F)"
OUT_DIR="${ROOT}/perf_reports/reranker_vllm_instruction/${DAY}"
mkdir -p "$OUT_DIR"

health_ok() {
  local want_backend="$1"
  local want_fmt="$2"
  local body
  if ! body="$(curl -sS --connect-timeout 2 --max-time 5 "http://127.0.0.1:6007/health" 2>/dev/null)"; then
    return 1
  fi
  echo "$body" | "$PYTHON" -c "
import json, sys
want_b, want_f = sys.argv[1], sys.argv[2]
d = json.load(sys.stdin)
if d.get('status') != 'ok' or not d.get('model_loaded'):
    sys.exit(1)
if d.get('backend') != want_b:
    sys.exit(1)
if d.get('instruction_format') != want_f:
    sys.exit(1)
sys.exit(0)
" "$want_backend" "$want_fmt"
}

wait_health() {
  local want_backend="$1"
  local want_fmt="$2"
  local i
  for i in $(seq 1 180); do
    if health_ok "$want_backend" "$want_fmt"; then
      curl -sS "http://127.0.0.1:6007/health" | "$PYTHON" -m json.tool
      return 0
    fi
    echo "[wait] ${i}/180 backend=${want_backend} instruction_format=${want_fmt} ..."
    sleep 3
  done
  echo "[error] health did not match in time" >&2
  return 1
}

run_one() {
  local backend="$1"
  local fmt="$2"
  local tag="${backend}|${fmt}"
  local jf="${OUT_DIR}/${backend}_${fmt}.json"

  echo "========== ${tag} =========="
  "$PYTHON" "${ROOT}/scripts/patch_rerank_vllm_benchmark_config.py" \
    --backend "$backend" --instruction-format "$fmt"

  "${ROOT}/restart.sh" reranker
  wait_health "$backend" "$fmt"

  if ! "$PYTHON" "${ROOT}/scripts/benchmark_reranker_random_titles.py" \
    100,200,400,600,800,1000 \
    --repeat 5 \
    --seed 42 \
    --quiet-runs \
    --timeout 360 \
    --tag "$tag" \
    --json-summary-out "$jf"
  then
    echo "[warn] benchmark exited non-zero for ${tag} (see the failed flag / partial runs in ${jf})" >&2
  fi

  echo "artifact: $jf"
}

run_one qwen3_vllm compact
run_one qwen3_vllm standard
run_one qwen3_vllm_score compact
run_one qwen3_vllm_score standard

# Restore repo-default-style rerank settings (score + compact).
"$PYTHON" "${ROOT}/scripts/patch_rerank_vllm_benchmark_config.py" \
  --backend qwen3_vllm_score --instruction-format compact
"${ROOT}/restart.sh" reranker
wait_health qwen3_vllm_score compact
echo "Restored config: qwen3_vllm_score + compact. Done. Artifacts under ${OUT_DIR}"