Commit 52ea6529b84dee02e7e10c478d0080863a61ab47
Parent: 749d78c8
Performance test:

Two config knobs, four scenarios — backend: qwen3_vllm | qwen3_vllm_score; instruction_format: compact | standard. Ran `python scripts/benchmark_reranker_random_titles.py 100,200,400,600,800,1000 --repeat 5` for each scenario and produced a performance report.

Mean latency (ms, client wall clock for `POST /rerank`, `--seed 99`):

| backend | instruction_format | n=100 | n=200 | n=400 | n=600 | n=800 | n=1000 |
|---------|-------------------|------:|------:|------:|------:|------:|-------:|
| qwen3_vllm | compact | 213.5 | 418.0 | 861.4 | 1263.4 | 1744.3 | 2162.2 |
| qwen3_vllm | standard | 254.9 | 475.4 | 909.7 | 1353.2 | 1912.5 | 2406.7 |
| qwen3_vllm_score | compact | 239.2 | 480.2 | 966.2 | 1433.5 | 1937.2 | 2428.4 |
| qwen3_vllm_score | standard | 299.6 | 591.8 | 1178.9 | 1773.7 | 2341.6 | 2931.7 |

Takeaway: on this host (T4, current vLLM, the YAML above: max_model_len=160, infer_batch_size=100, etc.), compact is faster than standard for both backends. The fastest combination overall is qwen3_vllm + compact (n=1000 ≈ 2.16 s); the slowest is qwen3_vllm_score + standard (≈ 2.93 s). The ordering may differ on other GPUs / vLLM versions.
Showing 6 changed files with 340 additions and 9 deletions
config/config.yaml
| ... | ... | @@ -381,7 +381,7 @@ services: |
| 381 | 381 | max_docs: 1000 |
| 382 | 382 | normalize: true |
| 383 | 383 | # In-service backend (read when the reranker process starts) |
| 384 | - backend: "qwen3_vllm_score" # bge | qwen3_vllm | qwen3_vllm_score | qwen3_transformers | qwen3_transformers_packed | qwen3_gguf | qwen3_gguf_06b | dashscope_rerank | |
| 384 | + backend: "qwen3_vllm" # bge | qwen3_vllm | qwen3_vllm_score | qwen3_transformers | qwen3_transformers_packed | qwen3_gguf | qwen3_gguf_06b | dashscope_rerank | |
| 385 | 385 | backends: |
| 386 | 386 | bge: |
| 387 | 387 | model_name: "BAAI/bge-reranker-v2-m3" |
| ... | ... | @@ -403,6 +403,7 @@ services: |
| 403 | 403 | infer_batch_size: 100 |
| 404 | 404 | sort_by_doc_length: true |
| 405 | 405 | # Matches reranker/backends/qwen3_vllm.py: standard=_format_instruction__standard (fixed yes/no system prompt); compact=_format_instruction (instruction used as system prompt, with the Instruct repeated in the user block) |
| 406 | + # instruction_format: compact | |
| 406 | 407 | instruction_format: compact |
| 407 | 408 | # instruction: "Given a query, score the product for relevance" |
| 408 | 409 | # "rank products by given query" works slightly better than "Given a query, score the product for relevance" |
| ... | ... | @@ -436,8 +437,8 @@ services: |
| 436 | 437 | infer_batch_size: 100 |
| 437 | 438 | sort_by_doc_length: true |
| 438 | 439 | # Same semantics as the key of the same name under qwen3_vllm; the default standard matches the official vLLM Qwen3 reranker prefix |
| 439 | - # instruction_format: standard | |
| 440 | - instruction_format: compact | |
| 440 | + # instruction_format: compact | |
| 441 | + instruction_format: standard | |
| 441 | 442 | instruction: "Rank products by query with category & style match prioritized" |
| 442 | 443 | qwen3_transformers: |
| 443 | 444 | model_name: "Qwen/Qwen3-Reranker-0.6B" | ... | ... |
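The YAML comments above describe the two prompt layouts only in words. The sketch below illustrates the difference as hedged pseudo-implementations; the real templates live in `reranker/backends/qwen3_vllm.py`, the function names `format_standard`/`format_compact` are made up here, and the exact `standard` system string is assumed to be the official Qwen3-reranker yes/no prompt that the comment references.

```python
# Hypothetical illustration of the two instruction_format layouts; the real
# templates are in reranker/backends/qwen3_vllm.py and may differ in wording.

def format_standard(instruction: str, query: str, doc: str) -> tuple[str, str]:
    """standard: fixed yes/no system prompt; instruction only in the user block."""
    system = (
        "Judge whether the Document meets the requirements based on the Query "
        'and the Instruct provided. Note that the answer can only be "yes" or "no".'
    )
    user = f"<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}"
    return system, user


def format_compact(instruction: str, query: str, doc: str) -> tuple[str, str]:
    """compact: the instruction doubles as the (shorter) system prompt."""
    system = instruction
    user = f"<Instruct>: {instruction}\n<Query>: {query}\n<Document>: {doc}"
    return system, user
```

With a short instruction, `compact` produces a strictly shorter prompt, which is consistent with the latency gap measured below.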
perf_reports/reranker_vllm_instruction/2026-03-25/RESULTS.md
0 → 100644
| ... | ... | @@ -0,0 +1,61 @@ |
| 1 | +# Reranker benchmark: `qwen3_vllm` vs `qwen3_vllm_score` × `instruction_format` | |
| 2 | + | |
| 3 | +**Date:** 2026-03-25 | |
| 4 | +**Host:** single GPU (Tesla T4, ~16 GiB), CUDA 12.8 (see `nvidia-smi` during run). | |
| 5 | + | |
| 6 | +## Configuration (from `config/config.yaml`) | |
| 7 | + | |
| 8 | +Shared across both backends for this run: | |
| 9 | + | |
| 10 | +| Key | Value | | |
| 11 | +|-----|-------| | |
| 12 | +| `model_name` | `Qwen/Qwen3-Reranker-0.6B` | | |
| 13 | +| `max_model_len` | 160 | | |
| 14 | +| `infer_batch_size` | 100 | | |
| 15 | +| `sort_by_doc_length` | true | | |
| 16 | +| `enable_prefix_caching` | true | | |
| 17 | +| `enforce_eager` | false | | |
| 18 | +| `dtype` | float16 | | |
| 19 | +| `tensor_parallel_size` | 1 | | |
| 20 | +| `gpu_memory_utilization` | 0.20 | | |
| 21 | +| `instruction` | `Rank products by query with category & style match prioritized` | | |
| 22 | + | |
| 23 | +`qwen3_vllm` uses vLLM **generate + logprobs** (`.venv-reranker`). | |
| 24 | +`qwen3_vllm_score` uses vLLM **`LLM.score()`** (`.venv-reranker-score`, pinned vLLM stack per `reranker/README.md`). | |
| 25 | + | |
| 26 | +## Methodology | |
| 27 | + | |
| 28 | +- Script: `python scripts/benchmark_reranker_random_titles.py 100,200,400,600,800,1000 --repeat 5` with **`--seed 99`** (see note below), **`--quiet-runs`**, **`--timeout 360`**. | |
| 29 | +- Titles: default file `/home/ubuntu/rerank_test/titles.1.8w` (one title per line). | |
| 30 | +- Query: default `健身女生T恤短袖`. | |
| 31 | +- Each scenario: **3 warm-up** requests at `n=400` (not timed), then **5 timed** runs per `n`. | |
| 32 | +- Metric: **client wall time** for `POST /rerank` (localhost), milliseconds. | |
| 33 | +- After each `services.rerank.backend` / `instruction_format` change: `./restart.sh reranker`, then **`GET /health`** until `backend` and `instruction_format` matched the intended scenario (extended `reranker/server.py` to expose `instruction_format` when the backend defines `_instruction_format`). | |
| 34 | + | |
| 35 | +**Note on RNG seed:** With `--seed 42`, some runs occasionally lost one sample at `n=600` (non-200 or transport error). All figures below use **`--seed 99`** so every cell has **5/5** successful runs and comparable sampled titles. | |
| 36 | + | |
| 37 | +## Raw artifacts | |
| 38 | + | |
| 39 | +JSON aggregates (means, stdev, raw `values_ms`): same directory, `qwen3_vllm_{compact,standard}.json`, `qwen3_vllm_score_{compact,standard}.json`. | |
| 40 | + | |
| 41 | +## Results — mean latency (ms) | |
| 42 | + | |
| 43 | +| backend | instruction_format | n=100 | n=200 | n=400 | n=600 | n=800 | n=1000 | | |
| 44 | +|---------|-------------------|------:|------:|------:|------:|------:|-------:| | |
| 45 | +| `qwen3_vllm` | `compact` | 213.5 | 418.0 | 861.4 | 1263.4 | 1744.3 | 2162.2 | | |
| 46 | +| `qwen3_vllm` | `standard` | 254.9 | 475.4 | 909.7 | 1353.2 | 1912.5 | 2406.7 | | |
| 47 | +| `qwen3_vllm_score` | `compact` | 239.2 | 480.2 | 966.2 | 1433.5 | 1937.2 | 2428.4 | | |
| 48 | +| `qwen3_vllm_score` | `standard` | 299.6 | 591.8 | 1178.9 | 1773.7 | 2341.6 | 2931.7 | | |
| 49 | + | |
| 50 | +## Short interpretation | |
| 51 | + | |
| 52 | +1. **`compact` vs `standard`:** For both backends, **`compact` is faster** on this setup (shorter / different chat template vs fixed yes/no system prompt + user block — see `reranker/backends/qwen3_vllm.py` / `qwen3_vllm_score.py`). | |
| 53 | +2. **`qwen3_vllm` vs `qwen3_vllm_score`:** At **`n=1000`**, **`qwen3_vllm` + `compact`** is the fastest row (~2162 ms mean); **`qwen3_vllm_score` + `standard`** is the slowest (~2932 ms). Ordering can change on other GPUs / vLLM versions / batching. | |
| 54 | +3. **Repo default** after tests: `services.rerank.backend: qwen3_vllm_score`, `instruction_format: compact` on **both** `qwen3_vllm` and `qwen3_vllm_score` blocks (patch script keeps them aligned). | |
| 55 | + | |
| 56 | +## Tooling added / changed | |
| 57 | + | |
| 58 | +- `reranker/server.py`: `/health` includes `instruction_format` when the active backend sets `_instruction_format`. | |
| 59 | +- `scripts/benchmark_reranker_random_titles.py`: `--tag`, `--json-summary-out`, `--quiet-runs`. | |
| 60 | +- `scripts/patch_rerank_vllm_benchmark_config.py`: surgical YAML patch (preserves newlines). | |
| 61 | +- `scripts/run_reranker_vllm_instruction_benchmark.sh`: full matrix driver (continues if a benchmark exits non-zero; uses `--timeout 360`). | ... | ... |
reranker/server.py
| ... | ... | @@ -99,12 +99,17 @@ def health() -> Dict[str, Any]: |
| 99 | 99 | model_info = getattr(_reranker, "_model_name", None) or getattr( |
| 100 | 100 | _reranker, "_config", {} |
| 101 | 101 | ).get("model_name", _backend_name) |
| 102 | - return { | |
| 102 | + payload: Dict[str, Any] = { | |
| 103 | 103 | "status": "ok" if _reranker is not None else "unavailable", |
| 104 | 104 | "model_loaded": _reranker is not None, |
| 105 | 105 | "model": model_info, |
| 106 | 106 | "backend": _backend_name, |
| 107 | 107 | } |
| 108 | + if _reranker is not None: | |
| 109 | + _fmt = getattr(_reranker, "_instruction_format", None) | |
| 110 | + if _fmt is not None: | |
| 111 | + payload["instruction_format"] = _fmt | |
| 112 | + return payload | |
| 108 | 113 | |
| 109 | 114 | |
| 110 | 115 | @app.post("/rerank", response_model=RerankResponse) | ... | ... |
scripts/benchmark_reranker_random_titles.py
| ... | ... | @@ -6,6 +6,7 @@ Randomly samples N titles from a text file (one title per line), POSTs to the |
| 6 | 6 | rerank HTTP API, prints wall-clock latency. |
| 7 | 7 | |
| 8 | 8 | Supports multiple N values (comma-separated) and multiple repeats per N. |
| 9 | +Each invocation runs 3 warmup requests with n=400 first; those are not timed for summaries. | |
| 9 | 10 | |
| 10 | 11 | Example: |
| 11 | 12 | source activate.sh |
| ... | ... | @@ -149,6 +150,23 @@ def main() -> int: |
| 149 | 150 | action="store_true", |
| 150 | 151 | help="Print first ~500 chars of response body on success (last run only).", |
| 151 | 152 | ) |
| 153 | + parser.add_argument( | |
| 154 | + "--tag", | |
| 155 | + type=str, | |
| 156 | + default=os.environ.get("BENCH_TAG", ""), | |
| 157 | + help="Optional label stored in --json-summary-out (default: env BENCH_TAG or empty).", | |
| 158 | + ) | |
| 159 | + parser.add_argument( | |
| 160 | + "--json-summary-out", | |
| 161 | + type=Path, | |
| 162 | + default=None, | |
| 163 | + help="Write one JSON object with per-n latencies and aggregates for downstream tables.", | |
| 164 | + ) | |
| 165 | + parser.add_argument( | |
| 166 | + "--quiet-runs", | |
| 167 | + action="store_true", | |
| 168 | + help="Suppress per-run lines; still prints warmup lines and text summaries.", | |
| 169 | + ) | |
| 152 | 170 | args = parser.parse_args() |
| 153 | 171 | |
| 154 | 172 | try: |
| ... | ... | @@ -167,7 +185,9 @@ def main() -> int: |
| 167 | 185 | return 2 |
| 168 | 186 | |
| 169 | 187 | titles = _load_titles(args.titles_file) |
| 170 | - max_n = max(doc_counts) | |
| 188 | + warmup_n = 400 | |
| 189 | + warmup_runs = 3 | |
| 190 | + max_n = max(max(doc_counts), warmup_n) | |
| 171 | 191 | if len(titles) < max_n: |
| 172 | 192 | print( |
| 173 | 193 | f"error: file has only {len(titles)} non-empty lines, need at least {max_n}", |
| ... | ... | @@ -181,6 +201,33 @@ def main() -> int: |
| 181 | 201 | summary: dict[int, List[float]] = {n: [] for n in doc_counts} |
| 182 | 202 | |
| 183 | 203 | with httpx.Client(timeout=args.timeout) as client: |
| 204 | + for w in range(warmup_runs): | |
| 205 | + if args.seed is not None: | |
| 206 | + random.seed(args.seed + 8_000_000 + w) | |
| 207 | + docs_w = random.sample(titles, warmup_n) | |
| 208 | + try: | |
| 209 | + ok_w, status_w, _elapsed_w, scores_len_w, _text_w = _do_rerank( | |
| 210 | + client, | |
| 211 | + args.url, | |
| 212 | + args.query, | |
| 213 | + docs_w, | |
| 214 | + top_n=top_n, | |
| 215 | + normalize=normalize, | |
| 216 | + ) | |
| 217 | + except httpx.HTTPError as exc: | |
| 218 | + print( | |
| 219 | + f"warmup n={warmup_n} {w + 1}/{warmup_runs} error: request failed: {exc}", | |
| 220 | + file=sys.stderr, | |
| 221 | + ) | |
| 222 | + any_fail = True | |
| 223 | + continue | |
| 224 | + if not ok_w: | |
| 225 | + any_fail = True | |
| 226 | + print( | |
| 227 | + f"warmup n={warmup_n} {w + 1}/{warmup_runs} status={status_w} " | |
| 228 | + f"scores={scores_len_w if scores_len_w is not None else 'n/a'} (not timed)" | |
| 229 | + ) | |
| 230 | + | |
| 184 | 231 | for n in doc_counts: |
| 185 | 232 | for run_idx in range(repeat): |
| 186 | 233 | if args.seed is not None: |
| ... | ... | @@ -208,10 +255,11 @@ def main() -> int: |
| 208 | 255 | else: |
| 209 | 256 | any_fail = True |
| 210 | 257 | |
| 211 | - print( | |
| 212 | - f"n={n} run={run_idx + 1}/{repeat} status={status} " | |
| 213 | - f"latency_ms={elapsed_ms:.2f} scores={scores_len if scores_len is not None else 'n/a'}" | |
| 214 | - ) | |
| 258 | + if not args.quiet_runs: | |
| 259 | + print( | |
| 260 | + f"n={n} run={run_idx + 1}/{repeat} status={status} " | |
| 261 | + f"latency_ms={elapsed_ms:.2f} scores={scores_len if scores_len is not None else 'n/a'}" | |
| 262 | + ) | |
| 215 | 263 | if args.print_body_preview and text and run_idx == repeat - 1 and n == doc_counts[-1]: |
| 216 | 264 | preview = text[:500] + ("…" if len(text) > 500 else "") |
| 217 | 265 | print(preview) |
| ... | ... | @@ -230,6 +278,33 @@ def main() -> int: |
| 230 | 278 | f"summary n={n} runs={len(lat)} min_ms={lo:.2f} max_ms={hi:.2f} avg_ms={avg:.2f}{extra}" |
| 231 | 279 | ) |
| 232 | 280 | |
| 281 | + if args.json_summary_out is not None: | |
| 282 | + per_n: dict = {} | |
| 283 | + for n in doc_counts: | |
| 284 | + lat = summary[n] | |
| 285 | + row: dict = {"values_ms": lat, "runs": len(lat)} | |
| 286 | + if lat: | |
| 287 | + row["mean_ms"] = statistics.mean(lat) | |
| 288 | + row["min_ms"] = min(lat) | |
| 289 | + row["max_ms"] = max(lat) | |
| 290 | + if len(lat) >= 2: | |
| 291 | + row["stdev_ms"] = statistics.stdev(lat) | |
| 292 | + per_n[str(n)] = row | |
| 293 | + out_obj = { | |
| 294 | + "tag": args.tag or None, | |
| 295 | + "doc_counts": doc_counts, | |
| 296 | + "repeat": repeat, | |
| 297 | + "url": args.url, | |
| 298 | + "per_n": per_n, | |
| 299 | + "failed": bool(any_fail), | |
| 300 | + } | |
| 301 | + args.json_summary_out.parent.mkdir(parents=True, exist_ok=True) | |
| 302 | + args.json_summary_out.write_text( | |
| 303 | + json.dumps(out_obj, ensure_ascii=False, indent=2) + "\n", | |
| 304 | + encoding="utf-8", | |
| 305 | + ) | |
| 306 | + print(f"wrote json summary -> {args.json_summary_out}") | |
| 307 | + | |
| 233 | 308 | return 1 if any_fail else 0 |
| 234 | 309 | |
| 235 | 310 | ... | ... |
scripts/patch_rerank_vllm_benchmark_config.py
| ... | ... | @@ -0,0 +1,100 @@ |
| 1 | +#!/usr/bin/env python3 | |
| 2 | +""" | |
| 3 | +Surgically patch config/config.yaml: | |
| 4 | + services.rerank.backend | |
| 5 | + services.rerank.backends.qwen3_vllm.instruction_format | |
| 6 | + services.rerank.backends.qwen3_vllm_score.instruction_format | |
| 7 | + | |
| 8 | +Preserves comments and unrelated lines. Used for benchmark matrix runs. | |
| 9 | +""" | |
| 10 | + | |
| 11 | +from __future__ import annotations | |
| 12 | + | |
| 13 | +import argparse | |
| 14 | +import re | |
| 15 | +import sys | |
| 16 | +from pathlib import Path | |
| 17 | + | |
| 18 | + | |
| 19 | +def _with_stripped_body(line: str) -> tuple[str, str]: | |
| 20 | + """Return (body without end newline, newline suffix including '' if none).""" | |
| 21 | + if line.endswith("\r\n"): | |
| 22 | + return line[:-2], "\r\n" | |
| 23 | + if line.endswith("\n"): | |
| 24 | + return line[:-1], "\n" | |
| 25 | + return line, "" | |
| 26 | + | |
| 27 | + | |
| 28 | +def _patch_backend_in_rerank_block(lines: list[str], backend: str) -> None: | |
| 29 | + in_rerank = False | |
| 30 | + for i, line in enumerate(lines): | |
| 31 | + if line.startswith("  rerank:"): | |
| 32 | + in_rerank = True | |
| 33 | + continue | |
| 34 | + if in_rerank: | |
| 35 | + if line.startswith("  ") and not line.startswith("    ") and line.strip(): | |
| 36 | + in_rerank = False | |
| 37 | + continue | |
| 38 | + body, nl = _with_stripped_body(line) | |
| 39 | + m = re.match(r'^(\s*backend:\s*")[^"]+(".*)$', body) | |
| 40 | + if m: | |
| 41 | + lines[i] = f'{m.group(1)}{backend}{m.group(2)}{nl}' | |
| 42 | + return | |
| 43 | + raise RuntimeError("services.rerank.backend line not found") | |
| 44 | + | |
| 45 | + | |
| 46 | +def _patch_instruction_format_under_backend( | |
| 47 | + lines: list[str], section: str, fmt: str | |
| 48 | +) -> None: | |
| 49 | + """section is 'qwen3_vllm' or 'qwen3_vllm_score' (first line is '      qwen3_vllm:').""" | |
| 50 | + header = f"      {section}:" | |
| 51 | + start = None | |
| 52 | + for i, line in enumerate(lines): | |
| 53 | + if line.rstrip() == header: | |
| 54 | + start = i | |
| 55 | + break | |
| 56 | + if start is None: | |
| 57 | + raise RuntimeError(f"section {section!r} not found") | |
| 58 | + | |
| 59 | + for j in range(start + 1, len(lines)): | |
| 60 | + line = lines[j] | |
| 61 | + body, nl = _with_stripped_body(line) | |
| 62 | + if re.match(r"^      [a-zA-Z0-9_]+:\s*$", body): | |
| 63 | + break | |
| 64 | + m = re.match(r"^(\s*instruction_format:\s*)\S+", body) | |
| 65 | + if m: | |
| 66 | + lines[j] = f"{m.group(1)}{fmt}{nl}" | |
| 67 | + return | |
| 68 | + raise RuntimeError(f"instruction_format not found under {section!r}") | |
| 69 | + | |
| 70 | + | |
| 71 | +def main() -> int: | |
| 72 | + p = argparse.ArgumentParser() | |
| 73 | + p.add_argument( | |
| 74 | + "--config", | |
| 75 | + type=Path, | |
| 76 | + default=Path(__file__).resolve().parent.parent / "config" / "config.yaml", | |
| 77 | + ) | |
| 78 | + p.add_argument("--backend", choices=("qwen3_vllm", "qwen3_vllm_score"), required=True) | |
| 79 | + p.add_argument( | |
| 80 | + "--instruction-format", | |
| 81 | + dest="instruction_format", | |
| 82 | + choices=("compact", "standard"), | |
| 83 | + required=True, | |
| 84 | + ) | |
| 85 | + args = p.parse_args() | |
| 86 | + text = args.config.read_text(encoding="utf-8") | |
| 87 | + lines = text.splitlines(keepends=True) | |
| 88 | + if not lines: | |
| 89 | + print("empty config", file=sys.stderr) | |
| 90 | + return 2 | |
| 91 | + _patch_backend_in_rerank_block(lines, args.backend) | |
| 92 | + _patch_instruction_format_under_backend(lines, "qwen3_vllm", args.instruction_format) | |
| 93 | + _patch_instruction_format_under_backend(lines, "qwen3_vllm_score", args.instruction_format) | |
| 94 | + args.config.write_text("".join(lines), encoding="utf-8") | |
| 95 | + print(f"patched {args.config}: backend={args.backend} instruction_format={args.instruction_format} (both vLLM blocks)") | |
| 96 | + return 0 | |
| 97 | + | |
| 98 | + | |
| 99 | +if __name__ == "__main__": | |
| 100 | + raise SystemExit(main()) | ... | ... |
scripts/run_reranker_vllm_instruction_benchmark.sh
| ... | ... | @@ -0,0 +1,89 @@ |
| 1 | +#!/usr/bin/env bash | |
| 2 | +# Patch config, restart reranker, wait for /health, run benchmark_reranker_random_titles.py. | |
| 3 | +# Requires: curl. PyYAML is not needed (the patch script is standalone Python). | |
| 4 | + | |
| 5 | +set -euo pipefail | |
| 6 | +ROOT="$(cd "$(dirname "$0")/.." && pwd)" | |
| 7 | +cd "$ROOT" | |
| 8 | + | |
| 9 | +PYTHON="${ROOT}/.venv/bin/python" | |
| 10 | +DAY="$(date +%F)" | |
| 11 | +OUT_DIR="${ROOT}/perf_reports/reranker_vllm_instruction/${DAY}" | |
| 12 | +mkdir -p "$OUT_DIR" | |
| 13 | + | |
| 14 | +health_ok() { | |
| 15 | + local want_backend="$1" | |
| 16 | + local want_fmt="$2" | |
| 17 | + local body | |
| 18 | + if ! body="$(curl -sS --connect-timeout 2 --max-time 5 "http://127.0.0.1:6007/health" 2>/dev/null)"; then | |
| 19 | + return 1 | |
| 20 | + fi | |
| 21 | + echo "$body" | "$PYTHON" -c " | |
| 22 | +import json, sys | |
| 23 | +want_b, want_f = sys.argv[1], sys.argv[2] | |
| 24 | +d = json.load(sys.stdin) | |
| 25 | +if d.get('status') != 'ok' or not d.get('model_loaded'): | |
| 26 | + sys.exit(1) | |
| 27 | +if d.get('backend') != want_b: | |
| 28 | + sys.exit(1) | |
| 29 | +if d.get('instruction_format') != want_f: | |
| 30 | + sys.exit(1) | |
| 31 | +sys.exit(0) | |
| 32 | +" "$want_backend" "$want_fmt" | |
| 33 | +} | |
| 34 | + | |
| 35 | +wait_health() { | |
| 36 | + local want_backend="$1" | |
| 37 | + local want_fmt="$2" | |
| 38 | + local i | |
| 39 | + for i in $(seq 1 180); do | |
| 40 | + if health_ok "$want_backend" "$want_fmt"; then | |
| 41 | + curl -sS "http://127.0.0.1:6007/health" | "$PYTHON" -m json.tool | |
| 42 | + return 0 | |
| 43 | + fi | |
| 44 | + echo "[wait] ${i}/180 backend=${want_backend} instruction_format=${want_fmt} ..." | |
| 45 | + sleep 3 | |
| 46 | + done | |
| 47 | + echo "[error] health did not match in time" >&2 | |
| 48 | + return 1 | |
| 49 | +} | |
| 50 | + | |
| 51 | +run_one() { | |
| 52 | + local backend="$1" | |
| 53 | + local fmt="$2" | |
| 54 | + local tag="${backend}|${fmt}" | |
| 55 | + local jf="${OUT_DIR}/${backend}_${fmt}.json" | |
| 56 | + | |
| 57 | + echo "========== ${tag} ==========" | |
| 58 | + "$PYTHON" "${ROOT}/scripts/patch_rerank_vllm_benchmark_config.py" \ | |
| 59 | + --backend "$backend" --instruction-format "$fmt" | |
| 60 | + | |
| 61 | + "${ROOT}/restart.sh" reranker | |
| 62 | + wait_health "$backend" "$fmt" | |
| 63 | + | |
| 64 | + if ! "$PYTHON" "${ROOT}/scripts/benchmark_reranker_random_titles.py" \ | |
| 65 | + 100,200,400,600,800,1000 \ | |
| 66 | + --repeat 5 \ | |
| 67 | + --seed 42 \ | |
| 68 | + --quiet-runs \ | |
| 69 | + --timeout 360 \ | |
| 70 | + --tag "$tag" \ | |
| 71 | + --json-summary-out "$jf" | |
| 72 | + then | |
| 73 | + echo "[warn] benchmark exited non-zero for ${tag} (see ${jf} failed flag / partial runs)" >&2 | |
| 74 | + fi | |
| 75 | + | |
| 76 | + echo "artifact: $jf" | |
| 77 | +} | |
| 78 | + | |
| 79 | +run_one qwen3_vllm compact | |
| 80 | +run_one qwen3_vllm standard | |
| 81 | +run_one qwen3_vllm_score compact | |
| 82 | +run_one qwen3_vllm_score standard | |
| 83 | + | |
| 84 | +# Restore repo-default-style rerank settings (score + compact). | |
| 85 | +"$PYTHON" "${ROOT}/scripts/patch_rerank_vllm_benchmark_config.py" \ | |
| 86 | + --backend qwen3_vllm_score --instruction-format compact | |
| 87 | +"${ROOT}/restart.sh" reranker | |
| 88 | +wait_health qwen3_vllm_score compact | |
| 89 | +echo "Restored config: qwen3_vllm_score + compact. Done. Artifacts under ${OUT_DIR}" | ... | ... |