# Reranker benchmark: `qwen3_vllm` vs `qwen3_vllm_score` × `instruction_format`

**Date:** 2026-03-25
**Host:** single GPU (Tesla T4, ~16 GiB), CUDA 12.8 (see `nvidia-smi` during run).

## Configuration (from `config/config.yaml`)

Shared across both backends for this run:

| Key | Value |
|-----|-------|
| `model_name` | `Qwen/Qwen3-Reranker-0.6B` |
| `max_model_len` | 160 |
| `infer_batch_size` | 100 |
| `sort_by_doc_length` | true |
| `enable_prefix_caching` | true |
| `enforce_eager` | false |
| `dtype` | float16 |
| `tensor_parallel_size` | 1 |
| `gpu_memory_utilization` | 0.20 |
| `instruction` | `Rank products by query with category & style match prioritized` |

`qwen3_vllm` uses vLLM **generate + logprobs** (`.venv-reranker`). `qwen3_vllm_score` uses vLLM **`LLM.score()`** (`.venv-reranker-score`, pinned vLLM stack per `reranker/README.md`).

## Methodology

- Script: `python scripts/benchmark_reranker_random_titles.py 100,200,400,600,800,1000 --repeat 5` with **`--seed 99`** (see note below), **`--quiet-runs`**, **`--timeout 360`**.
- Titles: default file `/home/ubuntu/rerank_test/titles.1.8w` (one title per line).
- Query: default `健身女生T恤短袖` ("women's fitness short-sleeve T-shirt").
- Each scenario: **3 warm-up** requests at `n=400` (not timed), then **5 timed** runs per `n`.
- Metric: **client wall time** for `POST /rerank` (localhost), in milliseconds.
- After each `services.rerank.backend` / `instruction_format` change: `./restart.sh reranker`, then poll **`GET /health`** until `backend` and `instruction_format` match the intended scenario (`reranker/server.py` was extended to expose `instruction_format` whenever the backend defines `_instruction_format`).

**Note on RNG seed:** With `--seed 42`, some runs occasionally lost one sample at `n=600` (non-200 response or transport error). All figures below use **`--seed 99`**, so every cell has **5/5** successful runs and comparable sampled titles.
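The seed note above hinges on one property: given the same seed, every scenario must rerank the *same* sampled titles, so latency differences reflect the backend, not the inputs. A minimal sketch of how such deterministic sampling can work (the actual logic lives in `scripts/benchmark_reranker_random_titles.py`; `sample_titles` and the stand-in title list are hypothetical names, not the script's real API):

```python
import random


def sample_titles(titles: list[str], n: int, seed: int) -> list[str]:
    """Draw n titles with a dedicated, seeded RNG.

    Using random.Random(seed) rather than the global RNG means the
    sample depends only on (titles, n, seed) -- repeated runs and
    different backend scenarios see identical inputs.
    """
    return random.Random(seed).sample(titles, n)


# Stand-in for the ~18k-line titles.1.8w file.
titles = [f"title-{i}" for i in range(18000)]

a = sample_titles(titles, 100, seed=99)
b = sample_titles(titles, 100, seed=99)
assert a == b  # same seed -> identical title set across scenarios
```

With this property, switching `--seed 42` to `--seed 99` changes *which* titles are sampled, but keeps them fixed across all four backend × format scenarios.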
## Raw artifacts

JSON aggregates (means, stdev, raw `values_ms`) live in the same directory: `qwen3_vllm_{compact,standard}.json`, `qwen3_vllm_score_{compact,standard}.json`.

## Results — mean latency (ms)

| backend | instruction_format | n=100 | n=200 | n=400 | n=600 | n=800 | n=1000 |
|---------|-------------------|------:|------:|------:|------:|------:|-------:|
| `qwen3_vllm` | `compact` | 213.5 | 418.0 | 861.4 | 1263.4 | 1744.3 | 2162.2 |
| `qwen3_vllm` | `standard` | 254.9 | 475.4 | 909.7 | 1353.2 | 1912.5 | 2406.7 |
| `qwen3_vllm_score` | `compact` | 239.2 | 480.2 | 966.2 | 1433.5 | 1937.2 | 2428.4 |
| `qwen3_vllm_score` | `standard` | 299.6 | 591.8 | 1178.9 | 1773.7 | 2341.6 | 2931.7 |

## Short interpretation

1. **`compact` vs `standard`:** For both backends, **`compact` is faster** on this setup (it builds a shorter prompt than the fixed yes/no system prompt + user block — see `reranker/backends/qwen3_vllm.py` / `qwen3_vllm_score.py`).
2. **`qwen3_vllm` vs `qwen3_vllm_score`:** At **`n=1000`**, **`qwen3_vllm` + `compact`** is the fastest row (~2162 ms mean); **`qwen3_vllm_score` + `standard`** is the slowest (~2932 ms). The ordering can change on other GPUs, vLLM versions, or batching settings.
3. **Repo default** after tests: `services.rerank.backend: qwen3_vllm_score`, with `instruction_format: compact` set on **both** the `qwen3_vllm` and `qwen3_vllm_score` blocks (the patch script keeps them aligned).

## Tooling added / changed

- `reranker/server.py`: `/health` includes `instruction_format` when the active backend sets `_instruction_format`.
- `scripts/benchmark_reranker_random_titles.py`: new `--tag`, `--json-summary-out`, `--quiet-runs` flags.
- `scripts/patch_rerank_vllm_benchmark_config.py`: surgical YAML patch (preserves newlines).
- `scripts/run_reranker_vllm_instruction_benchmark.sh`: full matrix driver (continues if a benchmark exits non-zero; uses `--timeout 360`).
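Since the JSON aggregates keep the raw `values_ms` alongside the precomputed means, any cell in the table above can be re-derived or sanity-checked offline. A sketch, assuming one record per `n` with a `values_ms` list (the exact schema of `qwen3_vllm_{compact,standard}.json` may differ; the numbers here are illustrative, not from the run):

```python
import statistics

# Hypothetical shape of one cell from a summary JSON; values are made up
# for illustration and do not correspond to any row in the results table.
cell = {"n": 1000, "values_ms": [2140.1, 2155.8, 2162.0, 2170.3, 2182.9]}

mean_ms = statistics.mean(cell["values_ms"])
stdev_ms = statistics.stdev(cell["values_ms"])  # sample stdev, n-1 denominator

print(f"n={cell['n']}: mean={mean_ms:.1f} ms, stdev={stdev_ms:.1f} ms")
```

Recomputing the mean from `values_ms` and comparing it against the stored aggregate is a cheap consistency check when collating results across the four scenario files.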