RESULTS.md

Reranker benchmark: qwen3_vllm vs qwen3_vllm_score × instruction_format

Date: 2026-03-25
Host: single GPU (Tesla T4, ~16 GiB), CUDA 12.8 (see nvidia-smi during run).

Configuration (from config/config.yaml)

Shared across both backends for this run:

| Key | Value |
| --- | --- |
| model_name | Qwen/Qwen3-Reranker-0.6B |
| max_model_len | 160 |
| infer_batch_size | 100 |
| sort_by_doc_length | true |
| enable_prefix_caching | true |
| enforce_eager | false |
| dtype | float16 |
| tensor_parallel_size | 1 |
| gpu_memory_utilization | 0.20 |
| instruction | Rank products by query with category & style match prioritized |
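For reference, the shared settings above map onto a config fragment along these lines (a sketch only; the exact nesting under a services.rerank block in config/config.yaml is an assumption):

```yaml
# Sketch of the shared reranker settings; key names are from the table
# above, the surrounding structure is assumed.
services:
  rerank:
    model_name: Qwen/Qwen3-Reranker-0.6B
    max_model_len: 160
    infer_batch_size: 100
    sort_by_doc_length: true
    enable_prefix_caching: true
    enforce_eager: false
    dtype: float16
    tensor_parallel_size: 1
    gpu_memory_utilization: 0.20
    instruction: Rank products by query with category & style match prioritized
```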

qwen3_vllm uses vLLM generate + logprobs (.venv-reranker).
qwen3_vllm_score uses vLLM LLM.score() (.venv-reranker-score, pinned vLLM stack per reranker/README.md).

Methodology

  • Script: python scripts/benchmark_reranker_random_titles.py 100,200,400,600,800,1000 --repeat 5 --seed 99 --quiet-runs --timeout 360 (see the RNG seed note below).
  • Titles: default file /home/ubuntu/rerank_test/titles.1.8w (one title per line).
  • Query: default 健身女生T恤短袖 (roughly: "women's fitness short-sleeve T-shirt").
  • Each scenario: 3 warm-up requests at n=400 (not timed), then 5 timed runs per n.
  • Metric: client wall time for POST /rerank (localhost), milliseconds.
  • After each change to services.rerank.backend or instruction_format: run ./restart.sh reranker, then poll GET /health until the reported backend and instruction_format match the intended scenario (reranker/server.py was extended to expose instruction_format when the backend defines _instruction_format).

Note on RNG seed: With --seed 42, some runs occasionally lost one sample at n=600 (non-200 or transport error). All figures below use --seed 99 so every cell has 5/5 successful runs and comparable sampled titles.
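The timed part of the methodology can be sketched as a small helper (hypothetical names; the real script also does its warm-ups once at n=400 rather than per scenario, and posts to /rerank over HTTP):

```python
import statistics
import time

def time_scenario(send_rerank, n, warmups=3, repeats=5):
    """Time repeated rerank requests the way the benchmark does.

    send_rerank(n) is a caller-supplied function standing in for the
    POST /rerank call with n sampled titles; it returns once the
    response arrives. This is a sketch, not the script's actual code.
    """
    for _ in range(warmups):
        send_rerank(n)  # warm-up requests, not timed
    values_ms = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        send_rerank(n)
        # Metric: client wall time in milliseconds
        values_ms.append((time.perf_counter() - t0) * 1000.0)
    return {
        "n": n,
        "mean_ms": statistics.mean(values_ms),
        "stdev_ms": statistics.stdev(values_ms),
        "values_ms": values_ms,
    }
```

The returned dict mirrors the fields reported in the JSON aggregates (means, stdev, raw values_ms).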

Raw artifacts

JSON aggregates (means, stdev, raw values_ms): same directory, qwen3_vllm_{compact,standard}.json, qwen3_vllm_score_{compact,standard}.json.

Results — mean latency (ms)

| backend | instruction_format | n=100 | n=200 | n=400 | n=600 | n=800 | n=1000 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3_vllm | compact | 213.5 | 418.0 | 861.4 | 1263.4 | 1744.3 | 2162.2 |
| qwen3_vllm | standard | 254.9 | 475.4 | 909.7 | 1353.2 | 1912.5 | 2406.7 |
| qwen3_vllm_score | compact | 239.2 | 480.2 | 966.2 | 1433.5 | 1937.2 | 2428.4 |
| qwen3_vllm_score | standard | 299.6 | 591.8 | 1178.9 | 1773.7 | 2341.6 | 2931.7 |
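The table is easier to compare as relative slowdowns; a quick check at n=1000 (means copied from the table above):

```python
# Mean latency at n=1000 (ms), copied from the results table.
mean_ms_n1000 = {
    ("qwen3_vllm", "compact"): 2162.2,
    ("qwen3_vllm", "standard"): 2406.7,
    ("qwen3_vllm_score", "compact"): 2428.4,
    ("qwen3_vllm_score", "standard"): 2931.7,
}

def overhead(slower, faster):
    """Relative slowdown of one scenario vs another at n=1000."""
    return mean_ms_n1000[slower] / mean_ms_n1000[faster]

# standard vs compact within each backend
vllm_std_vs_compact = overhead(("qwen3_vllm", "standard"),
                               ("qwen3_vllm", "compact"))
score_std_vs_compact = overhead(("qwen3_vllm_score", "standard"),
                                ("qwen3_vllm_score", "compact"))
# slowest row vs fastest row
worst_vs_best = overhead(("qwen3_vllm_score", "standard"),
                         ("qwen3_vllm", "compact"))
```

On these numbers, standard costs roughly 11% extra for qwen3_vllm and about 21% extra for qwen3_vllm_score at n=1000, and the slowest row is about 1.36x the fastest.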

Short interpretation

  1. compact vs standard: compact is faster for both backends on this setup; its chat template is shorter than the fixed yes/no system prompt plus user block that standard uses (see reranker/backends/qwen3_vllm.py and qwen3_vllm_score.py).
  2. qwen3_vllm vs qwen3_vllm_score: At n=1000, qwen3_vllm + compact is the fastest row (~2162 ms mean) and qwen3_vllm_score + standard the slowest (~2932 ms). The ordering can change on other GPUs, vLLM versions, or batching settings.
  3. Repo default after these tests: services.rerank.backend: qwen3_vllm_score, with instruction_format: compact set on both the qwen3_vllm and qwen3_vllm_score config blocks (the patch script keeps them aligned).

Tooling added / changed

  • reranker/server.py: /health includes instruction_format when the active backend sets _instruction_format.
  • scripts/benchmark_reranker_random_titles.py: --tag, --json-summary-out, --quiet-runs.
  • scripts/patch_rerank_vllm_benchmark_config.py: surgical YAML patch (preserves newlines).
  • scripts/run_reranker_vllm_instruction_benchmark.sh: full matrix driver (continues if a benchmark exits non-zero; uses --timeout 360).
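The /health change in reranker/server.py can be pictured as follows (a minimal sketch with hypothetical names; only the instruction_format field and the _instruction_format attribute come from the description above, the rest of the payload shape is assumed):

```python
def health_payload(backend) -> dict:
    """Build a /health response body for the active reranker backend.

    Sketch only: mirrors the described behavior where the field is
    included only when the backend defines _instruction_format.
    """
    payload = {"status": "ok", "backend": type(backend).__name__}
    fmt = getattr(backend, "_instruction_format", None)
    if fmt is not None:
        # Backends without _instruction_format simply omit the field.
        payload["instruction_format"] = fmt
    return payload
```

The benchmark driver polls this endpoint after each restart to confirm the intended backend/instruction_format pair is actually live before timing anything.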