# Local Translation Model Benchmark Report

Test script:
- [`scripts/benchmark_translation_local_models.py`](/data/saas-search/scripts/benchmark_translation_local_models.py)

Full results:
- Markdown: [`translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md)
- JSON: [`translation_local_models_extended_221846.json`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.json)

Test date:
- `2026-03-18`

Environment:
- GPU: `Tesla T4 16GB`
- Driver / CUDA: `570.158.01 / 12.8`
- Python env: `.venv-translator`
- Torch / Transformers: `2.10.0+cu128 / 5.3.0`
- Dataset: [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv)
  
## Method

This round splits the results into three categories:

- `batch_sweep`
  fixes `concurrency=1` and compares `batch_size=1/4/8/16/32/64`
- `concurrency_sweep`
  fixes `batch_size=1` and compares `concurrency=1/2/4/8/16/64`
- `batch x concurrency matrix`
  combined sweep, keeping only combinations with `batch_size * concurrency <= 128`

Common settings:
- cache disabled: `--disable-cache`
- `batch_sweep`: `256` items per case
- `concurrency_sweep`: `32` requests per case
- `matrix`: `32` requests per case
- warmup: `1` batch
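The three sweep shapes above can be sketched as case generators. This is only an illustration of the case grid; the benchmark script's actual internals and flag handling may differ:

```python
from itertools import product

# Sweep values from the Method section above
BATCH_SIZES = [1, 4, 8, 16, 32, 64]
CONCURRENCIES = [1, 2, 4, 8, 16, 64]

def batch_sweep_cases():
    """Fix concurrency=1, vary batch_size."""
    return [(b, 1) for b in BATCH_SIZES]

def concurrency_sweep_cases():
    """Fix batch_size=1, vary concurrency."""
    return [(1, c) for c in CONCURRENCIES]

def matrix_cases(limit=128):
    """Combined sweep, keeping batch_size * concurrency <= limit."""
    return [(b, c) for b, c in product(BATCH_SIZES, CONCURRENCIES)
            if b * c <= limit]

for b, c in matrix_cases():
    print(f"batch_size={b}, concurrency={c}")
```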
  
Command to reproduce:
  
```bash
cd /data/saas-search
./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
  --suite extended \
  --disable-cache \
  --serial-items-per-case 256 \
  --concurrency-requests-per-case 32 \
  --concurrency-batch-size 1 \
  --output-dir perf_reports/20260318/translation_local_models
```
  
## Key Results

### 1. Single-stream batch sweep

| Model | Direction | Best batch | Best items/s | Batch 16 items/s | Batch 16 p95 ms |
|---|---|---:|---:|---:|---:|
| `nllb-200-distilled-600m` | `zh -> en` | `64` | `58.3` | `27.28` | `769.18` |
| `nllb-200-distilled-600m` | `en -> zh` | `64` | `32.64` | `13.52` | `1649.65` |
| `opus-mt-zh-en` | `zh -> en` | `64` | `70.15` | `41.44` | `797.93` |
| `opus-mt-en-zh` | `en -> zh` | `64` | `42.47` | `24.33` | `2098.54` |

Takeaways:
- On pure throughput, all four directions peak at `batch_size=64`
- If more balanced per-batch latency also matters, `batch_size=16` is the better default candidate for bulk workloads
- Bulk throughput ranking this round: `opus-mt-zh-en` > `nllb zh->en` > `opus-mt-en-zh` > `nllb en->zh`
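One way to read the "Batch 16 p95 ms" column: dividing the batch p95 by the batch size gives a rough per-item latency at that operating point. This is a back-of-envelope ratio derived from the table, not a metric the script reports:

```python
# "Batch 16 p95 ms" values from the batch sweep table above
batch16_p95_ms = {
    ("nllb-200-distilled-600m", "zh->en"): 769.18,
    ("nllb-200-distilled-600m", "en->zh"): 1649.65,
    ("opus-mt-zh-en", "zh->en"): 797.93,
    ("opus-mt-en-zh", "en->zh"): 2098.54,
}

for (model, direction), p95 in batch16_p95_ms.items():
    # Rough per-item cost when 16 items share one batch
    per_item_ms = p95 / 16
    print(f"{model} {direction}: ~{per_item_ms:.1f} ms/item at batch 16")
```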
  
### 2. Single-request concurrency sweep

| Model | Direction | c=1 items/s | c=1 p95 ms | c=8 p95 ms | c=64 p95 ms |
|---|---|---:|---:|---:|---:|
| `nllb-200-distilled-600m` | `zh -> en` | `4.17` | `373.27` | `2383.8` | `7337.3` |
| `nllb-200-distilled-600m` | `en -> zh` | `2.16` | `670.78` | `3971.01` | `14139.03` |
| `opus-mt-zh-en` | `zh -> en` | `9.21` | `179.12` | `1043.06` | `3381.58` |
| `opus-mt-en-zh` | `en -> zh` | `3.6` | `1180.37` | `3632.99` | `7950.41` |

Takeaways:
- At `batch_size=1`, raising client concurrency barely improves throughput; it mostly converts queue wait into request latency
- Online query translation should run at low concurrency; beyond `8` concurrent requests, p95 degrades sharply in all four directions
- For the online scenario, the steadiest model is `opus-mt-zh-en`; the heaviest is `nllb-200-distilled-600m en->zh`
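The "wait turns into latency" pattern is what a single serial worker predicts: with `c` single-item requests in flight, a request can sit behind up to `c - 1` others, so its latency is bounded by roughly `c` service times. A crude upper-bound check against the `opus-mt-zh-en` row (upper bound because the c=1 p95 overstates the typical service time):

```python
def serial_p95_upper_bound(c1_p95_ms: float, concurrency: int) -> float:
    """Crude bound for one serial worker: a request queued behind
    (concurrency - 1) others finishes after at most `concurrency`
    service times, each taken as the c=1 p95."""
    return concurrency * c1_p95_ms

# opus-mt-zh-en zh->en: c=1 p95 = 179.12 ms; measured c=8 p95 = 1043.06 ms
bound = serial_p95_upper_bound(179.12, 8)
print(f"serial-worker bound at c=8: {bound:.2f} ms (measured p95: 1043.06 ms)")
```

The measured c=8 p95 (1043 ms) sits below the 1433 ms bound, which is consistent with requests queuing on a single worker rather than running in parallel.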
  
### 3. batch x concurrency matrix

Highest-throughput combination per direction (under the `batch_size * concurrency <= 128` constraint):

| Model | Direction | Batch | Concurrency | Items/s | Avg req ms | Req p95 ms |
|---|---|---:|---:|---:|---:|---:|
| `nllb-200-distilled-600m` | `zh -> en` | `64` | `2` | `53.95` | `2344.92` | `3562.04` |
| `nllb-200-distilled-600m` | `en -> zh` | `64` | `1` | `34.97` | `1829.91` | `2039.18` |
| `opus-mt-zh-en` | `zh -> en` | `64` | `1` | `52.44` | `1220.35` | `2508.12` |
| `opus-mt-en-zh` | `en -> zh` | `64` | `1` | `34.94` | `1831.48` | `2473.74` |

Takeaways:
- In the current implementation, throughput is driven mainly by `batch_size`, not by `concurrency`
- At a fixed `batch_size`, raising concurrency from `1` to `2/4/8/...` changes throughput very little but clearly inflates request latency
- This suggests the local translation service behaves like a single worker with serial GPU processing; capacity planning cannot count on client concurrency to buy throughput
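A toy model of a single GPU worker illustrates why the matrix looks this way: if the server processes one batch at a time, the makespan for N requests is N × batch_time no matter how many clients are in flight, so throughput is flat in concurrency while per-request latency grows with it. The numbers below are illustrative, not measurements:

```python
def simulate_serial_server(n_requests: int, concurrency: int,
                           batch_size: int, batch_time_s: float):
    """Idealized single worker: clients keep `concurrency` requests
    in flight, but the GPU still runs one batch at a time."""
    makespan = n_requests * batch_time_s              # strictly serial batches
    throughput = n_requests * batch_size / makespan   # items/s, independent of c
    # A request waits, on average, for half the other in-flight requests
    avg_latency = ((concurrency + 1) / 2) * batch_time_s
    return throughput, avg_latency

for c in (1, 2, 4, 8):
    tp, lat = simulate_serial_server(32, c, 64, 1.2)
    print(f"c={c}: {tp:.1f} items/s, avg request latency {lat:.2f} s")
```

In this model throughput stays constant across all concurrency levels while average latency scales with `(c + 1) / 2`, matching the flat-throughput, rising-latency shape of the measured matrix.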
  
## Recommendation

- For online query translation, rely on the `concurrency_sweep` numbers and treat `batch_size=1` as the primary metric baseline
- For offline bulk translation, rely on `batch_sweep`; start from `batch_size=16` by default, then move up to `32/64` as throughput targets require
- As long as the current single-worker architecture stays, treat the allowed concurrency as a latency-budget question, not a throughput-scaling lever
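The last point can be made concrete: under the serial-worker model, a request at concurrency `c` can wait for up to `c` service times, so the usable concurrency is capped at roughly `p95 budget / single-request p95`. The 1000 ms budget below is an assumed example, not a product requirement; the per-request p95 values come from the c=1 column of the concurrency sweep:

```python
import math

def max_concurrency(p95_budget_ms: float, single_request_p95_ms: float) -> int:
    """Serial-worker cap: at concurrency c a request may wait for up to
    c service times, so c is bounded by budget / single-request p95.
    Always allow at least 1 in-flight request."""
    return max(1, math.floor(p95_budget_ms / single_request_p95_ms))

# c=1 p95 values from the concurrency sweep table
print(max_concurrency(1000, 179.12))   # opus-mt-zh-en zh->en
print(max_concurrency(1000, 670.78))   # nllb-200-distilled-600m en->zh
```

Read this way, even a generous 1 s budget only leaves room for a handful of concurrent requests on the fastest direction, and essentially none beyond one on the slower ones.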