# Local Translation Model Benchmark Report

Benchmark script:

- [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py)

Full results:

- Markdown: [`translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md)
- JSON: [`translation_local_models_extended_221846.json`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.json)

Benchmark date:

- `2026-03-18`

Environment:

- GPU: `Tesla T4 16GB`
- Driver / CUDA: `570.158.01 / 12.8`
- Python env: `.venv-translator`
- Torch / Transformers: `2.10.0+cu128 / 5.3.0`
- Dataset: [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv)

## Method

This round splits the results into three categories:

- `batch_sweep`: fix `concurrency=1`, compare `batch_size=1/4/8/16/32/64`
- `concurrency_sweep`: fix `batch_size=1`, compare `concurrency=1/2/4/8/16/64`
- `batch x concurrency matrix`: combined load test, keeping only combinations with `batch_size * concurrency <= 128`

Shared settings:

- Cache disabled: `--disable-cache`
- `batch_sweep`: `256` items per case
- `concurrency_sweep`: `32` requests per case
- `matrix`: `32` requests per case
- Warmup: `1` batch

Reproduction command:

```bash
cd /data/saas-search
./.venv-translator/bin/python benchmarks/translation/benchmark_translation_local_models.py \
  --suite extended \
  --disable-cache \
  --serial-items-per-case 256 \
  --concurrency-requests-per-case 32 \
  --concurrency-batch-size 1 \
  --output-dir perf_reports/20260318/translation_local_models
```

## Key Results

### 1. Single-stream batch sweep
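The measurement loop behind these sweeps can be sketched as follows. This is a minimal illustration of how per-case items/s and per-batch p95 are typically computed, not the actual benchmark script; `translate_batch` is a hypothetical stub (fixed per-call overhead plus per-item cost) standing in for a real model call:

```python
import time


def translate_batch(texts):
    """Hypothetical model stand-in: ~5 ms fixed overhead + ~1 ms per item."""
    time.sleep(0.005 + 0.001 * len(texts))
    return [t[::-1] for t in texts]


def run_case(items, batch_size, warmup_batches=1):
    """Measure items/s and per-batch p95 latency for one batch_size setting."""
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    for b in batches[:warmup_batches]:      # warmup batch, excluded from stats
        translate_batch(b)
    latencies_ms, translated = [], 0
    start = time.perf_counter()
    for b in batches[warmup_batches:]:
        t0 = time.perf_counter()
        translate_batch(b)
        latencies_ms.append((time.perf_counter() - t0) * 1000)
        translated += len(b)
    elapsed = time.perf_counter() - start
    latencies_ms.sort()
    p95 = latencies_ms[max(0, int(len(latencies_ms) * 0.95) - 1)]
    return {"items_per_s": translated / elapsed, "p95_ms": p95}


items = [f"item {i}" for i in range(256)]
for bs in (1, 16, 64):
    stats = run_case(items, bs)
    print(f"batch={bs}: {stats['items_per_s']:.1f} items/s, p95={stats['p95_ms']:.1f} ms")
```

Because the stub has a fixed per-call cost, the sketch reproduces the qualitative shape of the tables below: items/s rises with `batch_size` while per-batch p95 also rises.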
| Model | Direction | Best batch | Best items/s | Batch 16 items/s | Batch 16 p95 ms |
|---|---|---:|---:|---:|---:|
| `nllb-200-distilled-600m` | `zh -> en` | `64` | `58.3` | `27.28` | `769.18` |
| `nllb-200-distilled-600m` | `en -> zh` | `64` | `32.64` | `13.52` | `1649.65` |
| `opus-mt-zh-en` | `zh -> en` | `64` | `70.15` | `41.44` | `797.93` |
| `opus-mt-en-zh` | `en -> zh` | `64` | `42.47` | `24.33` | `2098.54` |

Takeaways:

- For raw throughput, all four directions peak at `batch_size=64`
- If more balanced per-batch latency also matters, `batch_size=16` is the better candidate for the default bulk configuration
- Bulk throughput ranking this round: `opus-mt-zh-en` > `nllb zh->en` > `opus-mt-en-zh` > `nllb en->zh`

### 2. Single-request concurrency sweep

| Model | Direction | c=1 items/s | c=1 p95 ms | c=8 p95 ms | c=64 p95 ms |
|---|---|---:|---:|---:|---:|
| `nllb-200-distilled-600m` | `zh -> en` | `4.17` | `373.27` | `2383.8` | `7337.3` |
| `nllb-200-distilled-600m` | `en -> zh` | `2.16` | `670.78` | `3971.01` | `14139.03` |
| `opus-mt-zh-en` | `zh -> en` | `9.21` | `179.12` | `1043.06` | `3381.58` |
| `opus-mt-en-zh` | `en -> zh` | `3.6` | `1180.37` | `3632.99` | `7950.41` |

Takeaways:

- With `batch_size=1`, raising client concurrency barely improves throughput; it mostly converts queueing time into request latency
- Online query translation is better served at low concurrency; beyond `8` concurrent clients, p95 degrades sharply in all four directions
- For online use, `opus-mt-zh-en` is the most stable; `nllb-200-distilled-600m en->zh` is the heaviest

### 3. Batch x concurrency matrix
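The "concurrency converts queueing time into latency" effect can be demonstrated with a small simulation. This is an illustrative sketch, not the benchmark harness: a lock models the single serialized GPU worker, and `SERVICE_S` is an assumed per-request service time:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

LOCK = threading.Lock()  # models the single GPU worker serializing all requests
SERVICE_S = 0.01         # assumed per-request service time


def translate_one(_):
    """Return end-to-end latency (ms) of one request, including queueing."""
    t0 = time.perf_counter()
    with LOCK:                  # only one request occupies the "GPU" at a time
        time.sleep(SERVICE_S)
    return (time.perf_counter() - t0) * 1000


def sweep(concurrency, n_requests=32):
    """Run n_requests at the given client concurrency; return (req/s, p95 ms)."""
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        lat = sorted(pool.map(translate_one, range(n_requests)))
    elapsed = time.perf_counter() - t0
    p95 = lat[max(0, int(len(lat) * 0.95) - 1)]
    return n_requests / elapsed, p95


for c in (1, 8):
    rps, p95 = sweep(c)
    print(f"c={c}: {rps:.1f} req/s, p95={p95:.1f} ms")
```

With the server fully serialized, throughput stays pinned near `1 / SERVICE_S` regardless of client concurrency, while p95 grows roughly in proportion to the number of requests queued behind each one, which matches the pattern in the sweep above.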
Peak-throughput combinations (under the `batch_size * concurrency <= 128` constraint):

| Model | Direction | Batch | Concurrency | Items/s | Avg req ms | Req p95 ms |
|---|---|---:|---:|---:|---:|---:|
| `nllb-200-distilled-600m` | `zh -> en` | `64` | `2` | `53.95` | `2344.92` | `3562.04` |
| `nllb-200-distilled-600m` | `en -> zh` | `64` | `1` | `34.97` | `1829.91` | `2039.18` |
| `opus-mt-zh-en` | `zh -> en` | `64` | `1` | `52.44` | `1220.35` | `2508.12` |
| `opus-mt-en-zh` | `en -> zh` | `64` | `1` | `34.94` | `1831.48` | `2473.74` |

Takeaways:

- In the current implementation, throughput is driven by `batch_size`, not by `concurrency`
- At a fixed `batch_size`, raising concurrency from `1` to `2/4/8/...` changes throughput very little but clearly inflates request latency
- This suggests the local translation service behaves like a "single worker + serial GPU processing" model; capacity planning cannot count on client-side concurrency to buy throughput

## Recommendation

- For online query translation, treat `concurrency_sweep` as the primary reference and `batch_size=1` as the headline metric
- For offline bulk translation, follow `batch_sweep`: start from `batch_size=16` by default, then move up to `32/64` only if throughput targets require it
- As long as the current single-worker architecture stays, treat "allowed concurrency" as a latency-budget question, not a throughput-scaling lever
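The last recommendation can be made concrete with a small sizing helper. Under the serial-worker assumption above, a request admitted behind `c - 1` others waits roughly `c / throughput`, so a p95 latency budget caps the allowed concurrency. The helper below is illustrative arithmetic, not part of the benchmark tooling; the `500` ms budget is an assumed example value:

```python
def max_concurrency(throughput_items_s: float, p95_budget_ms: float) -> int:
    """For a serial worker, a request behind c-1 others takes roughly
    c / throughput seconds end to end, so the p95 budget caps c."""
    per_item_ms = 1000.0 / throughput_items_s
    return max(1, int(p95_budget_ms // per_item_ms))


# Example: opus-mt-zh-en measured ~9.21 items/s at batch_size=1, c=1
# (~109 ms per item); an assumed 500 ms p95 budget then allows c ≈ 4.
print(max_concurrency(9.21, 500))  # -> 4
```

Raising the throughput side of this formula (a faster model, or a second worker) is what actually buys concurrency headroom; widening the client pool alone only spends the latency budget faster.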