

Local Translation Model Benchmark Report
Benchmark script:


<code>scripts/benchmark_translation_local_models.py</code>


Full results:


Markdown: <code>translation_local_models_extended_221846.md</code>
JSON: <code>translation_local_models_extended_221846.json</code>


Test date:


2026-03-18


Environment:


GPU: Tesla T4 16GB
Driver / CUDA: 570.158.01 / 12.8
Python env: .venv-translator
Torch / Transformers: 2.10.0+cu128 / 5.3.0
Dataset: <code>products_analyzed.csv</code>

Method
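This round splits the results into 3 categories:


batch_sweep: concurrency fixed at 1, comparing batch_size=1/4/8/16/32/64
concurrency_sweep: batch_size fixed at 1, comparing concurrency=1/2/4/8/16/64
batch x concurrency matrix: combined stress test, keeping only cases with batch_size * concurrency <= 128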
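
The matrix constraint can be illustrated by enumerating the combinations it admits. This is a minimal sketch, assuming the matrix reuses the same batch and concurrency values as the two sweeps (the report does not state the exact grid); `matrix_cases` is a hypothetical helper, not a function from the benchmark script.

```python
from itertools import product

# Sweep values from the Method section. Reusing them as the matrix grid
# is an assumption; the actual script may use a different grid.
BATCH_SIZES = [1, 4, 8, 16, 32, 64]
CONCURRENCIES = [1, 2, 4, 8, 16, 64]
MAX_IN_FLIGHT = 128  # upper bound on batch_size * concurrency


def matrix_cases(batch_sizes=BATCH_SIZES, concurrencies=CONCURRENCIES,
                 limit=MAX_IN_FLIGHT):
    """Return (batch_size, concurrency) pairs with batch_size * concurrency <= limit."""
    return [(b, c) for b, c in product(batch_sizes, concurrencies)
            if b * c <= limit]


cases = matrix_cases()
for batch_size, concurrency in cases:
    print(f"batch_size={batch_size:>2} concurrency={concurrency:>2}")
```

Under this assumed grid, batch_size=64 is only paired with concurrency 1 and 2, which matches the concurrency values that appear in the best-combination table for batch 64.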


Common settings:


Cache disabled: --disable-cache
batch_sweep: 256 items per case
concurrency_sweep: 32 requests per case
matrix: 32 requests per case
Warmup: 1 batch
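
The report's two headline metrics, items/s and p95 latency, could be derived from per-batch latencies roughly as follows. This is a minimal sketch, not the benchmark script's actual code: the function names, the nearest-rank percentile definition, and the assumption that wall time equals the sum of batch latencies (which only holds for a single serial stream) are all illustrative.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_serial_run(batch_latencies_ms, batch_size, warmup_batches=1):
    """Drop warmup batches, then report throughput and p95 for a concurrency=1 run."""
    measured = batch_latencies_ms[warmup_batches:]
    total_items = len(measured) * batch_size
    # Assumption: batches run back to back, so wall time is the sum of latencies.
    wall_seconds = sum(measured) / 1000.0
    return {
        "items_per_second": total_items / wall_seconds,
        "p95_ms": percentile_nearest_rank(measured, 95),
    }


# Hypothetical latencies: the first (slow) batch is the warmup and is discarded.
summary = summarize_serial_run([900, 100, 200, 300, 400], batch_size=16)
print(summary)
```

Discarding the warmup batch keeps one-time costs such as model loading or CUDA kernel compilation out of the reported percentiles.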


Reproduction command:
cd /data/saas-search
./.venv-translator/bin/python scripts/benchmark_translation_local_models.py \
  --suite extended \
  --disable-cache \
  --serial-items-per-case 256 \
  --concurrency-requests-per-case 32 \
  --concurrency-batch-size 1 \
  --output-dir perf_reports/20260318/translation_local_models

Key Results
1. Single-stream batch sweep


| Model | Direction | Best batch | Best items/s | Batch 16 items/s | Batch 16 p95 ms |
| --- | --- | --- | --- | --- | --- |
| nllb-200-distilled-600m | zh -> en | 64 | 58.3 | 27.28 | 769.18 |
| nllb-200-distilled-600m | en -> zh | 64 | 32.64 | 13.52 | 1649.65 |
| opus-mt-zh-en | zh -> en | 64 | 70.15 | 41.44 | 797.93 |
| opus-mt-en-zh | en -> zh | 64 | 42.47 | 24.33 | 2098.54 |


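Interpretation:


For raw throughput, all 4 directions peak at batch_size=64.
If per-batch latency also matters, batch_size=16 is the better candidate for the default bulk configuration.
Bulk throughput ranking this round: opus-mt-zh-en > nllb zh->en > opus-mt-en-zh > nllb en->zh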
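
To make the batch 16 versus batch 64 trade-off concrete, the throughput gain can be computed directly from the figures in the table above. This is a small illustrative sketch; the variable names are arbitrary and the numbers are copied from this report.

```python
# (model, direction, best items/s at batch 64, items/s at batch 16), from the table above.
ROWS = [
    ("nllb-200-distilled-600m", "zh -> en", 58.3, 27.28),
    ("nllb-200-distilled-600m", "en -> zh", 32.64, 13.52),
    ("opus-mt-zh-en", "zh -> en", 70.15, 41.44),
    ("opus-mt-en-zh", "en -> zh", 42.47, 24.33),
]

# Throughput gain from moving batch_size 16 -> 64, per model and direction.
gains = {(model, direction): best / at_16
         for model, direction, best, at_16 in ROWS}

# Ranking by best bulk throughput, highest first.
ranking = [(model, direction)
           for model, direction, best, _ in sorted(ROWS, key=lambda r: r[2], reverse=True)]

for key in ranking:
    print(f"{key[0]} ({key[1]}): {gains[key]:.2f}x items/s going from batch 16 to 64")
```

In every direction the gain from quadrupling the batch size is well under 4x, so batch 64 buys throughput at a less than proportional rate.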

2. Single-item request concurrency sweep


| Model | Direction | c=1 items/s | c=1 p95 ms | c=8 p95 ms | c=64 p95 ms |
| --- | --- | --- | --- | --- | --- |
| nllb-200-distilled-600m | zh -> en | 4.17 | 373.27 | 2383.8 | 7337.3 |
| nllb-200-distilled-600m | en -> zh | 2.16 | 670.78 | 3971.01 | 14139.03 |
| opus-mt-zh-en | zh -> en | 9.21 | 179.12 | 1043.06 | 3381.58 |
| opus-mt-en-zh | en -> zh | 3.6 | 1180.37 | 3632.99 | 7950.41 |


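Interpretation:


At batch_size=1, raising client concurrency barely increases throughput; it mainly turns waiting time into request latency.
Online query translation is better suited to low concurrency; at concurrency 8 and above, p95 degrades markedly in all 4 directions.
For online use, opus-mt-zh-en is the most stable; nllb-200-distilled-600m en->zh is the heaviest.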

3. batch x concurrency matrix
Highest-throughput combination per model (subject to batch_size * concurrency <= 128):


| Model | Direction | Batch | Concurrency | Items/s | Avg req ms | Req p95 ms |
| --- | --- | --- | --- | --- | --- | --- |
| nllb-200-distilled-600m | zh -> en | 64 | 2 | 53.95 | 2344.92 | 3562.04 |
| nllb-200-distilled-600m | en -> zh | 64 | 1 | 34.97 | 1829.91 | 2039.18 |
| opus-mt-zh-en | zh -> en | 64 | 1 | 52.44 | 1220.35 | 2508.12 |
| opus-mt-en-zh | en -> zh | 64 | 1 | 34.94 | 1831.48 | 2473.74 |


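Interpretation:


In the current implementation, throughput is determined mainly by batch_size, not by concurrency.
At a fixed batch_size, raising concurrency from 1 to 2/4/8/... changes throughput very little but raises request latency noticeably.
This suggests the current local translation service behaves much like a "single worker + serial GPU processing" model; capacity planning cannot count on client concurrency to buy throughput.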
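
The "single worker + serial GPU processing" reading can be illustrated with a toy queueing model. The sketch below is an idealization under stated assumptions (one FIFO worker, a constant service time, closed-loop clients that resend as soon as they get a reply); it is not the service's implementation, and `simulate_serial_worker` and the 100 ms service time are hypothetical.

```python
def simulate_serial_worker(concurrency, total_requests, service_ms):
    """Closed-loop clients against one FIFO worker that handles one request at a time.

    Returns (throughput in requests/s, mean latency ms, max latency ms).
    """
    # Every client submits its first request at t=0 (FIFO order).
    pending = [0.0] * min(concurrency, total_requests)
    issued = len(pending)
    clock = 0.0
    latencies = []
    while pending:
        submitted = pending.pop(0)
        clock = max(clock, submitted) + service_ms  # worker is strictly serial
        latencies.append(clock - submitted)
        if issued < total_requests:
            pending.append(clock)  # the freed client immediately sends its next request
            issued += 1
    throughput = total_requests * 1000.0 / clock
    return throughput, sum(latencies) / len(latencies), max(latencies)


for c in (1, 2, 4, 8):
    tput, mean_ms, max_ms = simulate_serial_worker(c, total_requests=32, service_ms=100.0)
    print(f"c={c}: {tput:.1f} req/s, mean {mean_ms:.0f} ms, max {max_ms:.0f} ms")
```

In this model throughput is the same at every concurrency level while latency grows with the queue depth, which is the qualitative pattern the sweep results show. The measured numbers need not scale this cleanly.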

Recommendation

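For online query translation, look at concurrency_sweep first and use batch_size=1 as the primary metric baseline.
For offline bulk translation, look at batch_sweep first: start from batch_size=16 by default, then decide whether to move up to 32/64 based on the throughput target.
If the current single-worker architecture is kept, the allowed concurrency should be treated as a latency-budget question, not as a way to scale throughput.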
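
Treating concurrency as a latency-budget question could look like the following sketch, which picks the highest measured concurrency whose p95 stays within a budget. The p95 values are copied from the concurrency sweep table in this report (only c=1, 8, and 64 are listed there); the helper name, the dictionary keys, and the budget values are hypothetical.

```python
# p95 latency in ms at batch_size=1, keyed by concurrency (from the concurrency sweep table).
P95_MS = {
    "nllb-200-distilled-600m zh->en": {1: 373.27, 8: 2383.8, 64: 7337.3},
    "nllb-200-distilled-600m en->zh": {1: 670.78, 8: 3971.01, 64: 14139.03},
    "opus-mt-zh-en": {1: 179.12, 8: 1043.06, 64: 3381.58},
    "opus-mt-en-zh": {1: 1180.37, 8: 3632.99, 64: 7950.41},
}


def max_concurrency_within_budget(p95_by_concurrency, budget_ms):
    """Largest measured concurrency whose p95 is within budget_ms, or None if none qualifies."""
    within = [c for c, p95 in p95_by_concurrency.items() if p95 <= budget_ms]
    return max(within) if within else None


budget_ms = 500.0  # hypothetical online latency budget
for name, p95 in P95_MS.items():
    print(name, "->", max_concurrency_within_budget(p95, budget_ms))
```

A `None` result means that even a single in-flight request exceeds the budget for that model and direction, so adjusting concurrency alone would not meet it.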