# Local Translation Model Benchmark Report
Test script:
- [`benchmarks/translation/benchmark_translation_local_models.py`](/data/saas-search/benchmarks/translation/benchmark_translation_local_models.py)
Full results:
- Markdown: [`translation_local_models_extended_221846.md`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.md)
- JSON: [`translation_local_models_extended_221846.json`](/data/saas-search/perf_reports/20260318/translation_local_models/translation_local_models_extended_221846.json)
Test date:
- `2026-03-18`
Environment:
- GPU: `Tesla T4 16GB`
- Driver / CUDA: `570.158.01 / 12.8`
- Python env: `.venv-translator`
- Torch / Transformers: `2.10.0+cu128 / 5.3.0`
- Dataset: [`products_analyzed.csv`](/data/saas-search/products_analyzed.csv)
## Method
This round splits the results into 3 categories:
- `batch_sweep`: fixes `concurrency=1` and compares `batch_size=1/4/8/16/32/64`
- `concurrency_sweep`: fixes `batch_size=1` and compares `concurrency=1/2/4/8/16/64`
- `batch x concurrency matrix`: combined load test, keeping only combinations with `batch_size * concurrency <= 128` (see the enumeration sketch after the settings list)
Shared settings:
- Cache disabled: `--disable-cache`
- `batch_sweep`: `256` items per case
- `concurrency_sweep`: `32` requests per case
- `matrix`: `32` requests per case
- Warmup: `1` batch
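A minimal sketch of how the matrix cases can be enumerated under the `batch_size * concurrency <= 128` cap. The value lists mirror the two sweeps described above; whether the matrix uses exactly these grids, and the helper name itself, are assumptions and not taken from the benchmark script:

```python
# Enumerate the batch x concurrency combinations kept by the matrix sweep.
# The value lists mirror the sweeps described above; matrix_cases() is a
# hypothetical helper, not code from benchmark_translation_local_models.py.
from itertools import product

BATCH_SIZES = [1, 4, 8, 16, 32, 64]
CONCURRENCIES = [1, 2, 4, 8, 16, 64]

def matrix_cases(limit: int = 128) -> list[tuple[int, int]]:
    return [
        (batch, conc)
        for batch, conc in product(BATCH_SIZES, CONCURRENCIES)
        if batch * conc <= limit
    ]

# (64, 2) is kept (64 * 2 = 128); (64, 4) is dropped (256 > 128).
print(matrix_cases())
```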
Reproduction command:
```bash
cd /data/saas-search
./.venv-translator/bin/python benchmarks/translation/benchmark_translation_local_models.py \
--suite extended \
--disable-cache \
--serial-items-per-case 256 \
--concurrency-requests-per-case 32 \
--concurrency-batch-size 1 \
--output-dir perf_reports/20260318/translation_local_models
```
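For digging into the raw numbers behind the tables below, a minimal sketch for loading the JSON report linked above; it makes no assumption about the file's internal schema and only prints the top-level structure:

```python
# Load the raw JSON report produced by the benchmark run and show its
# top-level structure; the internal schema is intentionally not assumed here.
import json
from pathlib import Path

report = Path(
    "perf_reports/20260318/translation_local_models/"
    "translation_local_models_extended_221846.json"
)
data = json.loads(report.read_text())

if isinstance(data, dict):
    print("top-level keys:", sorted(data))
elif isinstance(data, list):
    print("list of", len(data), "entries; first entry:", data[0] if data else None)
else:
    print("unexpected top-level type:", type(data).__name__)
```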
## Key Results
### 1. Single-stream batch sweep
| Model | Direction | Best batch | Best items/s | Batch 16 items/s | Batch 16 p95 ms |
|---|---|---:|---:|---:|---:|
| `nllb-200-distilled-600m` | `zh -> en` | `64` | `58.3` | `27.28` | `769.18` |
| `nllb-200-distilled-600m` | `en -> zh` | `64` | `32.64` | `13.52` | `1649.65` |
| `opus-mt-zh-en` | `zh -> en` | `64` | `70.15` | `41.44` | `797.93` |
| `opus-mt-en-zh` | `en -> zh` | `64` | `42.47` | `24.33` | `2098.54` |
Interpretation:
- On raw throughput, all 4 directions peak at `batch_size=64`
- If per-batch latency also needs to stay balanced, `batch_size=16` is the better default candidate for bulk translation (see the selection sketch below)
- Bulk throughput ranking this round: `opus-mt-zh-en` > `nllb zh->en` > `opus-mt-en-zh` > `nllb en->zh`
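A minimal sketch of the selection logic behind the `batch_size=16` recommendation: keep the batch sizes whose per-batch p95 fits a latency budget, then take the highest-throughput survivor. The record shape and the `1000` ms budget are illustrative assumptions; the full per-batch numbers live in the JSON report:

```python
# Pick a "balanced" bulk batch size: highest items/s among the batch sizes
# whose per-batch p95 stays inside a latency budget.  The record layout and
# the 1000 ms budget are illustrative assumptions, not benchmark-script API.
from typing import Iterable, Optional

def pick_bulk_batch_size(records: Iterable[dict], p95_budget_ms: float = 1000.0) -> Optional[int]:
    within_budget = [r for r in records if r["p95_ms"] <= p95_budget_ms]
    if not within_budget:
        return None
    return max(within_budget, key=lambda r: r["items_per_s"])["batch_size"]

# Example with the nllb zh->en batch-16 point from the table above; the other
# batch sizes would come from the full JSON report.
print(pick_bulk_batch_size([{"batch_size": 16, "items_per_s": 27.28, "p95_ms": 769.18}]))
```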
### 2. Single-request concurrency sweep
| Model | Direction | c=1 items/s | c=1 p95 ms | c=8 p95 ms | c=64 p95 ms |
|---|---|---:|---:|---:|---:|
| `nllb-200-distilled-600m` | `zh -> en` | `4.17` | `373.27` | `2383.8` | `7337.3` |
| `nllb-200-distilled-600m` | `en -> zh` | `2.16` | `670.78` | `3971.01` | `14139.03` |
| `opus-mt-zh-en` | `zh -> en` | `9.21` | `179.12` | `1043.06` | `3381.58` |
| `opus-mt-en-zh` | `en -> zh` | `3.6` | `1180.37` | `3632.99` | `7950.41` |
Interpretation:
- With `batch_size=1`, raising client-side concurrency barely improves throughput; it mostly converts queueing time into request latency (see the arithmetic sketch below)
- Online query translation should run at low concurrency; beyond `8` concurrent requests, p95 degrades sharply in all 4 directions
- For the online case the most stable model is `opus-mt-zh-en`; the heaviest is `nllb-200-distilled-600m en->zh`
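A rough consistency check for the first point, using the `opus-mt-zh-en` numbers from the table: at `c=1` the service throughput is `9.21` items/s, i.e. roughly `109` ms per single-item request, so a serial worker holding `c` requests in flight should show latency growing roughly linearly with `c`. A minimal sketch of that arithmetic (the linear model is a simplification and drifts at very high concurrency):

```python
# Naive serial-worker estimate: with one request processed at a time, a client
# keeping c requests in flight waits behind up to c - 1 others, so latency is
# roughly c * service_time.  Inputs are the c=1 throughput and the measured
# p95 values from the table above; the linear model itself is a simplification.
items_per_s_at_c1 = 9.21                       # opus-mt-zh-en, zh -> en, c=1
service_time_ms = 1000.0 / items_per_s_at_c1   # ~108.6 ms per request

for c, measured_p95_ms in [(1, 179.12), (8, 1043.06)]:
    expected_ms = c * service_time_ms
    print(f"c={c}: expected ~{expected_ms:.0f} ms, measured p95 {measured_p95_ms} ms")
```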
### 3. batch x concurrency matrix
Best-throughput combinations (under the `batch_size * concurrency <= 128` constraint):
| Model | Direction | Batch | Concurrency | Items/s | Avg req ms | Req p95 ms |
|---|---|---:|---:|---:|---:|---:|
| `nllb-200-distilled-600m` | `zh -> en` | `64` | `2` | `53.95` | `2344.92` | `3562.04` |
| `nllb-200-distilled-600m` | `en -> zh` | `64` | `1` | `34.97` | `1829.91` | `2039.18` |
| `opus-mt-zh-en` | `zh -> en` | `64` | `1` | `52.44` | `1220.35` | `2508.12` |
| `opus-mt-en-zh` | `en -> zh` | `64` | `1` | `34.94` | `1831.48` | `2473.74` |
Interpretation:
- In the current implementation, throughput is driven mainly by `batch_size`, not by `concurrency`
- At a fixed `batch_size`, raising concurrency from `1` to `2/4/8/...` changes throughput very little but clearly inflates request latency
- This suggests the local translation service currently behaves like a "single worker + serial GPU processing" model; capacity planning cannot count on client-side concurrency to buy throughput (see the sizing sketch below)
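A sizing sketch for the last point: if one GPU worker tops out around the best items/s in the matrix above, horizontal replicas (not client concurrency) are what buys more throughput. The `200` items/s target is an illustrative assumption:

```python
# Capacity sketch: replicas needed for a target offline throughput, given that
# one worker tops out around the best matrix items/s.  The 200 items/s target
# is an illustrative assumption, not a requirement from the report.
import math

peak_items_per_s = 53.95      # nllb zh -> en, batch 64, concurrency 2 (matrix table)
target_items_per_s = 200.0    # hypothetical offline throughput goal

print("replicas needed:", math.ceil(target_items_per_s / peak_items_per_s))  # -> 4
```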
## Recommendation
- For online query translation, go by the `concurrency_sweep` results and treat `batch_size=1` as the primary metric baseline
- For offline bulk translation, go by the `batch_sweep` results: start from `batch_size=16` by default and only move up to `32/64` if the throughput target requires it
- If the current single-worker architecture stays, treat the "allowed concurrency" as a latency-budget question, not a throughput-scaling lever (see the sketch below)
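A minimal sketch of treating concurrency as a latency budget, following the serial-worker arithmetic from section 2; the `500` ms budget is an illustrative assumption:

```python
# How many in-flight requests a serial worker can hold before the naive
# c * service_time estimate blows past a latency budget.  The 500 ms budget
# is an illustrative assumption; the c=1 items/s comes from section 2's table.
import math

def max_concurrency(items_per_s_at_c1: float, latency_budget_ms: float) -> int:
    service_time_ms = 1000.0 / items_per_s_at_c1
    return max(1, math.floor(latency_budget_ms / service_time_ms))

# opus-mt-zh-en (c=1: 9.21 items/s) under a hypothetical 500 ms budget -> 4
print(max_concurrency(9.21, 500.0))
```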